Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Detailed Action
This office action is responsive to the application filed on 26 July 2017.  Claims 1-20 are pending in the application.

Information Disclosure Statements
The information disclosure statements (IDS) submitted on 20 September 2018, 13 November 2018, 07 February 2019, and 11 September 2020 are in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statements are being considered by the Examiner.

Claim Objections
Claims 2, 9, and 16 are objected to because of the following informalities:
Dependent claims 2, 9, and 16 recite “the discriminative mode”, which has no antecedent basis within those claims or within any claim from which those claims depend.  It appears that the claims should instead recite “the discriminative model” (emphasis added), in reference to the recited “a discriminative model” in independent claims 1, 8, and 15.
Appropriate correction is required.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 15-20 are rejected under 35 U.S.C. § 101 because the claimed invention is directed to non-statutory subject matter.  The claims do not fall wholly within at least one of the four categories of patent eligible subject matter because the recitation in independent claim 15 of “[a]t least one machine-readable medium comprising instructions…” encompasses transitory “signals per se”.  Claims 16-20 recite “The machine-readable medium…” and do not cure this deficiency.
It is suggested to amend the claims to recite a “non-transitory” machine-readable medium.  Such amendment would find support in at least ¶¶ [0380] and [0387] of the instant specification.

Claim Rejections - 35 USC § 112
6.	Claims 3-4, 10-11, and 17-18 are rejected under 35 U.S.C. 112(a) as failing to comply with the written description requirement. The claims contain subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor had possession of the claimed invention.

	Claim 3 recites wherein the discriminative model logic is further to inverse the generative model into multiple inverse models, wherein a bidirectional connection is added to connect latent variables having a common parent in each of the multiple inverse models to consolidate the multiple inverse models into a single inverse model.
	Fig. 9A and ¶ [0196] of the instant disclosure depict and describe this inversion, where model 901 is inverted to create inverse models 903 and 905.  The parent nodes H1 and H2 in model 901 become child nodes in models 903 and 905, while child nodes X1, X2, and X3 in model 901 become parent nodes in models 903 and 905.  The description in the specification does not explain how the parent-child relationships are established in the inverse models 903 and 905.  For instance, in model 901, node X2 is a child of only node H2, yet in inverse model 903, node X2 has become a parent of both H1 and H2.  Likewise, in inverse model 905, node X1 has become a parent of both H1 and H2, while in the original model 901, X1 was a child of H1 alone.
	Without express rules for determining parent-child relationships in the created inverse models, one of ordinary skill in the art would not be informed as to how to determine those relationships, and would also be unable to add the claimed “bidirectional connection” to those nodes having a common parent.
	Claim 4 depends directly from claim 3 and inherits its deficiencies as to the lack of written description.  Further, one of ordinary skill in the art would not be informed as to how to perform the claimed “removing the bidirectional connection” without a description of how to determine the parent-child relationships in the inverse models and the subsequent placement of the bidirectional link as described above in reference to claim 3.
	Claims 10-11 and 17-18 recite similar limitations as claims 3-4 and are rejected under 35 U.S.C. § 112(a) under the same rationale as applied to claims 3-4 above.

Claim Rejections - 35 USC § 102
7.	In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
8.	The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


9.	Claims 1-2, 6, 8-9, 13, 15, and 16 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Patel et al. (US 2018/0082172, hereinafter “Patel”).

Regarding claim 1, Patel discloses [a]n apparatus comprising: detection/observation logic, as facilitated by or at least partially incorporated into a processor, (Patel, ¶ [0043] “In some embodiments, a computer system may be configured to include a processor (or a set of processors) and a memory medium, where the memory medium stores program instructions, where the processor is configured to read and execute the program instructions stored in the memory medium, […]”)
to monitor and detect structure learning of neural networks relating to machine learning operations at the apparatus having the processor; (Patel, ¶ [0201] “Thus far, we have talked about estimating the right parameters of fixed architectures. But another problem is actually finding good architectures - ie. structure learning. […] In this section, armed with the DRM, we show how to infer such parameters using the EM algorithm.”; 
Patel, ¶ [0202] “Learning the Number of Filters in a Layer” (corresponds to claimed “structure learning”); Patel, ¶ [0203] “For instance, consider the problem of determining the number of filters in a convolutional layer for a DCN. […] we will focus on the AIC and BIC scoring algorithms (45), which reward a trained model's goodness-of-fit (e.g. log-likelihood) and penalize its complexity (e.g. number of parameters).”) [The system varies the structure and scores the variations, corresponding to the claimed “monitor and detect structure learning of neural networks”];
Patel, ¶ [0060] “It is important to note that our theory and methods apply to a wide range of inference tasks (including, for example, classification, estimation, regression, etc.) that feature a number of task-irrelevant nuisance variables (including, for example, object and speech recognition). However, for concreteness of exposition, we will focus below on the classification problem underlying visual object recognition.” [corresponds to claimed “machine learning operations”]
generative model logic, as facilitated by or at least partially incorporated into the processor, to generate a recursive generative model based on one or more topologies of one or more of the neural networks; (Patel, ¶ [0065] “This section develops the RM, a generative probabilistic model that explicitly captures nuisance transformations as latent variables. We show how inference in the RM corresponds to operations in a single layer of a DCN. [the generative model is based on structure and operations in a Deep Convolutional Neural Network - corresponds to claimed “one or more topologies”] We then extend the RM by defining the DRM, a rendering model with layers representing different scales or levels of abstraction. Finally, we show that, after the application of a discriminative relaxation, inference and learning in the DRM correspond to feedforward propagation and back propagation training in the DCN.”; Patel, ¶ [0224] “In order to derive the update gate (or equivalently the LSTM forget gate), we first need to re-express the generative model recursively in the measurement time. We can do this by noting that [equations 46 and 47]” [The system of Patel is operable to re-express the generative model as a recursive generative model.])
and discriminative model logic, as facilitated by or at least partially incorporated into the processor, to convert the generative model into a discriminative model. (Patel, ¶ [0138] “In summary, starting with a generative classifier with learning objective Lgen(theta), we complete steps (a) through (e) to arrive at a discriminative classifier with learning objective Ldis(N). We refer to this process as a discriminative relaxation of a generative classifier and the resulting classifier is a discriminative counterpart to the generative classifier.”)

Claims 8 and 15 recite similar limitations as claim 1, and are rejected under the same rationale as applied to claim 1 above.

Regarding claim 2, Patel as applied to claim 1 above discloses [t]he apparatus of claim 1.  Further, Patel discloses wherein the generative model is unsupervised and based on unlabeled data, (Patel, ¶ [0263] “Consider a task that is more difficult than One-Shot Learning, clustering unseen images. In this task, we are given a set of images without any category labels [i.e. unlabeled], and we must cluster them into meaningful/useful categories”;
Patel, ¶ [0264] “In order to solve this task, we can use the DRM representation [the generative Deep Rendering Model] from above. First, we can compute the abstract representations a(I) for all unseen images I. Then, we can perform unsupervised clustering on these abstract representations by fitting a traditional clustering model such as a GMM.”)
and wherein the discriminative model is supervised and based on labeled data, (Patel, ¶¶ [0148-49] “As such, the theory presented here makes a clear prediction that for a DCN [a discriminative Deep Convolutional Network], supervised learning of task targets will lead inevitably to unsupervised learning of latent task nuisance variables. From the perspective of manifold learning, this means that the architecture of DCNs is designed to learn and disentangle the intrinsic dimensions of the data manifold. In order to test this prediction, we trained a DCN to classify [a discriminative process] synthetically rendered images of naturalistic objects, such as cars and planes.”)
wherein the discriminative model is learned from the generative model. (Patel, ¶ [0138] “In summary, starting with a generative classifier with learning objective Lgen(theta), we complete steps (a) through (e) to arrive at a discriminative classifier with learning objective Ldis(N). We refer to this process as a discriminative relaxation of a generative classifier and the resulting classifier is a discriminative counterpart to the generative classifier.”)

	Claims 9 and 16 recite similar limitations as claim 2 and are rejected under the same rationale as applied to claim 2 above.

Regarding claim 6, Patel as applied to claim 1 above discloses [t]he apparatus of claim 1.  Patel further discloses structure learning logic, as facilitated by or at least partially incorporated into the processor, to facilitate at least one of an end-to-end structure learning and a sub-network structure learning; (Patel, ¶ [0201] “Thus far, we have talked about estimating the right parameters of fixed architectures. But another problem is actually finding good architectures - ie. structure learning. […] In this section, armed with the DRM, we show how to infer such parameters using the EM algorithm.”; Patel, ¶ [0202] “Learning the Number of Filters in a Layer” (corresponds to claimed “sub-network structure learning”); Patel, ¶ [0203] “For instance, consider the problem of determining the number of filters in a convolutional layer for a DCN. […] we will focus on the AIC and BIC scoring algorithms (45), which reward a trained model's goodness-of-fit (e.g. log-likelihood) and penalize its complexity (e.g. number of parameters).”)
and training and feature logic, as facilitated by or at least partially incorporated into the processor, to facilitate feature bagging or coping with large scale data by training large training sets. (Patel, ¶ [0055] “These so-called deep learning systems share two common hallmarks. First, architecturally, they are constructed from many layers of alternating linear and nonlinear processing units. Second, computationally, their parameters are learned using large-scale algorithms and massive amounts of training data [corresponds to claimed “coping with large scale data by training large training sets”]. Two examples of such architectures are: the deep convolutional neural network (DCN), which has seen great success in tasks like visual object recognition and localization (2), speech recognition (3), and part-of-speech recognition (4); and random decision forests (RDFs) (5) for image segmentation.” [The discriminative models created by Patel (DCN and RDF) are trained using large amounts of data.])

	Claim 13 recites similar limitations as claim 6 and is rejected under the same rationale as applied to claim 6 above.
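
For illustration only, the AIC/BIC scoring referred to in the passages of Patel cited above (rewarding goodness-of-fit and penalizing the number of parameters) can be sketched with the standard definitions AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L. The candidate filter counts, log-likelihoods, parameter counts, and sample count below are hypothetical values chosen for the example and are not taken from Patel; the function names are likewise assumptions.

```python
import math

def aic(log_likelihood, num_params):
    """Akaike information criterion: lower is better."""
    return 2 * num_params - 2 * log_likelihood

def bic(log_likelihood, num_params, num_samples):
    """Bayesian information criterion: lower is better."""
    return num_params * math.log(num_samples) - 2 * log_likelihood

# Hypothetical candidate layer structures (values invented for illustration).
candidates = [
    {"num_filters": 8, "log_likelihood": -1200.0, "num_params": 80},
    {"num_filters": 16, "log_likelihood": -1100.0, "num_params": 160},
    {"num_filters": 32, "log_likelihood": -1090.0, "num_params": 320},
]
num_samples = 1000  # hypothetical training-set size

# Pick the structure with the lowest score under each criterion.
best_aic = min(
    candidates,
    key=lambda c: aic(c["log_likelihood"], c["num_params"]),
)
best_bic = min(
    candidates,
    key=lambda c: bic(c["log_likelihood"], c["num_params"], num_samples),
)

print("AIC selects", best_aic["num_filters"], "filters")
print("BIC selects", best_bic["num_filters"], "filters")
```

With these invented numbers the two criteria disagree, because BIC penalizes each additional parameter more heavily than AIC once the sample count exceeds about eight.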

Claim Rejections - 35 USC § 103
10.	The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

	
11.	Claims 5, 12, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Patel in view of Barrow et al., “Selective Dropout for Deep Neural Networks,” hereinafter “Barrow,” and further in view of Chen et al., “Training Deep Nets with Sublinear Memory Cost,” arXiv:1604.06174v2 22 Apr 2016, hereinafter “Chen.”


Regarding claim 5, Patel as applied to claim 1 above discloses [t]he apparatus of claim 1.  Patel further discloses […] and on-the-fly learning/update logic, as facilitated by or at least partially incorporated into the processor, to perform on-the-fly learning and updating of network topologies of the neural networks based on at least one of currently available data and historically available data relating to the topologies of the neural networks.  (Patel, ¶¶ [0204-5], “Learning the Filter Sizes in a Layer” “We will use AIC criterion to score models with different filter sizes per layer and pick the best one.” [The system is operable to change the topology of the Deep Convolutional Network, using the AIC scores for various different topologies as a criterion.])

Patel does not disclose dropout logic, as facilitated by or at least partially incorporated into the processor, to perform methodological dropout of neurons from one or more of the neural networks, wherein the methodological dropout is performed in accordance with a predictivity based on historical statistical data relating to the neurons.
Barrow teaches dropout logic, as facilitated by or at least partially incorporated into the processor, to perform methodological dropout of neurons from one or more of the neural networks, wherein the methodological dropout is performed in accordance with a predictivity based on historical statistical data relating to the neurons; (Barrow, Abstract “We present 3 new alternative methods for performing dropout on a deep neural network which improves the effectiveness of the dropout method over the same training period. These methods select neurons to be dropped through statistical values [corresponds to claimed “historical statistical data”] calculated using a neurons change in weight, the average size of a neuron’s weights, and the output variance of a neuron. We found that increasing the probability of dropping neurons with smaller values of these statistics and decreasing the probability of those with larger statistics gave an improved result in training over 10,000 epochs. The most effective of these was found to be the Output Variance method, giving an average improvement of 1.17% accuracy over traditional dropout methods.”;
Barrow, § 1 “Introduction,” “We propose a new method of dropout that selectively chooses the best neurons (neurons which will have the biggest positive effect on the network if switched off) to be given a higher probability of being switched off on the assumption that dropout could be made more effective and efficient by not dropping neurons that should be forced to continue to learn.” [Neurons are switched off or left on depending on the magnitude of their effect on the network (corresponds to claimed “predictivity”)])


It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the claimed invention to modify the network training of Patel with the selective dropout of Barrow, the benefit being improved accuracy in the trained network, as cited by Barrow in the Abstract “The most effective of these was found to be the Output Variance method, giving an average improvement of 1.17% accuracy over traditional dropout methods.”
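
For illustration only, the general idea described in the cited Barrow abstract (raising the drop probability of neurons with smaller statistic values and lowering it for neurons with larger values) can be sketched as follows. The rank-based linear mapping, the function names, and the `base_rate` and `spread` parameters are assumptions made for this sketch; they are not asserted to be the method actually used by Barrow.

```python
import random

def drop_probabilities(variances, base_rate=0.5, spread=0.3):
    """Assign each neuron a drop probability from its output variance.

    Assumed mapping: neurons are ranked by variance; the lowest-variance
    neuron gets base_rate + spread, the highest gets base_rate - spread,
    and the rest are interpolated linearly by rank.
    """
    n = len(variances)
    if n <= 1:
        return [base_rate] * n
    order = sorted(range(n), key=lambda i: variances[i])  # ascending variance
    probs = [0.0] * n
    for rank, idx in enumerate(order):
        frac = rank / (n - 1)  # 0.0 for smallest variance, 1.0 for largest
        probs[idx] = base_rate + spread * (1 - 2 * frac)
    return probs

def dropout_mask(variances, rng, base_rate=0.5, spread=0.3):
    """Sample a 0/1 keep mask; 0 means the neuron is dropped this step."""
    probs = drop_probabilities(variances, base_rate, spread)
    return [0 if rng.random() < p else 1 for p in probs]

# Hypothetical per-neuron output variances collected during training.
variances = [0.1, 0.5, 0.9]
probs = drop_probabilities(variances)
mask = dropout_mask(variances, random.Random(0))
print(probs, mask)
```

The mask is sampled per training step, so a low-variance neuron is dropped more often on average but is never removed permanently.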

Patel further does not disclose decomposition logic, as facilitated by or at least partially incorporated into the processor, to generate parallel and sequential execution schedules for memory sharing at sub-network precision levels of the one or more of the neural networks;

Chen teaches decomposition logic, as facilitated by or at least partially incorporated into the processor, to generate parallel and sequential execution schedules for memory sharing at sub-network precision levels of the one or more of the neural networks; (Chen, Figure 1 “A Possible Allocation Plan” showing the various individual layers of a deep neural network during training [corresponds to claimed “sub-network precision levels”] “Memory allocation for each output op, same color indicates shared memory” (Figure 1, legend);
Chen, Figure 2, showing step-by-step memory allocation [corresponds to claimed “execution schedule”] for both shared and unshared memory, including a “Final Memory Plan”)


	Chen is analogous art, as it is in the field of machine learning using deep neural networks.
	It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the claimed invention to incorporate the memory sharing of Chen with the deep neural networks of Patel, the benefit being reduced memory requirements, as cited by Chen in the Abstract “As many of the state-of-the-art models hit the upper bound of the GPU memory, our algorithm allows deeper and more complex models to be explored” and “Our experiments show that we can reduce the memory cost of a 1,000-layer deep residual network from 48G to 7G on ImageNet problems. Similarly, significant memory cost reduction is observed in training complex recurrent neural networks on very long sequences.”

	Claims 12 and 19 recite similar limitations as claim 5 and are rejected under the same rationale as applied to claim 5 above.

12.	Claims 7, 14, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Patel in view of Foley et al., “A Low-Power Integrated x86-64 and Graphics Processor for Mobile Computing Devices,” IEEE Journal of Solid-State Circuits, Vol. 47, No. 1, January 2012, hereinafter “Foley.”

Regarding claim 7, Patel as applied to claim 1 above discloses [t]he apparatus of claim 1.
Patel does not disclose wherein the processor comprises a graphics processor co-located with an application processor on a common semiconductor package.
Foley teaches wherein the processor comprises a graphics processor co-located with an application processor on a common semiconductor package. (Foley, Fig. 4 and § VI “FUSION BASICS”, “The traditional model of a processor chip (with integrated NB) coupled with an integrated graphics processor has a number of shortfalls. The high-speed PHY coupling the two processors (shown in red in Fig. 4) occupies significant area and consumes power. The power associated with the PHY can exceed 1 W during media playback. Additionally, the link may present a bandwidth bottleneck. When the two dies are integrated, a wide (256 bits in each direction) data path from the graphics memory controller to the NB is added, allowing for full access to system memory from the GMC. This path provides GMC clients with a low latency path to non-snooped regions of system memory, reducing the minimum read latency by up to 40%. Compared to two-chip solutions, use of the on-die integrated GPU significantly reduces memory latency, improves request ordering, and reduces area and power.”)

	Foley is analogous art, as it is in the field of semiconductor processors and is directed to the feature of integrating a graphics processor with an application processor on a single chip.
	It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the claimed invention to integrate processors and graphics on the same chip, the benefit being that “[c]ompared to two-chip solutions, use of the on-die integrated GPU significantly reduces memory latency, improves request ordering, and reduces area and power” as cited by Foley in § VI.
	
	Claim 14 recites similar limitations as claim 7 and is rejected under the same rationale as applied to claim 7 above.

	Claim 20 recites similar limitations as claims 6 and 7, and is rejected under the same rationale as applied to claims 6 and 7 above.

Conclusion
13.	Any inquiry concerning this communication or earlier communications from the examiner should be directed to SCOTT R GARDNER whose telephone number is (469) 295-9128.  The examiner can normally be reached on 8:00am - 5:00pm M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an 
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ann J Lo, can be reached on 571-272-9767.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/SCOTT R GARDNER/Examiner, Art Unit 2126   
/ANN J LO/Supervisory Patent Examiner, Art Unit 2126