Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Remarks
This Office Action is responsive to Applicants' Amendment filed on April 21, 2022, in which claims 1 and 9 are amended. Claims 1-20 are currently pending.

Response to Arguments
Applicant’s arguments with respect to the objection to claim 14 have been deemed persuasive. The objection is hereby withdrawn in view of Applicant's amendments and remarks.
The rejections of claims 1, 6, 7, and 8 under 35 U.S.C. § 112(b)/(f) are hereby withdrawn in view of Applicant's amendments and remarks.
The rejections of claims 1-20 under 35 U.S.C. § 101 are hereby withdrawn in view of Applicant's amendments and remarks.
Applicant’s arguments with respect to the rejection of claims 1-20 under 35 U.S.C. 102/103, as amended, have been considered but have not been deemed persuasive.
With respect to Applicant's argument that the evaluation value in Hara is not based on the output, Examiner respectfully disagrees.  Applicant admits on p. 8 of their response that Hara discloses that the evaluation values may be calculated based on "classification accuracy" which is a function of the output, or "distance between the output data from the first neural network, and the desired output of the training data".  Examiner asserts that the evaluation value is a measure of prediction quality.    	
With respect to Applicant's argument that Hara teaches incremental training of multiple neural networks simultaneously, and therefore does not anticipate the instant claims, Examiner respectfully disagrees.  By Applicant's own admission, Hara still teaches the training of a neural network based on an evaluation value.  Examiner asserts that, although the disclosure of Hara goes beyond what is claimed, the cited portions of Hara still teach the claimed limitations.
With respect to Applicant's argument that Hara does not teach an inference mode, Examiner respectfully disagrees.  Hara explicitly teaches using a trained neural network (which would have been trained in a training mode) to classify data (a classification mode interpreted as synonymous with an inference mode) ([¶0045] "At S110, a training section, such as the training section 110, may obtain training data for training neural networks. The training data may include at least one set of input data and desired output data corresponding to the input data. For example, the input data may be image data (e.g., image data of an animal) and the output data may be a classified category of the image data (e.g., a name of the animal shown in the image data).").
With respect to Applicant's argument that Prakash does not fairly teach or suggest providing training data to the neural network via sensors, Examiner respectfully disagrees.  Prakash teaches ([¶0043] "For edge-cloud ML or distributed learning, ML training is performed on a dataset to learn parameters of an underlying model β" Learned parameters interpreted as synonymous with trained operational parameters.  [¶0053] "the training datasets x1-xm may be locally available or accessible at each of the edge compute nodes 101, 201, and the MEC system 200 instructs the edge compute nodes 101, 201 to perform training tasks β1-βd using the locally available/accessible training datasets x1-xm. The locally available/accessible datasets x1-xm may be stored in local storage/memory circuitry of the edge compute nodes 101, 201; may be accessed via a direct link 105 from a remote sensor...In this example, the dataset x3 may include sensor data captured by IoT UEs 101 x." Training dataset interpreted as synonymous with training operational parameters.).  Prakash further teaches that said trained parameters may be used for classification tasks ([¶0022] "Gradient Descent (GD) algorithms and/or its variants are one critical component of many ML algorithms where training is performed on a large amount of data to estimate an underlying ML model. Linear regression is one such method that is used for many use cases including, for example, classification").


Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 9-13 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Hara (US 2017/0228639 A1).

	Regarding claim 9, Hara teaches A method comprising: producing operational parameters of a machine learning tool based on a primary set of training data; ([¶0045] "At S110, a training section, such as the training section 110, may obtain training data for training neural networks. The training data may include at least one set of input data and desired output data corresponding to the input data." [¶0047] "The learning setting is a setting used for training of first neural networks. The first neural network may be a neural network that is generated during the model generation phase. For example, the setting may be a hyper parameter or a set of hyper parameters of neural networks, which does not change after the training.")
	applying input data to the machine learning tool being used in an inference mode to generate an output of the machine learning tool; ([¶0055] " For example, the calculating section may input the input data of the training data into the first neural network with the learning setting and weight data, and obtain the output data from the first neural network.")
	in response to determining a measure of prediction quality of the output of the machine learning tool is below a threshold, initiating incremental training of the operational parameters using the input data as training data for the machine learning tool; and ([¶0075] "At S240...the selecting section may determine whether the evaluation value of at least one of the first neural networks and the second neural networks exceeds a threshold or not. If the decision is positive, then the selecting section may proceed with an operation of S250, and if the decision is negative, then the selecting section may go back to S170 to train a new second neural network with a new and different setting." [¶0069] "At S180, the training section may incompletely train a second neural network with the new setting based on the training data. For example, the training section may iterate updating epochs to update weight data of the second neural network t times" See FIG. 2 for flow chart of process, S170 necessarily leads to S180 in which incremental training occurs.)
	storing updated operational parameters of the machine learning tool, the updated operational parameters being based on the incremental training. ([¶0011] "In addition, it is a sixth aspect of the innovations herein to provide the apparatus of the fifth aspect, wherein the instructions further cause the processor to: update the predictive model based on the plurality of neural networks without the at least one neural network. According to a sixth aspect, the apparatus may further improve accuracy of the predictive model." See FIG. 6 [¶0101] "The CPU 2000 may perform various types of processing, onto the data read from a memory such as the RAM 2020, which includes various types of operations, processing of information, condition judging, search/replace of information, etc., as described in the present embodiment and designated by an instruction sequence of programs, and writes the result back to the memory such as the RAM 2020"). 
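
For illustration only, the sequence of steps recited in claim 9 (inference, a quality check against a threshold, incremental training on the same input, and storage of the updated parameters) can be sketched as below. This is a minimal sketch under stated assumptions and is drawn from neither the claims nor Hara: the logistic model, the confidence-based quality measure, the availability of a label for the input, and every function name are hypothetical.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def infer(weights: List[float], x: List[float]) -> float:
    """Inference mode: return the model's probability for class 1."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))


def prediction_quality(p: float) -> float:
    """Hypothetical quality measure: 0 at the decision boundary, 1 when certain."""
    return abs(2.0 * p - 1.0)


def incremental_update(weights: List[float], x: List[float],
                       label: int, lr: float = 0.5) -> List[float]:
    """One gradient step of logistic loss on a single example."""
    p = infer(weights, x)
    return [w - lr * (p - label) * xi for w, xi in zip(weights, x)]


def process_input(weights: List[float], x: List[float], label: int,
                  threshold: float = 0.6) -> Tuple[float, List[float]]:
    """Run inference; retrain on the same input only if quality is below threshold.

    The label is assumed to be available; how it is obtained is outside this sketch.
    The returned weights stand in for the stored, updated operational parameters.
    """
    p = infer(weights, x)
    if prediction_quality(p) < threshold:
        weights = incremental_update(weights, x, label)
    return p, weights
```

In this sketch a low-confidence output triggers a single update step, while a confident output leaves the stored parameters untouched.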

	Regarding claim 10, Hara teaches The method of claim 9, wherein the machine learning tool is a deep neural network comprising a plurality of hidden layers. ([¶0028] "FIG. 1 shows an apparatus 100 according to an embodiment of the present invention. The apparatus 100 may train neural networks (e.g., Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs))" [¶0081] "FIG. 3 shows an exemplary configuration of a neural network processed by the apparatus 100. The neural network includes a plurality of layers, each of which includes one or more neurons. As shown in FIG. 3, a (l−1)-th layer 302, a l-th layer 304, and a (l+1)-th layer 306" A hidden layer is interpreted as any layer between input and output layers.). 

	Regarding claim 11, Hara teaches The method of claim 10, wherein the measure of prediction quality is a function of an output of a hidden layer from the plurality of hidden layers of the deep neural network. ([¶0055] "Then, the calculating section may calculate the evaluation value based on classification accuracy (e.g., a match rate) of the output data from the first neural network and the desired output data of the training data." See also FIG. 3.  In light of the specification the claimed function of an output of hidden layers is interpreted as simply the neural network output which has been routed through hidden layers. Evaluation value is interpreted as synonymous with prediction quality.). 

	Regarding claim 12, Hara teaches The method of claim 9, wherein initiating the incremental training of the machine learning tool comprises calculating a gradient function of the machine learning tool. ([¶0125] "In the example of FIG. 4, training is performed for a supervised machine learning problem (e.g., a GD algorithm) based on a training dataset that is distributed across edge compute nodes 2101, where each edge compute node 2101 locally compute partial gradients and communicate those partial gradients to the master node" See FIG. 4.  Calculating partial gradient is first step of incremental training.). 

	Regarding claim 13, Hara teaches The method of claim 9, wherein the generated output of the machine learning tool is stored only on a local device performing the incremental training and is not transmitted to a server computer. ([¶0108] " Computer readable program instructions for carrying out operations of the present invention may ...execute entirely on the users computer, partly on the users computer, as a stand-alone software package, partly on the users computer and partly on a remote computer or entirely on the remote computer or server." Local device is interpreted as being synonymous with users computer.  Present invention in the prior art refers to a machine learning tool.). 

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1 and 2 are rejected under 35 U.S.C. 103 as being unpatentable over Prakash (US 2019/0138934 A1) in view of Hara.

	Regarding claim 1, Prakash teaches A computing system comprising: a processor in communication with an input sensor and a computer-readable medium storing trained operational parameters of a neural network model, the processor configured to: ([¶0095] "FIG. 2 depicts an example distributed machine learning procedure 200 according to a first embodiment...In some embodiments, the data collector nodes 2102 may also provide their operational parameters to the master node 2112 in a same or similar manner as the edge compute nodes 2101...at operation 206, one or more data collector nodes 2102 and edge compute nodes 2101 provide data to the master node 2112, which may include raw sensor data or other suitable types of data ")
	apply input data collected by the input sensor to the neural network model to generate a classification of the input data based on the trained operational parameters; ([¶0004] "Linear regression is one type of supervised ML algorithm that is used for classification...Gradient descent (GD) algorithms are often used in linear regression." [¶0095] "at operation 206, one or more data collector nodes 2102 and edge compute nodes 2101 provide data to the master node 2112, which may include raw sensor data or other suitable types of data" See FIG. 2 Classification is a well known aspect of machine learning as is supported by Prakash [¶0004].).
	However, Prakash does not explicitly teach measure a prediction quality of the classification of the input data; 
	determine whether the prediction quality is below a threshold quality level; in response to determining the prediction quality is below the threshold quality level, initiate incremental training of the neural network model using the input data as training data for the neural network model, wherein an output of the incremental training is updated operational parameters of the neural network model; and 
	store the updated operational parameters of the neural network model on the computer-readable medium so that the neural network model operates according to the updated operational parameters.  

Hara, in the same field of endeavor, teaches measure a prediction quality of the classification of the input data; ([¶0055] "Then, the calculating section may calculate the evaluation value based on classification accuracy (e.g., a match rate) of the output data from the first neural network and the desired output data of the training data." Evaluation value is interpreted as synonymous with prediction quality.)
	determine whether the prediction quality is below a threshold quality level; in response to determining the prediction quality is below the threshold quality level, initiate incremental training of the neural network model using the input data as training data for the neural network model, wherein an output of the incremental training is updated operational parameters of the neural network model; and ([¶0075] "At S240...the selecting section may determine whether the evaluation value of at least one of the first neural networks and the second neural networks exceeds a threshold or not. If the decision is positive, then the selecting section may proceed with an operation of S250, and if the decision is negative, then the selecting section may go back to S170 to train a new second neural network with a new and different setting." [¶0069] "At S180, the training section may incompletely train a second neural network with the new setting based on the training data. For example, the training section may iterate updating epochs to update weight data of the second neural network t times" See FIG. 2 for flow chart of process, S170 necessarily leads to S180 in which incremental training occurs.)
	store the updated operational parameters of the neural network model on the computer-readable medium so that the neural network model operates according to the updated operational parameters. ([¶0011] "In addition, it is a sixth aspect of the innovations herein to provide the apparatus of the fifth aspect, wherein the instructions further cause the processor to: update the predictive model based on the plurality of neural networks without the at least one neural network. According to a sixth aspect, the apparatus may further improve accuracy of the predictive model." See FIG. 6 [¶0101] "The CPU 2000 may perform various types of processing, onto the data read from a memory such as the RAM 2020, which includes various types of operations, processing of information, condition judging, search/replace of information, etc., as described in the present embodiment and designated by an instruction sequence of programs, and writes the result back to the memory such as the RAM 2020"). 

	 Prakash and Hara are both directed towards optimizing a machine learning model, with special emphasis on neural networks.  Therefore, Prakash and Hara are analogous art in the same field of endeavor.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the teachings of Prakash and Hara by using model accuracy as a measure of convergence. The combination would have been obvious because a person of ordinary skill in the art would be able to determine from Hara that the method ([¶0042] “may improve prediction accuracy of the predictive model, and thereby may efficiently determine an optimized setting of the neural network by terminating at least part of the training of the neural networks by predicting the performance from the tentative weight data”).

	Regarding claim 2, the combination of Prakash and Hara teaches The computing system of claim 1, wherein initiating the incremental training of the neural network model comprises determining a privacy setting of the computing system. (Prakash [¶0044] "The training process (model) β may have a set of requirements (e.g., latency, processing resources, storage resources, network resources, location, network capability, security conditions, etc.) that need to be fulfilled by individual edge nodes" Security condition is interpreted as synonymous with privacy setting.  An edge node may be a computing system.).

	Claims 14-19 are rejected under 35 U.S.C. 103 as being unpatentable over Hara in view of Prakash.

	Regarding claim 14, Hara teaches A method of training operational parameters of a machine learning tool, the method comprising: training the operational parameters of the machine learning tool based on an initial set of training data; ([¶0045] "At S110, a training section, such as the training section 110, may obtain training data for training neural networks. The training data may include at least one set of input data and desired output data corresponding to the input data." [¶0047] "The learning setting is a setting used for training of first neural networks. The first neural network may be a neural network that is generated during the model generation phase. For example, the setting may be a hyper parameter or a set of hyper parameters of neural networks, which does not change after the training.").
	However, Hara does not explicitly teach transmitting the operational parameters of the machine learning tool to an edge device via an interconnection network; 
	receiving additional training data from the edge device, the additional training data selected based on a measure of quality applied to an output of the machine learning tool executing at the edge device; 
	performing incremental training of the operational parameters using the additional training data received from the edge device to generate updated operational parameters; and 
	transmitting the updated operational parameters to the edge device.  

Prakash teaches transmitting the operational parameters of the machine learning tool to an edge device via an interconnection network; ([¶0055] "The UEs 101 may also be referred to as “edge devices,” “compute nodes,” “edge compute nodes,” and/or the like" [¶0277] "wherein a number of data points included in each training data partition is based on operational parameters of the corresponding heterogeneous compute nodes...the communication circuitry arranged to transmit each training data partition to the corresponding heterogeneous compute nodes")
	receiving additional training data from the edge device, the additional training data selected based on a measure of quality applied to an output of the machine learning tool executing at the edge device; ([¶0099] "Gradient descent (GD) is an optimization algorithm used to minimize a target function by iteratively moving in the direction of a steepest descent as defined by a negative of the gradient...The cost function indicates how accurate the model β is at making predictions for a given set of parameters. The cost function has a corresponding curve and corresponding gradients, where the slope of the cost function curve indicates how the parameters should be changed to make the model β more accurate. In other words, the model β is used to make predictions, and the cost function is used to update the parameters for the model β." [¶0277] " and communication circuitry communicatively coupled with the processor circuitry, the communication circuitry arranged to transmit each training data partition to the corresponding heterogeneous compute nodes, and receive computed partial gradients from a subset of the corresponding heterogeneous compute nodes, and wherein the processor circuitry is arranged to determine updated ML training parameters based on an aggregation of the received computed partial gradients" The explanation of the gradient descent cost function in Prakash teaches selecting training data based on a measure of quality.  In light of the specification the claimed additional training data from edge device interpreted as synonymous with ML training parameters sent from edge device as partial gradients and converted to training parameters in processor.  The computed partial gradients in ¶0277 refer to the computation process in ¶0099.)
	performing incremental training of the operational parameters using the additional training data received from the edge device to generate updated operational parameters; and ([¶0039] " In each epoch, partial gradients received from the edge compute nodes and partial gradients computed from (en)coded training data available at the master node are aggregated or combined such that little to no decoding complexity is incurred. The master node combines the partial gradients obtained from the encoded data points with the partial gradients obtained from the uncoded data points iteratively until the underlying ML model converges." [¶0089] "The processor circuitry is arranged to determine updated ML training parameters (e.g., a full or complete gradient) based on an aggregation of the received computed partial gradients" Iterative convergence of the machine learning model is interpreted as synonymous with incremental training.)
	transmitting the updated operational parameters to the edge device. ([¶0024] "Each compute node communicates their partial gradient back to a master node after the partial gradients are computed. The master node computes a full or complete gradient by combining all of the partial gradients received from all worker compute nodes. The master compute node updates the reference model, and then communicates the updated reference model to all worker compute node for the next epoch."). 
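
For illustration only, the master/worker exchange described in the cited passages of Prakash (each node computes a partial gradient on its local data, the master node aggregates the partial gradients into a full gradient, updates the model, and sends the update back for the next epoch) can be sketched as below. This is a minimal sketch under stated assumptions: it uses plain linear regression with squared error, omits the coded and encoded data handling that Prakash describes, and simulates all nodes in one process. The function names and data layout are hypothetical.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[List[float], float]  # (features, target)


def partial_gradient(beta: List[float], shard: Sequence[Sample]) -> List[float]:
    """Computed at one edge node: summed squared-error gradient over its local shard."""
    grad = [0.0] * len(beta)
    for x, y in shard:
        err = sum(b * xi for b, xi in zip(beta, x)) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj
    return grad


def master_epoch(beta: List[float], shards: Sequence[Sequence[Sample]],
                 lr: float) -> List[float]:
    """One epoch at the master node: aggregate partial gradients, update the model."""
    partials = [partial_gradient(beta, shard) for shard in shards]
    n = sum(len(shard) for shard in shards)
    full = [sum(p[j] for p in partials) / n for j in range(len(beta))]
    # The returned model is what would be sent back to the nodes for the next epoch.
    return [b - lr * g for b, g in zip(beta, full)]


def train(beta: List[float], shards: Sequence[Sequence[Sample]],
          lr: float = 0.1, epochs: int = 200) -> List[float]:
    """Repeat epochs; a fixed epoch count stands in for a convergence test."""
    for _ in range(epochs):
        beta = master_epoch(beta, shards, lr)
    return beta
```

Only gradients, not the raw samples, cross the node boundary in this sketch, which is the property the cited passages rely on.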

	Prakash and Hara are both directed towards optimizing a machine learning model, with special emphasis on neural networks.  Therefore, Prakash and Hara are analogous art in the same field of endeavor.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the teachings of Prakash and Hara by using model accuracy as a measure of convergence. The combination would have been obvious because a person of ordinary skill in the art would be able to determine from Hara that the method ([¶0042] “may improve prediction accuracy of the predictive model, and thereby may efficiently determine an optimized setting of the neural network by terminating at least part of the training of the neural networks by predicting the performance from the tentative weight data”).

	Regarding claim 15, the combination of Hara and Prakash teaches The method of claim 14, wherein the machine learning tool uses a deep neural network comprising a plurality of hidden layers, and the operational parameters include weights of edges of the deep neural network. (Hara [¶0028] "FIG. 1 shows an apparatus 100 according to an embodiment of the present invention. The apparatus 100 may train neural networks (e.g., Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs)) generate a predictive model to predict a performance of eventual neural networks based on weight data of neural networks" [¶0081] "FIG. 3 shows an exemplary configuration of a neural network processed by the apparatus 100. The neural network includes a plurality of layers, each of which includes one or more neurons. As shown in FIG. 3, a (l−1)-th layer 302, a l-th layer 304, and a (l+1)-th layer 306" A hidden layer is interpreted as any layer between input and output layers. Weights of edges of the DNN are interpreted as weights between nodes, which are well-known in the art, and not as weights of edge devices.).

	Regarding claim 16, the combination of Hara and Prakash teaches The method of claim 14, wherein the additional training data is input data of the machine learning tool, the input data collected at the edge device. (Prakash [¶0095] "FIG. 2 depicts an example distributed machine learning procedure 200 according to a first embodiment...In some embodiments, the data collector nodes 2102 may also provide their operational parameters to the master node 2112 in a same or similar manner as the edge compute nodes 2101...at operation 206, one or more data collector nodes 2102 and edge compute nodes 2101 provide data to the master node 2112, which may include raw sensor data or other suitable types of data " Operational parameters from data collector nodes interpreted as synonymous with additional training input data.). 

	Regarding claim 17, the combination of Hara and Prakash teaches The method of claim 14, wherein the additional training data is a gradient of the machine learning tool calculated by back-propagating an output of the machine learning tool, the output generated using input data collected at the edge device. (Prakash [¶0099] "Gradient descent (GD) is an optimization algorithm used to minimize a target function by iteratively moving in the direction of a steepest descent as defined by a negative of the gradient...The cost function indicates how accurate the model β is at making predictions for a given set of parameters. The cost function has a corresponding curve and corresponding gradients, where the slope of the cost function curve indicates how the parameters should be changed to make the model β more accurate. In other words, the model β is used to make predictions, and the cost function is used to update the parameters for the model β." [¶0277] " and communication circuitry communicatively coupled with the processor circuitry, the communication circuitry arranged to transmit each training data partition to the corresponding heterogeneous compute nodes, and receive computed partial gradients from a subset of the corresponding heterogeneous compute nodes, and wherein the processor circuitry is arranged to determine updated ML training parameters based on an aggregation of the received computed partial gradients" The gradient descent method taught by Prakash makes explicit use of a cost function, also commonly known as a loss function, whose gradient is conventionally computed by backpropagation in machine learning.).

	Regarding claim 18, the combination of Hara and Prakash teaches The method of claim 14, further comprising evaluating a trust-level of the edge device, and wherein the additional training data is weighted based on the trust-level of the edge device when the incremental training is performed. (Prakash [¶0107] "In this example, the edge compute nodes 2101 provide a link/channel quality indication for their respective links 103, 107 (e.g., given as (rk, pk), where k is a number from 1 to n), and a processing capabilities indication" See 209 in FIG. 2.  Link/channel quality is interpreted as synonymous with edge device trust-level.  Prakash teaches in FIG. 2 that link quality is determined at initialization of incremental training.).

	Regarding claim 19, the combination of Hara and Prakash teaches The method of claim 14, wherein performing incremental training comprises using both a subset of the initial set of training data and the additional training data as inputs to the machine learning tool during a training mode of the machine learning tool. (Prakash [¶0113] "According to various embodiments, the master node 2112 decodes the partial gradients by aggregating the partial gradients. At each epoch, the master node receives partial gradients corresponding to the subsets of the coded dataset assigned to the working compute nodes 2101" [¶0122] " In some embodiments, the data collector nodes 2102 may also provide their operational parameters to the master node 2112 in a same or similar manner as the edge compute nodes 2101. Meanwhile, at operation 406, one or more data collector nodes 2102 and/or edge compute nodes 2101 provide data parameters to the master node 2112, which may include information about the particular type of data locally accessible by the edge compute nodes 2101 and data collectors 2102"). 

Claims 3-5 are rejected under 35 U.S.C. 103 as being unpatentable over Prakash in view of Hara, and further in view of Kiraly (US 2018/0129900 A1).

	Regarding claim 3, the combination of Prakash and Hara teaches The computing system of claim 2, wherein the initiated incremental training of the neural network model comprises calculating a gradient function of the neural network model. (Prakash [¶0037] "The disclosed embodiments enable coding on distributed datasets while ensuring user privacy of training data provided by each edge compute node. Embodiments herein provide coding mechanisms for federated learning based GD algorithms trained from decentralized data available at a plurality of edge compute nodes." [¶0038] "In the second embodiments, the individual edge compute nodes encode locally available data for ML training. At each epoch, the edge compute nodes locally compute partial gradients from local uncoded training data." GD or gradient descent algorithm is interpreted as gradient function of the neural network model.  In light of the specification the claimed gradient function is a method of aggregating training across edge devices in order to protect privacy as is explicitly taught by Prakash.).
	However, the combination of Prakash and Hara does not explicitly teach that the privacy setting is determined to be private.

Kiraly, in the same field of endeavor, teaches the privacy setting is determined to be private ([¶0044] "The choice of n decides what portion of the network may be retrained versus the amount of abstraction of the original data. Setting n=0 results in a standard cloud-based application where the bare data is sent and may be used for any purpose without a mechanism to guarantee off-label use to the user. Higher values of n result in more abstract data and more “frozen” layers (1−n) that cannot be retrained without re-deployment or updates to the local machine." Setting n>0 is interpreted as privacy setting of private.). 

	The combination of Prakash and Hara as well as Kiraly are all directed towards optimizing distributed machine learning systems with an emphasis on neural networks.  Therefore, Prakash, Hara, and Kiraly are analogous art in the same field of endeavor.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the teachings of Prakash and Hara with the teachings of Kiraly by using a privacy setting in a distributed neural network. Kiraly teaches as a motivation for combination ([¶0066] “Personal or other photographs may benefit from advanced algorithms to identify and label people and/or locate scenes. However, privacy is a concern if the images are loaded into the cloud for processing, such as not wanting a person to be identified in a way that may be stolen by others.”).

	Regarding claim 4, the combination of Prakash, Hara, and Kiraly teaches The computing system of claim 3, wherein the initiated incremental training of the neural network model comprises transmitting an output of gradient function to a server computer so that the server computer performs the incremental training of the neural network model based on the output of the gradient function. (Prakash [¶0039] " In each epoch, partial gradients received from the edge compute nodes and partial gradients computed from (en)coded training data available at the master node are aggregated or combined such that little to no decoding complexity is incurred. The master node combines the partial gradients obtained from the encoded data points with the partial gradients obtained from the uncoded data points iteratively until the underlying ML model converges." Iterative convergence of the machine learning model is interpreted as synonymous with incremental training.). 
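
For illustration only, the arrangement described in the quoted passages (edge nodes computing partial gradients locally and a master node combining them iteratively until convergence) can be sketched as follows. This is a minimal sketch and not the implementation of Prakash or of the instant application: the linear model, the squared-error loss, the function names, and the learning rate are all assumptions, and the coded-data partial gradients that Prakash computes at the master node are omitted.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], float]  # (features, target); hypothetical data layout


def partial_gradient(weights: Sequence[float], samples: Sequence[Sample]) -> List[float]:
    """Sum of squared-error gradients over one node's local samples (linear model assumed)."""
    grad = [0.0] * len(weights)
    for features, target in samples:
        error = sum(w * x for w, x in zip(weights, features)) - target
        for i, x in enumerate(features):
            grad[i] += error * x
    return grad


def aggregate_and_step(
    weights: Sequence[float],
    partial_grads: Sequence[Sequence[float]],
    total_samples: int,
    learning_rate: float,
) -> List[float]:
    """Master-side step: sum the partial gradients and take one gradient-descent update."""
    summed = [sum(column) for column in zip(*partial_grads)]
    return [w - learning_rate * g / total_samples for w, g in zip(weights, summed)]


def train(
    node_data: Sequence[Sequence[Sample]],
    dim: int,
    learning_rate: float = 0.1,
    epochs: int = 200,
) -> List[float]:
    """Repeat the compute-then-aggregate cycle for a fixed number of epochs."""
    weights = [0.0] * dim
    total = sum(len(samples) for samples in node_data)
    for _ in range(epochs):
        # Each edge node only ever shares its gradient, never its raw samples.
        grads = [partial_gradient(weights, samples) for samples in node_data]
        weights = aggregate_and_step(weights, grads, total, learning_rate)
    return weights
```

In this sketch only gradient vectors leave each node, which corresponds to the claimed transmission of the output of the gradient function to the server computer.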

	Regarding claim 5, the combination of Prakash and Hara teaches The computing system of claim 2.
	However, the combination of Prakash and Hara does not explicitly teach that the privacy setting is determined to be public, and the initiated incremental training of the neural network model comprises transmitting the received input data to a server computer.  

Kiraly, in the same field of endeavor, teaches the privacy setting is determined to be public, and the initiated incremental training of the neural network model comprises transmitting the received input data to a server computer. ([¶0044] "The choice of n decides what portion of the network may be retrained versus the amount of abstraction of the original data. Setting n=0 results in a standard cloud-based application where the bare data is sent and may be used for any purpose without a mechanism to guarantee off-label use to the user." Setting n=0 is interpreted as a privacy setting of public, because the bare input data is sent to the cloud.). 

	The combination of Prakash and Hara as well as Kiraly are all directed towards optimizing distributed machine learning systems with an emphasis on neural networks.  Therefore, Prakash, Hara, and Kiraly are analogous art in the same field of endeavor.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the teachings of Prakash and Hara with the teachings of Kiraly by using a privacy setting in a distributed neural network. Kiraly teaches as a motivation for combination ([¶0066] “Personal or other photographs may benefit from advanced algorithms to identify and label people and/or locate scenes. However, privacy is a concern if the images are loaded into the cloud for processing, such as not wanting a person to be identified in a way that may be stolen by others.”).
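
For illustration only, the distinction drawn between claims 3-4 (private setting, gradient output transmitted) and claim 5 (public setting, received input data transmitted) can be sketched as follows. The names `PrivacySetting`, `Upload`, `setting_from_n`, and `build_upload` are hypothetical and appear in none of the cited references; the mapping from Kiraly's parameter n to a setting follows the interpretation stated in this Office Action (n=0 public, n>0 private).

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Sequence


class PrivacySetting(Enum):
    PRIVATE = "private"
    PUBLIC = "public"


@dataclass
class Upload:
    kind: str             # "gradient" or "input_data"
    payload: List[float]  # what would be sent to the server computer


def setting_from_n(n: int) -> PrivacySetting:
    """Map Kiraly's n onto a privacy setting, per the interpretation above (an assumption)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return PrivacySetting.PUBLIC if n == 0 else PrivacySetting.PRIVATE


def build_upload(
    setting: PrivacySetting,
    input_data: Sequence[float],
    local_gradient: Sequence[float],
) -> Upload:
    """Choose what leaves the device: raw input when public, only the gradient when private."""
    if setting is PrivacySetting.PRIVATE:
        return Upload("gradient", list(local_gradient))
    return Upload("input_data", list(input_data))
```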

Claim 6 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Prakash and Hara, and in further view of Surazhsky (US-20180268255-A1).
	Regarding claim 6, the combination of Prakash and Hara teaches The computing system of claim 1.
	However, the combination of Prakash and Hara does not explicitly teach the processor is further configured to determine the prediction quality is below the threshold quality level by determining the input data was misclassified.  

Surazhsky, in the same field of endeavor, teaches The computing system of claim 1, wherein the processor is further configured to determine the prediction quality is below the threshold quality level by determining the input data was misclassified. ([¶0049] "The training engine 110 may determine, based at least on a result of processing the one or more additional synthetic images, whether a performance of the machine learning model meets a threshold value (413). For example, the training engine 110 (e.g., the performance auditor 214) may determine whether the convolutional neural network is able to correctly classify a threshold quantity (e.g., number and/or percentage) of images in a training set and/or a validation set, which may include synthetic images having changed perspectives. Alternately and/or additionally, the training engine 110 (e.g., the performance auditor 214) may determine whether a quantity (e.g., number and/or percentage) of misclassified images is below a threshold value."). 

	 Prakash, Hara, and Surazhsky are all directed towards optimizing a machine learning model with an emphasis on training neural networks.  Therefore, Prakash, Hara, and Surazhsky are all analogous art in the same field of endeavor.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the teachings of Prakash and Hara with the teachings of Surazhsky by using misclassification as an indication of poor prediction quality. The combination would have been obvious because a person of ordinary skill in the art would be able to determine from Surazhsky that this level of performance auditing will allow the system to detect types of failures [¶0040].
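
For illustration only, the threshold test described in the quoted passage of Surazhsky (comparing a quantity of misclassified inputs against a threshold value) can be sketched as follows. The function names and the choice of a fractional error rate are assumptions; the quoted passage also permits an absolute count.

```python
from typing import Hashable, Sequence


def misclassification_rate(predicted: Sequence[Hashable], expected: Sequence[Hashable]) -> float:
    """Fraction of inputs whose predicted label differs from the expected label."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not predicted:
        raise ValueError("at least one prediction is required")
    wrong = sum(1 for p, e in zip(predicted, expected) if p != e)
    return wrong / len(predicted)


def quality_below_threshold(
    predicted: Sequence[Hashable],
    expected: Sequence[Hashable],
    max_error_rate: float,
) -> bool:
    """True when the misclassified fraction exceeds the allowed rate, i.e. quality is too low."""
    return misclassification_rate(predicted, expected) > max_error_rate
```

Under this sketch, one misclassified input out of four yields a rate of 0.25, which would be treated as below the quality threshold for any allowed rate under 0.25.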

	Claims 7 and 8 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Prakash and Hara, and in further view of Yu (US-20170127016-A1).

	Regarding claim 7, the combination of Prakash and Hara teaches The computing system of claim 1.
	However, the combination of Prakash and Hara does not explicitly teach, wherein neurons of a last layer of the neural network model use a soft-max activation function, and the processor is further configured to determine the prediction quality is below the threshold quality level by determining a perplexity function based on outputs of the last layer of the neural network model and a one-hot vector.  

Yu, in the same field of endeavor, teaches The computing system of claim 1, wherein neurons of a last layer of the neural network model use a soft-max activation function, and the processor is further configured to determine the prediction quality is below the threshold quality level by determining a perplexity function based on outputs of the last layer of the neural network model and a one-hot vector. ([¶0053] " In embodiments, the last state 262 of the recurrent layer I 222 may be taken as a compact representation for the sentence" See FIG. 5. [¶0058] "The cost of generating that training word may then be defined as the negative logarithm of the likelihood. The cost of generating the whole paragraph s1:N (N is the number of sentences in the paragraph) may further be defined as...The above cost is in fact the perplexity" The sentence in Yu represents the last state of the output layer of the neural network which is interpreted as the last layer.  In light of the specification, the claimed perplexity is interpreted as an inverse metric of prediction quality or accuracy.  Because of this relationship and because Hara explicitly teaches determining if the prediction quality is below a threshold quality, using the perplexity taught in Yu to determine the threshold quality would yield a well-known and expected outcome.). 

Prakash, Hara, and Yu are all directed towards optimizing machine learning systems with an emphasis on neural networks.  Therefore, Prakash, Hara, and Yu are analogous art in the same field of endeavor. It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the teachings of Prakash and Hara with the teachings of Yu by using a softmax layer, one-hot vector, and perplexity function in a neural network. The combination would have been obvious because a person of ordinary skill in the art would be able to determine from Yu that these methods are well-known in automated video-captioning, and that ([¶0025] “inspiring results have been achieved by a recent line of video-captioning work, which benefits from the rapid development of deep neural networks, especially Recurrent Neural Network (RNN).”).
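
For illustration only, the relationship relied upon above (perplexity as an inverse metric of prediction quality, computed from soft-max outputs of the last layer and a one-hot vector) can be sketched as follows. This is a minimal sketch using the common textbook definition, in which perplexity is the exponential of the mean negative log-likelihood; Yu defines its cost directly as the negative log-likelihood, so the exponentiation here is an assumption, as are all function names.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable soft-max over the last layer's raw outputs."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def negative_log_likelihood(probs: Sequence[float], one_hot: Sequence[float]) -> float:
    """Cross-entropy between the soft-max output and a one-hot target vector."""
    return -sum(t * math.log(p) for p, t in zip(probs, one_hot) if t > 0)


def perplexity(batch_probs: Sequence[Sequence[float]], batch_one_hot: Sequence[Sequence[float]]) -> float:
    """Exponential of the mean negative log-likelihood; lower values mean better predictions."""
    losses = [negative_log_likelihood(p, t) for p, t in zip(batch_probs, batch_one_hot)]
    return math.exp(sum(losses) / len(losses))
```

A uniform prediction over four classes gives a perplexity of about 4, while a confident correct prediction approaches 1, so a perplexity above a chosen limit would correspond to prediction quality below the threshold quality level.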

	Regarding claim 8, the combination of Prakash and Hara teaches The computing system of claim 1.
	However, the combination of Prakash and Hara does not explicitly teach the processor is further configured to determine the prediction quality is below the threshold quality level by determining a perplexity function based on outputs of a mixture of layers of the neural network model.  

Yu, in the same field of endeavor, teaches The computing system of claim 1, wherein the processor is further configured to determine the prediction quality is below the threshold quality level by determining a perplexity function based on outputs of a mixture of layers of the neural network model. ([¶0053] " In embodiments, the last state 262 of the recurrent layer I 222 may be taken as a compact representation for the sentence" See FIG. 5. [¶0058] "The cost of generating that training word may then be defined as the negative logarithm of the likelihood. The cost of generating the whole paragraph s1:N (N is the number of sentences in the paragraph) may further be defined as...The above cost is in fact the perplexity" The sentence in Yu represents the last state of the output layer of the neural network which is interpreted as the last layer.  In light of the specification, the claimed perplexity is interpreted as an inverse metric of prediction quality or accuracy.  Because of this relationship and because Hara explicitly teaches determining if the prediction quality is below a threshold quality, using the perplexity taught in Yu to determine the threshold quality would yield a well-known and expected outcome.  Furthermore, the final or output layer of a neural network is a representation of the outputs of a mixture of the layers of the neural network, therefore determining a perplexity function based on the final layer is synonymous with determining a perplexity function based on a mixture of layers.). 

Prakash, Hara, and Yu are all directed towards optimizing machine learning systems with an emphasis on neural networks.  Therefore, Prakash, Hara, and Yu are analogous art in the same field of endeavor. It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the teachings of Prakash and Hara with the teachings of Yu by using a softmax layer, one-hot vector, and perplexity function in a neural network. The combination would have been obvious because a person of ordinary skill in the art would be able to determine from Yu that these methods are well-known in automated video-captioning, and that ([¶0025] “inspiring results have been achieved by a recent line of video-captioning work, which benefits from the rapid development of deep neural networks, especially Recurrent Neural Network (RNN).”).


	Claim 20 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Hara and Prakash, and in further view of Gupta (US-20190008461-A1).

	Regarding claim 20, the combination of Hara and Prakash teaches The method of claim 14.
	However, the combination of Hara and Prakash does not explicitly teach the incremental training is delayed until a threshold amount of additional training data is received.  

Gupta, in the same field of endeavor, teaches The method of claim 14, wherein the incremental training is delayed until a threshold amount of additional training data is received. ([¶0160] "In another implementation, the “training mode” is implemented until a pre-determined number of data points (i.e., a pre-determined amount of data having corresponding time points) are collected. In this implementation, the threshold for data collection is the pre-determined number of data points; and therefore, the determination of whether the threshold for data collection has been met includes determining if a number of collected data points is greater than the pre-determined number of data points... the data collection threshold may be dynamically specified, as opposed to a predetermined value which, while conservative, may cause undue delay and utilization of sensor processing resources (and hence power consumption)."). 

	Hara, Prakash, and Gupta are all directed towards optimizing machine learning models, with an emphasis on neural networks.  Therefore, Hara, Prakash, and Gupta are analogous art in the same field of endeavor.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the teachings of Hara and Prakash with the teachings of Gupta by waiting for a threshold amount of training data before beginning training in a machine learning system. The combination would have been obvious because a person of ordinary skill in the art would be able to determine from Gupta ([¶0160] “It will be appreciated that other threshold criteria and/or a combination of threshold criteria may be utilized to determine whether the sensor system has collected sufficient “training mode” data.”).
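
For illustration only, delaying incremental training until a threshold amount of additional training data is received can be sketched as follows. The class name `DeferredTrainer`, its fixed count-based threshold, and the callback interface are hypothetical and are not taken from Gupta, which also contemplates dynamically specified thresholds.

```python
from typing import Any, Callable, List


class DeferredTrainer:
    """Buffers incoming samples and triggers training only once a count threshold is met."""

    def __init__(self, threshold: int, train_fn: Callable[[List[Any]], None]) -> None:
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self._threshold = threshold
        self._train_fn = train_fn
        self._buffer: List[Any] = []

    @property
    def pending(self) -> int:
        """Number of samples collected since the last training run."""
        return len(self._buffer)

    def add(self, sample: Any) -> bool:
        """Store one sample; return True if this sample caused a training run."""
        self._buffer.append(sample)
        if len(self._buffer) < self._threshold:
            return False  # not enough data yet, so training stays delayed
        batch, self._buffer = self._buffer, []
        self._train_fn(batch)
        return True
```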

Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SIDNEY VINCENT BOSTWICK whose telephone number is (571)272-4720. The examiner can normally be reached M-F 7:30am-5:00pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang can be reached on (571)270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/SB/Examiner, Art Unit 2124
/LUIS A SITIRICHE/Primary Examiner, Art Unit 2126