DETAILED ACTION
This action is in response to the claims filed 02/03/2022 for application 16/325,348. Claims 1, 2, 5 and 6 have been amended. Claims 1-6 are currently pending. Applicant’s arguments regarding the 102 rejection and the prior art of Misra are persuasive; therefore, the previous rejection has been withdrawn. This action is made NON-FINAL.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Regarding claim 1,
Step 1: The claim recites a machine, one of the four categories of eligible subject matter.
Step 2A Prong 1: The claim recites the following limitations:
calculating, for each of the plurality of tasks, a corresponding batch size which meets a condition …
sampling, for each of the plurality of tasks, samples from the corresponding learning data …
updating a corresponding weight of a discriminator for each of the tasks, using the samples sampled …
These limitations each recite a mental process of deciding, which can reasonably be performed in the mind with the aid of pen and paper. Accordingly, the claim recites an abstract idea.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim recites the following additional elements:
a processor
a memory
Merely using a computer as a tool to perform an abstract idea, as discussed in MPEP § 2106.05(f), does not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. 
The claim further recites the following additional element:
	a single neural network
Merely using a neural network as a tool to perform an abstract idea, as discussed in MPEP § 2106.05(h), does not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea.
	performs multi-task training of the neural network using stochastic gradient descent by performing…
	accepting, as input, corresponding learning data having a corresponding data size for a respective plurality of tasks
Adding insignificant extra-solution activity to the judicial exception, as discussed in MPEP § 2106.05(g), does not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. 
The claim is directed to an abstract idea.
Step 2B: The claim recites the following additional elements:
a processor
a memory
Merely using a computer as a tool to perform an abstract idea, as discussed in MPEP § 2106.05(f), does not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. 
The claim further recites the following additional element:
	a single neural network
Merely using a neural network as a tool to perform an abstract idea, as discussed in MPEP § 2106.05(h), does not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea.
	performs multi-task training of the neural network using stochastic gradient descent by performing… 
This is considered to be a well-understood, routine, and conventional step as evidenced by Li et al. (Abstract). As discussed in MPEP § 2106.05(d), adding a well-understood, routine and conventional step does not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea.
accepting, as input, corresponding learning data having a corresponding data size for a respective plurality of tasks 
Adding insignificant extra-solution activity to the judicial exception, as discussed in MPEP § 2106.05(g), does not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. 
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception for the reasons given in Step 2A Prong 2. (see MPEP § 2106.05.I.A.)
The claim is not patent eligible.

Regarding claim 2,
Step 1: The claim recites a machine, one of the four categories of eligible subject matter.
Step 2A Prong 1: The claim recites the following limitations:
calculating, for each of the plurality of tasks, a corresponding batch size which meets a condition …
sampling, for each of the plurality of tasks, samples from the corresponding learning data …
updating a corresponding weight of a discriminator for each of the tasks, using the samples sampled …
These limitations each recite a mental process of deciding, which can reasonably be performed in the mind with the aid of pen and paper. Accordingly, the claim recites an abstract idea.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim recites the following additional elements:
a processor
a memory
Merely using a computer as a tool to perform an abstract idea, as discussed in MPEP § 2106.05(f), does not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. 
The claim further recites the following additional element:
	a single neural network
Merely using a neural network as a tool to perform an abstract idea, as discussed in MPEP § 2106.05(h), does not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea.
performs multi-task training of the neural network using stochastic gradient descent by performing…
accepting, as input, corresponding learning data having a corresponding data size for a respective plurality of tasks 
Adding insignificant extra-solution activity to the judicial exception, as discussed in MPEP § 2106.05(g), does not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. 
The claim is directed to an abstract idea.
Step 2B: The claim recites the following additional elements:
a processor
a memory
Merely using a computer as a tool to perform an abstract idea, as discussed in MPEP § 2106.05(f), does not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. 
The claim further recites the following additional element:
	a single neural network
Merely using a neural network as a tool to perform an abstract idea, as discussed in MPEP § 2106.05(h), does not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea.
performs multi-task training of the neural network using stochastic gradient descent by performing… 
This is considered to be a well-understood, routine, and conventional step as evidenced by Li et al. (Abstract). As discussed in MPEP § 2106.05(d), adding a well-understood, routine and conventional step does not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea.
accepting, as input, corresponding learning data having a corresponding data size for a respective plurality of tasks 
Adding insignificant extra-solution activity to the judicial exception, as discussed in MPEP § 2106.05(g), does not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. 
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception for the reasons given in Step 2A Prong 2. (see MPEP § 2106.05.I.A.)
The claim is not patent eligible.



Regarding claim 3,
Step 1: The claim recites a machine, one of the four categories of eligible subject matter.
Step 2A Prong 1: The claim recites the following limitations:
perform a discrimination process using the input …
This limitation recites a mental process of deciding, which can reasonably be performed in the mind with the aid of pen and paper. Accordingly, the claim recites an abstract idea.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim recites the following additional elements:
a signal processing device
the learning device according to claim 1
Merely using a computer as a tool to perform an abstract idea, as discussed in MPEP § 2106.05(f), does not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. 
an input information processor
Adding insignificant extra-solution activity to the judicial exception, as discussed in MPEP § 2106.05(g), does not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. 
The claim is directed to an abstract idea.
Step 2B: The claim recites the following additional elements:
a signal processing device
the learning device according to claim 1
Merely using a computer as a tool to perform an abstract idea, as discussed in MPEP § 2106.05(f), does not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. 
an input information processor
Adding insignificant extra-solution activity to the judicial exception, as discussed in MPEP § 2106.05(g), does not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. 
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception for the reasons given in Step 2A Prong 2. (see MPEP § 2106.05.I.A.)
The claim is not patent eligible.

Regarding claim 4,
Step 1: The claim recites a machine, one of the four categories of eligible subject matter.
Step 2A Prong 1: The claim recites the following limitations:
perform a discrimination process using the input …
This limitation recites a mental process of deciding, which can reasonably be performed in the mind with the aid of pen and paper. Accordingly, the claim recites an abstract idea.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim recites the following additional elements:
a signal processing device
the learning device according to claim 2
Merely using a computer as a tool to perform an abstract idea, as discussed in MPEP § 2106.05(f), does not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. 
an input information processor
Adding insignificant extra-solution activity to the judicial exception, as discussed in MPEP § 2106.05(g), does not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. 
The claim is directed to an abstract idea.
Step 2B: The claim recites the following additional elements:
a signal processing device
the learning device according to claim 2
Merely using a computer as a tool to perform an abstract idea, as discussed in MPEP § 2106.05(f), does not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. 
an input information processor
Adding insignificant extra-solution activity to the judicial exception, as discussed in MPEP § 2106.05(g), does not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. 
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception for the reasons given in Step 2A Prong 2. (see MPEP § 2106.05.I.A.)
The claim is not patent eligible.

Regarding claim 5,
Step 1: The claim recites a method, one of the four categories of eligible subject matter.
Step 2A Prong 1: The claim recites the following limitations:
calculating, for each of the plurality of tasks, a corresponding batch size which meets a condition …
sampling, for each of the plurality of tasks, samples from the corresponding learning data …
updating a corresponding weight of a discriminator for each of the tasks, using the samples sampled …
These limitations each recite a mental process of deciding, which can reasonably be performed in the mind with the aid of pen and paper. Accordingly, the claim recites an abstract idea.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim recites the following additional elements:
performs multi-task training of a single neural network using stochastic gradient descent for…
	accepting, as input, corresponding learning data having a corresponding data size for a respective plurality of tasks
Merely using a computer as a tool to perform an abstract idea, as discussed in MPEP § 2106.05(f), does not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. 
The claim further recites the following additional element:
	a single neural network
Merely using a neural network as a tool to perform an abstract idea, as discussed in MPEP § 2106.05(h), does not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea.
The claim is directed to an abstract idea.
Step 2B: The claim recites the following additional elements:
	accepting, as input, corresponding learning data having a corresponding data size for a respective plurality of tasks 
Merely using a computer as a tool to perform an abstract idea, as discussed in MPEP § 2106.05(f), does not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. 
The claim further recites the following additional element:
	a single neural network
Merely using a neural network as a tool to perform an abstract idea, as discussed in MPEP § 2106.05(h), does not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea.
performs multi-task training of a single neural network using stochastic gradient descent for… 
This is considered to be a well-understood, routine, and conventional step as evidenced by Li et al. (Abstract). As discussed in MPEP § 2106.05(d), adding a well-understood, routine and conventional step does not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea.
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception for the reasons given in Step 2A Prong 2. (see MPEP § 2106.05.I.A.)
The claim is not patent eligible.


Regarding claim 6,
Step 1: The claim recites a method, one of the four categories of eligible subject matter.
Step 2A Prong 1: The claim recites the following limitations:
calculating, for each of the plurality of tasks, a corresponding batch size which meets a condition …
sampling, for each of the plurality of tasks, samples from the corresponding learning data …
updating a corresponding weight of a discriminator for each of the tasks, using the samples sampled …
These limitations each recite a mental process of deciding, which can reasonably be performed in the mind with the aid of pen and paper. Accordingly, the claim recites an abstract idea.
Step 2A Prong 2: This judicial exception is not integrated into a practical application. The claim recites the following additional elements:
performs multi-task training of a single neural network using stochastic gradient descent for… 
	accepting, as input, corresponding learning data having a corresponding data size for a respective plurality of tasks 
Merely using a computer as a tool to perform an abstract idea, as discussed in MPEP § 2106.05(f), does not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. 
The claim further recites the following additional element:
	a single neural network
Merely using a neural network as a tool to perform an abstract idea, as discussed in MPEP § 2106.05(h), does not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea.
The claim is directed to an abstract idea.
Step 2B: The claim recites the following additional elements:
	accepting, as input, corresponding learning data having a corresponding data size for a respective plurality of tasks
Merely using a computer as a tool to perform an abstract idea, as discussed in MPEP § 2106.05(f), does not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. 
The claim further recites the following additional element:
	a single neural network
Merely using a neural network as a tool to perform an abstract idea, as discussed in MPEP § 2106.05(h), does not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea.
performs multi-task training of a single neural network using stochastic gradient descent for… 
This is considered to be a well-understood, routine, and conventional step as evidenced by Li et al. (Abstract). As discussed in MPEP § 2106.05(d), adding a well-understood, routine and conventional step does not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea.
The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception for the reasons given in Step 2A Prong 2. (see MPEP § 2106.05.I.A.)
The claim is not patent eligible.



Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-6 are rejected under 35 U.S.C. 103 as being unpatentable over Misra et al. (“Cross-stitch Networks for Multi-task Learning”, hereinafter "Misra") in view of Li et al. ("Efficient Mini-batch Training for Stochastic Optimization", hereinafter "Li") and further in view of Moritz et al. ("SparkNet: Training Deep Networks in Spark", hereinafter "Moritz").

Regarding claim 1, Misra teaches A learning device for training a single neural network (“This paper proposes cross-stitch units, using which a single network can capture all these Split-architectures (and more). It automatically learns an optimal combination of shared and task-specific representations.” [pg. 3995, right col, top para]) for a plurality of tasks of different types using learning data whose data size varies from task to task (“In order to demonstrate the robustness and effectiveness of cross-stitch units in multi-task learning, we choose varied tasks on multiple datasets. In particular, we select four well established and diverse tasks on different types of image datasets:” [pg. 3996, ¶2]), comprising: 
a processor to execute a program; and 
a memory to store the program which, when executed by the processor (pg. 3999, Section 6, "Experiments"; pg. 3995, col. 2, para. 3, "ConvNets in computer vision"; it is clear that Misra performs their method on a computer), performs multi-task training of the neural network using stochastic gradient descent (“We fine-tune the network for semantic segmentation for 25k iterations using SGD” [pg. 3998, left col, ¶1]) by performing processes of,
accepting, as input, corresponding learning data, having a corresponding data size, for each of the plurality of tasks (pg. 3997, Image input in Fig. 4; pg. 3996, col. 2, Section 3.3, para. 1, “Consider a case of multi task learning with two tasks A and B on the same input image.”);
sampling, for each of the plurality of tasks, samples from the corresponding learning data with the calculated corresponding batch size (pg. 3998, col. 1, para. 1, “We fine-tune the network for semantic segmentation for 25k iterations using SGD (mini-batch size 20) and for surface normal prediction for 15k iterations (mini-batch size 20)”; pg. 3999, col. 1, Section 6, para. 1, "Fast-RCNN carefully constructs mini-batches with 1 : 3 foreground-to-background ratio, i.e., at most 25% of foreground samples in a mini-batch."; pg. 3999, col. 1, Section 6, para. 1 – col. 2, para. 1, “we use the same mini-batch sampling strategy; and in every mini-batch only the foreground samples contribute to the attribute loss (and background samples are ignored)”); and
updating a corresponding weight of a discriminator of the neural network for each of the plurality of tasks, using the samples sampled (pg. 3997, col. 2, Section 5, “We then finetune the network (referred to as one-task network) from ImageNet [9] for each task”; pg. 3998, col. 1, para. 1, “We fine-tune the network for semantic segmentation for 25k iterations using SGD (mini-batch size 20) and for surface normal prediction for 15k iterations (mini-batch size 20)”). (1. Finetuning the neural network is interpreted as updating the weights of the neural network. Fine tuning means taking the weights of a trained neural network and updating the weights for further training of a model. See further, pg. 3998, §5.2, ¶1: “While training, we found that the gradient updates at various layers had magnitudes which were reasonable for updating the layer parameters, but too small for the cross-stitch units. Thus, we use higher learning rates for the cross-stitch units than the base network. In practice, this leads to faster convergence and better performance.” 2. Semantic segmentation and normal prediction are both examples of discriminating between different objects/orientations in the scene.).
Although Misra discloses that the batch sizes are the same, the reference does not detail dividing the data size of the corresponding learning data for a task by the corresponding batch size for the task.
Li teaches calculating, for each of the plurality of tasks, a corresponding batch size which meets a condition that a value obtained by dividing the data size of the corresponding learning data for a task by the corresponding batch size for the task (“We begin with a brief review of a naive variant of minibatch SGD. During training it processes a group of examples per iteration. For notational simplicity, assume that n is divisible by the number of mini-batches m. Then we partition the examples into m mini-batches, each of size b = n/m. Note that this assumption is not required neither for the proof nor for the implementation. Likewise, the pre-partitioning step is also not necessary in practice, however, it simplifies the exposition of what follows.” [pg. 662, § 2.1 Mini-Batch Stochastic Gradient Descent, ¶1]).
Misra and Li are both in the same field of endeavor of training networks using mini-batch stochastic gradient descent. Misra teaches cross-stitch networks for multi-task learning. Li teaches a method for efficient mini-batch training for stochastic optimization. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Misra’s multi-task learning method by implementing the mini-batch stochastic gradient descent method as taught by Li. One would have been motivated to make this modification to reduce the communication cost for minibatch training. [Abstract, Li]
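For illustration only, the partitioning described in the Li passage quoted above (n examples split into m mini-batches, each of size b = n/m) can be sketched as follows. The function name and the sample values are hypothetical and do not appear in Li; the sketch assumes, as Li does for notational simplicity, that n is divisible by m.

```python
def partition_into_minibatches(examples, m):
    """Split examples into m mini-batches, each of size b = n / m.

    Hypothetical helper for illustration; assumes n is divisible by m.
    """
    n = len(examples)
    if m <= 0 or n % m != 0:
        raise ValueError("this sketch assumes n is divisible by m")
    b = n // m  # mini-batch size
    return [examples[i * b:(i + 1) * b] for i in range(m)]


# Hypothetical sample data: n = 12 examples, m = 4 mini-batches.
batches = partition_into_minibatches(list(range(12)), m=4)
batch_lengths = [len(batch) for batch in batches]
```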
The combination of Misra and Li fails to explicitly teach that the values obtained by dividing the data size by the batch size are the same.
Moritz teaches calculating a value, obtained by dividing the data size by the batch size, that is the same (“This figure depicts a parallel run of SGD on K = 4 machines under a naive parallelization scheme. At each iteration, each batch of size b is divided among the K machines, the gradients over the subsets are computed separately on each machine, the updates are aggregated, and the new model is broadcast to the workers. Algorithmically, this approach is exactly equivalent to the serial run of SGD in Figure 2a and so the number of iterations required to achieve an accuracy of a is the same value Na(b)” [pg. 6, Figure 2(b)]).
Misra, Li, and Moritz are all in the same field of endeavor of training networks using mini-batch stochastic gradient descent. Misra teaches cross-stitch networks for multi-task learning. Li teaches a method for efficient mini-batch training for stochastic optimization. Moritz teaches training deep networks in Spark. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Misra’s/Li’s teachings by calculating the same value obtained by dividing the data size by the batch size for a plurality of tasks as taught by Moritz. One would have been motivated to make this modification to reduce computational time of large-scale models. [pg. 1, §1. Introduction, ¶1, Moritz]
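For illustration only, the equivalence described in the Moritz passage quoted above can be sketched as follows: a batch of size b is divided among K workers, each worker computes a gradient contribution over its subset, and the aggregated result matches the gradient computed serially over the whole batch. The one-parameter linear model, the squared-error loss, the integer sample data, and all function names are assumptions chosen for this sketch and do not appear in Moritz.

```python
def grad_sum(w, shard):
    """Sum over the shard of d/dw (w*x - y)^2 for a one-parameter model."""
    return sum(2 * (w * x - y) * x for x, y in shard)


def serial_gradient(w, batch):
    """Mean gradient over the whole batch, computed on one machine."""
    return grad_sum(w, batch) / len(batch)


def parallel_gradient(w, batch, k):
    """Mean gradient with the batch divided evenly among k workers."""
    b = len(batch)
    if k <= 0 or b % k != 0:
        raise ValueError("this sketch assumes the batch divides evenly among k workers")
    shard_size = b // k
    shards = [batch[i * shard_size:(i + 1) * shard_size] for i in range(k)]
    partial_sums = [grad_sum(w, shard) for shard in shards]  # one per worker
    return sum(partial_sums) / b  # aggregate the workers' contributions


# Hypothetical integer data so that the sums are exact.
batch = [(1, 2), (2, 3), (3, 5), (4, 4), (5, 9), (6, 7), (7, 8), (8, 12)]
g_serial = serial_gradient(3, batch)
g_parallel = parallel_gradient(3, batch, k=4)
```

Because the per-worker sums add up to the sum over the full batch, the two gradients coincide in this sketch, which is the sense in which the quoted passage calls the parallel run equivalent to the serial run.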

	Regarding claim 2, Misra teaches A learning device for training a single neural network (“This paper proposes cross-stitch units, using which a single network can capture all these Split-architectures (and more). It automatically learns an optimal combination of shared and task-specific representations.” [pg. 3995, right col, top para]) for a plurality of tasks of different types using learning data whose data size varies from task to task (“In order to demonstrate the robustness and effectiveness of cross-stitch units in multi-task learning, we choose varied tasks on multiple datasets. In particular, we select four well established and diverse tasks on different types of image datasets:” [pg. 3996, ¶2]), comprising:
a processor to execute a program; and 
a memory to store the program which, when executed by the processor (pg. 3999, Section 6, "Experiments"; pg. 3995, col. 2, para. 3, "ConvNets in computer vision"; it is clear that Misra performs their method on a computer), performs multi-task training of the neural network using stochastic gradient descent (“We fine-tune the network for semantic segmentation for 25k iterations using SGD” [pg. 3998, left col, ¶1]) by performing processes of,
accepting, as input, corresponding learning data, having a corresponding data size, for each of the plurality of tasks (pg. 3997, Image input in Fig. 4; pg. 3996, col. 2, Section 3.3, para. 1, “Consider a case of multi task learning with two tasks A and B on the same input image.”);
sampling, for each of the plurality of tasks, samples from the corresponding learning data with the calculated corresponding batch size (pg. 3998, col. 1, para. 1, “We fine-tune the network for semantic segmentation for 25k iterations using SGD (mini-batch size 20) and for surface normal prediction for 15k iterations (mini-batch size 20)”; pg. 3999, col. 1, Section 6, para. 1, "Fast-RCNN carefully constructs mini-batches with 1 : 3 foreground-to-background ratio, i.e., at most 25% of foreground samples in a mini-batch."; pg. 3999, col. 1, Section 6, para. 1 – col. 2, para. 1, “we use the same mini-batch sampling strategy; and in every mini-batch only the foreground samples contribute to the attribute loss (and background samples are ignored)”); and
updating a corresponding weight of a discriminator of the neural network for each of the plurality of tasks, using the samples sampled (pg. 3997, col. 2, Section 5, “We then finetune the network (referred to as one-task network) from ImageNet [9] for each task”; pg. 3998, col. 1, para. 1, “We fine-tune the network for semantic segmentation for 25k iterations using SGD (mini-batch size 20) and for surface normal prediction for 15k iterations (mini-batch size 20)”). (1. Finetuning the neural network is interpreted as updating the weights of the neural network. Fine tuning means taking the weights of a trained neural network and updating the weights for further training of a model. See further, pg. 3998, §5.2, ¶1: “While training, we found that the gradient updates at various layers had magnitudes which were reasonable for updating the layer parameters, but too small for the cross-stitch units. Thus, we use higher learning rates for the cross-stitch units than the base network. In practice, this leads to faster convergence and better performance.” 2. Semantic segmentation and normal prediction are both examples of discriminating between different objects/orientations in the scene.).
Although Misra discloses that the batch sizes are the same, the reference does not detail dividing the data size of the corresponding learning data for a task by the corresponding batch size for the task.
Li teaches calculating, for each of the plurality of tasks, a respective batch size whose ratio to the corresponding data size for the task (“We begin with a brief review of a naive variant of minibatch SGD. During training it processes a group of examples per iteration. For notational simplicity, assume that n is divisible by the number of mini-batches m. Then we partition the examples into m mini-batches, each of size b = n/m. Note that this assumption is not required neither for the proof nor for the implementation. Likewise, the pre-partitioning step is also not necessary in practice, however, it simplifies the exposition of what follows.” [pg. 662, § 2.1 Mini-Batch Stochastic Gradient Descent, ¶1]).
Misra and Li are both in the same field of endeavor of training networks using mini-batch stochastic gradient descent. Misra teaches cross-stitch networks for multi-task learning. Li teaches a method for efficient mini-batch training for stochastic optimization. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Misra’s multi-task learning method by implementing the mini-batch stochastic gradient descent method as taught by Li. One would have been motivated to make this modification to reduce the communication cost for minibatch training. [Abstract, Li]
The combination of Misra and Li fails to explicitly teach that the value obtained by dividing the data size by the batch size has a fixed value.
Moritz teaches calculating a fixed value, obtained by dividing the data size by the batch size, between the plurality of tasks (“This figure depicts a parallel run of SGD on K = 4 machines under a naive parallelization scheme. At each iteration, each batch of size b is divided among the K machines, the gradients over the subsets are computed separately on each machine, the updates are aggregated, and the new model is broadcast to the workers. Algorithmically, this approach is exactly equivalent to the serial run of SGD in Figure 2a and so the number of iterations required to achieve an accuracy of a is the same value Na(b)” [pg. 6, Figure 2(b)]).
Misra, Li, and Moritz are all in the same field of endeavor of training networks using mini-batch stochastic gradient descent. Misra teaches cross-stitch networks for multi-task learning. Li teaches a method for efficient mini-batch training for stochastic optimization. Moritz teaches training deep networks in SPARK. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Misra’s/Li’s teachings by calculating the same value obtained by dividing the data size by the batch size for a plurality of tasks as taught by Moritz. One would have been motivated to make this modification to reduce computational time of large-scale models. [pg. 1, §1. Introduction, ¶1, Moritz]
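For illustration only, the arithmetic underlying the limitation discussed above (a per-task batch size chosen so that the data size divided by the batch size has the same value for every task) can be sketched as follows. This is a hypothetical sketch and is not taken from Misra, Li, Moritz, or the application; the function name `batch_sizes_for_common_ratio`, the task names, and the data sizes are assumptions chosen only to make the arithmetic concrete.

```python
def batch_sizes_for_common_ratio(data_sizes, iterations):
    """Return one batch size per task so that data_size / batch_size
    equals `iterations` for every task.

    Hypothetical helper for illustration; it assumes each data size is
    exactly divisible by `iterations`.
    """
    batch_sizes = {}
    for task, n in data_sizes.items():
        if n % iterations != 0:
            raise ValueError(
                f"data size {n} for task {task!r} is not divisible by {iterations}"
            )
        batch_sizes[task] = n // iterations
    return batch_sizes


# Assumed example values: two tasks whose learning data differ in size.
data_sizes = {"task_A": 2000, "task_B": 500}
batch_sizes = batch_sizes_for_common_ratio(data_sizes, iterations=100)

# The quotient data_size / batch_size is now identical across tasks.
ratios = {task: data_sizes[task] // batch_sizes[task] for task in data_sizes}
```

Under these assumed values, the batch sizes differ from task to task while the quotient of data size and batch size is the same for both tasks.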

Regarding claim 3, 
Misra teaches a signal processing device comprising:
an input information processor (pg. 3999, Section 6, "Experiments"; pg.3995, col. 2, para. 3, "ConvNets in computer vision") (It is clear that Misra performs their method on a computer, in which a processor, i.e., input information processor is inherent).
to accept input of input information; (pg. 3997, Image input in Fig. 4; pg. 3996, col. 2, Section 3.3, para. 1, “Consider a case of multi task learning with two tasks A and B on the same input image.”) and
a discriminator to perform a discrimination process using the input information accepted by the input information processor, (pg. 3997, col. 2, Fig. 4).
the discriminator being caused to learn (pg. 3997, col. 2, Section 4, para. 1, “We use the cross-stitch unit for multi-task learning in ConvNets”; pg. 3994, Abstract, “These units combine the activations from multiple networks”) by the learning device according to claim 1 (the rejection of claim 1 is incorporated).

Regarding claim 4, 
Misra teaches a signal processing device comprising:
an input information processor (pg. 3999, Section 6, "Experiments"; pg.3995, col. 2, para. 3, "ConvNets in computer vision") (It is clear that Misra performs their method on a computer, in which a processor, i.e., input information processor is inherent).
to accept input of input information; (pg. 3997, Image input in Fig. 4; pg. 3996, col. 2, Section 3.3, para. 1, “Consider a case of multi task learning with two tasks A and B on the same input image.”) and
a discriminator to perform a discrimination process using the input information accepted by the input information processor, (pg. 3997, col. 2, Fig. 4).
the discriminator being caused to learn (pg. 3997, col. 2, Section 4, para. 1, “We use the cross-stitch unit for multi-task learning in ConvNets”; pg. 3994, Abstract, “These units combine the activations from multiple networks”) by the learning device according to claim 2 (the rejection of claim 2 is incorporated).

Regarding claim 5, 
Misra teaches a learning method of performing multi-task training of a single neural network using stochastic gradient descent (“This paper proposes cross-stitch units, using which a single network can capture all these Split-architectures (and more). It automatically learns an optimal combination of shared and task-specific representations.” [pg. 3995, right col, top para]) for a plurality of tasks of different types using learning data whose data size varies from task to task (“In order to demonstrate the robustness and effectiveness of cross-stitch units in multi-task learning, we choose varied tasks on multiple datasets. In particular, we select four well established and diverse tasks on different types of image datasets:” [pg. 3996, ¶2]), the method comprising:
accepting, as input, corresponding learning data, having a corresponding data size, for each of the plurality of tasks (pg. 3997, Image input in Fig. 4; pg. 3996, col. 2, Section 3.3, para. 1, “Consider a case of multi task learning with two tasks A and B on the same input image.”);
sampling, for each of the plurality of tasks, samples from the corresponding learning data with the calculated corresponding batch size (pg. 3998, col. 1, para. 1, “We fine-tune the network for semantic segmentation for 25k iterations using SGD (mini-batch size 20) and for surface normal prediction for 15k iterations(mini-batch size 20)”; pg. 3999, col. 1, Section 6, para. 1, "Fast-RCNN carefully constructs mini-batches with 1 : 3 foreground-to-background ratio, i.e., at most 25% of foreground samples in a mini-batch."; pg. 3999, col. 1, Section 6, para. 1 – col.2 para. 1, “we use the same mini-batch sampling strategy; and in every mini-batch only the fore-ground samples contribute to the attribute loss (and back-ground samples are ignored)”); and
updating a corresponding weight of a discriminator of the neural network for each of the plurality of tasks, using the samples sampled (pg. 3997, col. 2, Section 5, “We then finetune the network (referred to as one-task network) from ImageNet [9] for each task”; pg. 3998, col. 1, para. 1, “We fine-tune the network for semantic segmentation for 25k iterations using SGD (mini-batch size 20) and for surface normal prediction for 15k iterations(mini-batch size 20)”). (1. Finetuning the neural network is interpreted as updating the weights of the neural network. Fine tuning means taking the weights of a trained neural network and updating the weights for further training of a model. See further, pg. 3998, §5.2, ¶1: “While training, we found that the gradient updates at various layers had magnitudes which were reasonable for updating the layer parameters, but too small for the cross-stitch units. Thus, we use higher learning rates for the cross-stitch units than the base network. In practice, this leads to faster convergence and better performance.” 2. Semantic segmentation and normal prediction are both examples of discriminating between different objects/orientations in the scene.)
Although Misra discloses that the batch sizes are the same, the reference does not detail dividing the data size of the corresponding learning data for a task by the corresponding batch size for the task.
Li teaches calculating, for each of the plurality of tasks, a corresponding batch size which meets a condition that a value obtained by dividing the data size of the corresponding learning data for a task by the corresponding batch size for the task (“We begin with a brief review of a naive variant of minibatch SGD. During training it processes a group of examples per iteration. For notational simplicity, assume that n is divisible by the number of mini-batches m. Then we partition the examples into m mini-batches, each of size b = n/m. Note that this assumption is not required neither for the proof nor for the implementation. Likewise, the pre-partitioning step is also not necessary in practice, however, it simplifies the exposition of what follows.” [pg. 662, § 2.1 Mini-Batch Stochastic Gradient Descent, ¶1]).
Misra and Li are both in the same field of endeavor of training networks using mini-batch stochastic gradient descent. Misra teaches cross-stitch networks for multi-task learning. Li teaches a method for efficient mini-batch training for stochastic optimization. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Misra’s multi-task learning method by implementing the mini-batch stochastic gradient descent method as taught by Li. One would have been motivated to make this modification to reduce the communication cost for minibatch training. [Abstract, Li]
	The combination of Misra and Li fails to explicitly teach that the values obtained by dividing the data size by the batch size are the same.
	Moritz teaches calculating values, obtained by dividing the data size by the batch size, that are the same (“This figure depicts a parallel run of SGD on K = 4 machines under a naive parallelization scheme. At each iteration, each batch of size b is divided among the K machines, the gradients over the subsets are computed separately on each machine, the updates are aggregated, and the new model is broadcast to the workers. Algorithmically, this approach is exactly equivalent to the serial run of SGD in Figure 2a and so the number of iterations required to achieve an accuracy of a is the same value Na(b)” [pg. 6, Figure 2(b)]).
Misra, Li, and Moritz are all in the same field of endeavor of training networks using mini-batch stochastic gradient descent. Misra teaches cross-stitch networks for multi-task learning. Li teaches a method for efficient mini-batch training for stochastic optimization. Moritz teaches training deep networks in SPARK. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Misra’s/Li’s teachings by calculating the same value obtained by dividing the data size by the batch size for a plurality of tasks as taught by Moritz. One would have been motivated to make this modification to reduce computational time of large-scale models. [pg. 1, §1. Introduction, ¶1, Moritz]
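For illustration only, the pre-partitioning step quoted from Li above (n examples split into m mini-batches, each of size b = n/m) can be sketched as follows. This is a hypothetical sketch and not Li's implementation; the function name `partition_into_minibatches`, the shuffling step, and the `seed` parameter are assumptions added for the example.

```python
import random


def partition_into_minibatches(examples, m, seed=0):
    """Split `examples` into m mini-batches, each of size b = n / m.

    Hypothetical helper for illustration. As in the quoted passage, n is
    assumed to be divisible by m. The shuffle before splitting is an
    assumption of this sketch, not something stated in the quotation.
    """
    shuffled = list(examples)
    n = len(shuffled)
    if m <= 0 or n % m != 0:
        raise ValueError("n must be divisible by a positive number of mini-batches m")
    b = n // m  # mini-batch size
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i * b:(i + 1) * b] for i in range(m)]


# Assumed example values: n = 12 examples, m = 3 mini-batches, so b = 4.
batches = partition_into_minibatches(range(12), m=3)
```

Each example lands in exactly one mini-batch, so one pass over all m mini-batches visits the full set of n examples once.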

Regarding claim 6, Misra teaches a learning method of performing multi-task training of a single neural network (“This paper proposes cross-stitch units, using which a single network can capture all these Split-architectures (and more). It automatically learns an optimal combination of shared and task-specific representations.” [pg. 3995, right col, top para]) using stochastic gradient descent (“We fine-tune the network for semantic segmentation for 25k iterations using SGD” [pg. 3998, left col, ¶1]) for a plurality of tasks of different types using learning data whose data size varies from task to task (“In order to demonstrate the robustness and effectiveness of cross-stitch units in multi-task learning, we choose varied tasks on multiple datasets. In particular, we select four well established and diverse tasks on different types of image datasets:” [pg. 3996, ¶2]), comprising:
accepting, as input, corresponding learning data, having a corresponding data size, for each of the plurality of tasks (pg. 3997, Image input in Fig. 4; pg. 3996, col. 2, Section 3.3, para. 1, “Consider a case of multi task learning with two tasks A and B on the same input image.”);
sampling, for each of the plurality of tasks, samples from the corresponding learning data with the calculated corresponding batch size (pg. 3998, col. 1, para. 1, “We fine-tune the network for semantic segmentation for 25k iterations using SGD (mini-batch size 20) and for surface normal prediction for 15k iterations(mini-batch size 20)”; pg. 3999, col. 1, Section 6, para. 1, "Fast-RCNN carefully constructs mini-batches with 1 : 3 foreground-to-background ratio, i.e., at most 25% of foreground samples in a mini-batch."; pg. 3999, col. 1, Section 6, para. 1 – col.2 para. 1, “we use the same mini-batch sampling strategy; and in every mini-batch only the fore-ground samples contribute to the attribute loss (and back-ground samples are ignored)”); and 
updating a corresponding weight of a discriminator of the neural network for each of the plurality of tasks, using the samples sampled (pg. 3997, col. 2, Section 5, “We then finetune the network (referred to as one-task network) from ImageNet [9] for each task”; pg. 3998, col. 1, para. 1, “We fine-tune the network for semantic segmentation for 25k iterations using SGD (mini-batch size 20) and for surface normal prediction for 15k iterations(mini-batch size 20)”). (1. Finetuning the neural network is interpreted as updating the weights of the neural network. Fine tuning means taking the weights of a trained neural network and updating the weights for further training of a model. See further, pg. 3998, §5.2, ¶1: “While training, we found that the gradient updates at various layers had magnitudes which were reasonable for updating the layer parameters, but too small for the cross-stitch units. Thus, we use higher learning rates for the cross-stitch units than the base network. In practice, this leads to faster convergence and better performance.” 2. Semantic segmentation and normal prediction are both examples of discriminating between different objects/orientations in the scene.)
Although Misra discloses that the batch sizes are the same, the reference does not detail dividing the data size of the corresponding learning data for a task by the corresponding batch size for the task.
Li teaches calculating, for each of the plurality of tasks, a respective batch size whose ratio to the corresponding data size for the task (“We begin with a brief review of a naive variant of minibatch SGD. During training it processes a group of examples per iteration. For notational simplicity, assume that n is divisible by the number of mini-batches m. Then we partition the examples into m mini-batches, each of size b = n/m. Note that this assumption is not required neither for the proof nor for the implementation. Likewise, the pre-partitioning step is also not necessary in practice, however, it simplifies the exposition of what follows.” [pg. 662, § 2.1 Mini-Batch Stochastic Gradient Descent, ¶1]).
Misra and Li are both in the same field of endeavor of training networks using mini-batch stochastic gradient descent. Misra teaches cross-stitch networks for multi-task learning. Li teaches a method for efficient mini-batch training for stochastic optimization. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Misra’s multi-task learning method by implementing the mini-batch stochastic gradient descent method as taught by Li. One would have been motivated to make this modification to reduce the communication cost for minibatch training. [Abstract, Li]
	The combination of Misra and Li fails to explicitly teach that the value obtained by dividing the data size by the batch size has a fixed value.
	Moritz teaches calculating a fixed value obtained by dividing the data size by the batch size between the plurality of tasks (“This figure depicts a parallel run of SGD on K = 4 machines under a naive parallelization scheme. At each iteration, each batch of size b is divided among the K machines, the gradients over the subsets are computed separately on each machine, the updates are aggregated, and the new model is broadcast to the workers. Algorithmically, this approach is exactly equivalent to the serial run of SGD in Figure 2a and so the number of iterations required to achieve an accuracy of a is the same value Na(b)” [pg. 6, Figure 2(b)]).
Misra, Li, and Moritz are all in the same field of endeavor of training networks using mini-batch stochastic gradient descent. Misra teaches cross-stitch networks for multi-task learning. Li teaches a method for efficient mini-batch training for stochastic optimization. Moritz teaches training deep networks in SPARK. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Misra’s/Li’s teachings by calculating the same value obtained by dividing the data size by the batch size for a plurality of tasks as taught by Moritz. One would have been motivated to make this modification to reduce computational time of large-scale models. [pg. 1, §1. Introduction, ¶1, Moritz]

Response to Arguments

Regarding the 35 U.S.C. § 101 Rejection:
Applicant’s arguments on pgs. 5-6 regarding the 101 rejection have been considered but are not persuasive. The amended limitation, specifically “performing multi-task training of a single neural network using stochastic gradient descent…”, is considered to be an insignificant extra-solution activity under Step 2A Prong 2. The examiner has provided a Berkheimer analysis as evidence that this limitation is well-understood, routine, and conventional under Step 2B. Please see the updated 101 rejection above.

Regarding the 35 U.S.C. § 102 Rejection:
Applicant’s arguments, see pg. 6, filed 02/03/2022, with respect to the rejections of claims 1-6 under 35 U.S.C. § 102(a)(1) have been fully considered and are persuasive. The prior art of Misra does not appear to disclose that the values obtained by dividing the data size by the batch size are the same, or that the ratio of the batch size to the corresponding data size has a fixed value. Therefore, the rejection has been withdrawn. However, upon further consideration, new grounds of rejection are made in view of the newly presented art of Li and Moritz. Please see the updated prior art rejection above.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Ravindran et al. (US 20160259994 A1) discloses, at ¶ [0035], that the number of training images divided by the batch size is the total number of iterations in one epoch.

Li et al. (“Heterogeneous multi-task learning for human pose estimation with deep convolutional neural network”) discloses multi-task learning with multiple regression/classification tasks. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHAEL H HOANG whose telephone number is (571)272-8491. The examiner can normally be reached Mon-Fri 8:30AM-4:30PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki can be reached on (571) 272-3719. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/M.H.H./Examiner, Art Unit 2122

/ERIC NILSSON/Primary Examiner, Art Unit 2122