DETAILED ACTION
This action is in response to the claims filed 09/11/2020 for application 17/018,555. Claims 1-20 are currently pending. 

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 7, 9, 11, and 18 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.

Claims 7 and 18 recite the limitation "a particular neural network layer of the neural network". The limitation is unclear because "the neural network" could be interpreted as the same neural network previously recited or as a separate, different neural network. Thus, one of ordinary skill in the art could not determine which neural network applies the noise function, or whether the noise function is applied to the same neural network. Therefore, the claims have an indefinite scope. For purposes of examination, the examiner will interpret the neural network as being a second neural network different from the first machine learning model.
Claim 9 recites the limitation “modifying attributes…of the neural network to train the second machine-learning model”. Similarly to claim 7, the limitation is unclear because “the neural network” could be interpreted as the same neural network or as a separate, different neural network. Thus, one of ordinary skill in the art could not determine which neural network performs the modifying step, or whether the same neural network performs it. Therefore, the claim has an indefinite scope. For purposes of examination, the examiner will interpret the neural network as being a second neural network different from the first machine learning model.
Claim 11 recites the limitation "the first neural network and the second neural network".  There is insufficient antecedent basis for this limitation in the claim.


Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claims 1-7, 9-18, and 20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Zhu et al. ("Improving Semantic Segmentation via Self-Training", hereinafter "Zhu").

Regarding claim 1, Zhu teaches A computer-implemented method (§3.3 discloses use of GPU machines) comprising: 
obtaining data specifying a trained first machine-learning model that has been trained on labeled data, wherein each of the labeled data and the first machine-learning model are un-noised when the first machine-learning model is trained (“We present an overview of our self-training framework in Fig. 1. Given a small quantity of labeled training samples (an image and a human-annotated segmentation mask), we first train a teacher model with standard cross-entropy loss” [pg. 4, §3.1, ¶2; Teacher model corresponds to the first machine-learning model]) and the first machine-learning model is a neural network (“Table 3. Our self-training method can improve the student models irrespective of backbones and network architectures” [pg. 10, Table 3 discloses “Teacher model is a DeepLab network” which is a CNN model]); 
generating first pseudo labeled data by generating a respective pseudo label for each of a plurality of items of unlabeled data by processing the items of unlabeled data using the trained first machine-learning model (“We then use the teacher model to generate pseudo labels on a large number of unlabeled images. The better the teacher model is, the higher the quality of the generated pseudo labels can be. As can be seen in Fig. 1, our teacher-generated pseudo labels have a good quality that are close to human annotations. More visualizations of pseudo labels can be found in the Appendix.” [pg. 4, §3.1, ¶3]); and 
training a second machine-learning model on a first combined dataset, wherein the first combined dataset comprises the labeled data and the first pseudo labeled data (“Finally, we train a student model using both human-annotated labels (real labels) and teacher-generated labels (pseudo labels).” [pg. 4, §3.1, ¶4, See further: Fig. 1, student model corresponds to a second machine learning model]) and the second machine-learning model is a neural network (“To be specific, we take our well-trained student model (DeepLabV3+ with WideResNet38 backbone), and finetune it on target datasets to report the performance.” [pg. 13, §4.4, ¶1; DeepLab is a CNN model]), the training comprising: 
during the training, adding noise to the second machine-learning model, comprising (i) modifying attributes of one or more items in the first combined data set (“We employ the SGD optimizer for all the experiments. We set the initial learning rate to 0.02 for training from scratch and 0.002 for finetuning. We use a polynomial learning rate policy [38], where the initial learning rate is multiplied by (1 − epoch/max_epoch)^power with a power of 0.9. Momentum and weight decay are set to 0.9 and 0.0001 respectively. Synchronized batch normalization [76] is used with a default batch size of 16. Using our fast training schedule, the batch size can increase to 64 when the crop size is smaller. The number of training epochs is set to 180 for both Cityscapes and Mapillary, 80 for CamVid and 50 for KITTI. The crop size is set to 800 for both Cityscapes and Mapillary, 640 for CamVid and 368 for KITTI due to different image resolutions. For data augmentation, we perform random spatial scaling (from 0.5 to 2.0), horizontal flipping, Gaussian blur and color jittering (0.1) during training.” [pg. 8, § 4.2, ¶1]), (ii) modifying operations performed by the second machine-learning model (“Hence, we propose to warm-up the crop size in the early epochs. We use a large crop size of 800 in the first 20 epochs, and then switch to the fast training schedules. We can see that by using coarse2fine+ with crop size warm-up, our fast training schedule is able to match the performance of baseline with 1.7x speed up. Note that, the speed up is model dependent. We can enjoy 2x speed up when training with a heavier model using the WideResNet38 backbone.” [pg. 11, Fast training schedules, ¶1; Fast training would be equivalent to modifying operations performed by a second model]), or (iii) both (note: The claim under BRI requires only one of the limitations as the limitation recites “or”, however examiner has provided a corresponding citation for both.).
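
The three steps mapped above for claim 1 (an un-noised teacher, pseudo labels on unlabeled items, and a student trained with noise on the combined dataset) can be pictured with a minimal sketch. The sketch is illustrative only and is not taken from Zhu or from the application: the 1-D threshold classifier, the function names (`train_threshold`, `add_input_noise`, `self_train`) and the toy data are all assumptions chosen to keep the example self-contained.

```python
import random

def train_threshold(samples):
    """Toy stand-in for model training: threshold at the midpoint of class means."""
    zeros = [x for x, y in samples if y == 0]
    ones = [x for x, y in samples if y == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2.0

def predict(threshold, x):
    """Classify a scalar item against the learned threshold."""
    return 1 if x > threshold else 0

def add_input_noise(samples, rng, scale=0.05):
    """Modify item attributes (hypothetical stand-in for data augmentation)."""
    return [(x + rng.uniform(-scale, scale), y) for x, y in samples]

def self_train(labeled, unlabeled, rng):
    # Step 1: teacher trained on labeled data only, with no noise applied.
    teacher = train_threshold(labeled)
    # Step 2: teacher generates one pseudo label per unlabeled item.
    pseudo = [(x, predict(teacher, x)) for x in unlabeled]
    # Step 3: student trained on labeled + pseudo-labeled data, with input noise.
    combined = labeled + pseudo
    student = train_threshold(add_input_noise(combined, rng))
    return teacher, pseudo, student

rng = random.Random(0)
labeled = [(0.1, 0), (0.2, 0), (0.8, 1), (0.9, 1)]
unlabeled = [0.05, 0.3, 0.7, 0.95]
teacher, pseudo, student = self_train(labeled, unlabeled, rng)
```

In this sketch the noise is applied only to the student's training inputs, which corresponds to option (i) of the limitation; option (ii) is not modeled.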

Regarding claim 2, Zhu teaches The method of claim 1, comprising: generating second pseudo labeled data by generating a respective pseudo label for each of the plurality of items of unlabeled data by processing the items of unlabeled data using the trained second machine-learning model (“Teacher-student learning could be iterative which means we can use the student as teacher, generate more accurate pseudo labels” [pg. 21, § A.3, ¶1]); and 
training a third machine-learning model on a second combined dataset that includes the labeled data and the second pseudo labeled data (“and then retrain another student model. Here, we use more loops of self-training to see if helps semantic segmentation. As seen in Table 9, using a single-loop of teacher-student training is able to achieve promising results (80.0%). 2-loop obtains slightly worse results (79.9%), and 3-loop is slightly better (80.2%). In terms of a good trade-off between accuracy and resources, we only perform a single iteration of teacher-student for all experiments.” [pg. 21, § A.3, ¶1-2]).
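
The iterative arrangement cited from § A.3, in which the student of one loop serves as the teacher of the next, can be sketched as follows. This is a hedged illustration under assumed names (`fit`, `label`, `iterate_self_training`) and a toy 1-D threshold model; it does not reproduce Zhu's segmentation networks or the results of Table 9.

```python
def fit(samples):
    """Toy 1-D model: decision threshold at the midpoint of the two class means.

    Assumes both classes 0 and 1 are present in `samples`.
    """
    by_class = {0: [], 1: []}
    for x, y in samples:
        by_class[y].append(x)
    return sum(sum(v) / len(v) for v in by_class.values()) / 2.0

def label(model, xs):
    """Generate a pseudo label for every unlabeled item using `model`."""
    return [(x, 1 if x > model else 0) for x in xs]

def iterate_self_training(labeled, unlabeled, loops):
    """Return [teacher, student_1, ..., student_loops]."""
    models = [fit(labeled)]  # first model: trained on labeled data only
    for _ in range(loops):
        pseudo = label(models[-1], unlabeled)   # latest model acts as teacher
        models.append(fit(labeled + pseudo))    # next model trained on combined data
    return models

models = iterate_self_training(
    labeled=[(0.0, 0), (1.0, 1)],
    unlabeled=[0.2, 0.4, 0.9],
    loops=2,
)
```

With `loops=2`, `models[1]` plays the role of the second machine-learning model and `models[2]` the role of the third.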

Regarding claim 3, Zhu teaches The method of claim 2, wherein training the second machine-learning model comprises: 
training a machine-learning model that has a respective model size that is larger than a respective model size of the first machine-learning model that has been trained on the labeled data (“Generalizing to other students Self-training is model-agnostic. It is a way to increase the number of training samples, and improve the accuracy and robustness of model itself. Here we would like to show that the pseudo labels generated by our teacher model (DeepLabV3+ with ResNeXt50 backbone), can improve the performance of (1) a heavier model (DeepLabV3+ with WideResNet38 backbone [59]); (2) a fast model (FastSCNN [49]) and (3) another widely adopted segmentation model (PSPNet with ResNet101 backbone [76])... As shown in Table 3, our self-training method can improve the student model irrespective of the backbones and network architectures, which demonstrates its great generalization capability. We want to emphasize again that for all three students, our results are not only better than their comparing baseline, but also outperforms the models pre-trained on Mapillary labeled data. In addition, our trained FastSCNN model achieves an mIoU score of 72.5% on the Cityscapes validation set, with only 1.1M parameters. ” [pg. 11, Generalizing to other students, ¶1-2]).

Regarding claim 4, Zhu teaches The method of claim 2, wherein training the second machine-learning model comprises: 
training one or more subsequent versions of the second machine-learning model; and increasing a respective size of each subsequent version of the second machine-learning model, relative to a respective size of a corresponding prior version of the second machine-learning model that preceded the subsequent version (“Here, we use more loops of self-training to see if helps semantic segmentation” [pg. 21, § A.3, ¶1-2]; “As seen in Table 10, larger models tend to benefit more from the fast training schedule. For example, we achieve 2x speed up when training on a DeepLabV3+ model with WideResNet38 backbone.” [pg. 22, top para; Table 10 uses different sized student models which are subsequently larger than the previous student model]).

Regarding claim 5, Zhu teaches The method of claim 4, wherein training the third machine-learning model comprises: training the third machine-learning model based on each of the subsequent versions of the second machine-learning model (“Teacher-student learning could be iterative which means we can use the student as teacher, generate more accurate pseudo labels and then retrain another student model.” [pg. 21, § A.3, ¶1]).

Regarding claim 6, Zhu teaches The method of claim 4, wherein training the third machine-learning model comprises: adding noise to the third machine-learning model by modifying attributes of one or more items in the second combined data set using a noise function (“We employ the SGD optimizer for all the experiments… Using our fast training schedule, the batch size can increase to 64 when the crop size is smaller. The number of training epochs is set to 180 for both Cityscapes and Mapillary, 80 for CamVid and 50 for KITTI. The crop size is set to 800 for both Cityscapes and Mapillary, 640 for CamVid and 368 for KITTI due to different image resolutions. For data augmentation, we perform random spatial scaling (from 0.5 to 2.0), horizontal flipping, Gaussian blur and color jittering (0.1) during training.” [pg. 8, § 4.2 Implementation Details, ¶1; data augmentation corresponds to “adding noise” by using a noise “function” as it modifies an image by spatial scaling/blurring/jittering. Note: Zhu discloses on pg. 21, § A.3 retraining a second student model (i.e. third machine learning model), therefore adding noise would be inherent.]).
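
The reading of data augmentation as a "noise function" that modifies attributes of an item can be illustrated with a short sketch covering two of the augmentations named in the quoted passage (horizontal flipping and color jittering). Everything here is an assumption made for illustration: the image is a nested list of intensities in [0, 1], the jitter value 0.1 is treated as a brightness-factor range, and the function names are hypothetical. Random scaling and Gaussian blur are omitted.

```python
import random

def horizontal_flip(image):
    """Mirror each row of the image left to right."""
    return [row[::-1] for row in image]

def color_jitter(image, rng, strength=0.1):
    """Scale all intensities by a random factor in [1 - strength, 1 + strength]."""
    factor = 1.0 + rng.uniform(-strength, strength)
    return [[min(1.0, max(0.0, p * factor)) for p in row] for row in image]

def noise_function(image, rng):
    """Randomly flip, then jitter; returns a new image and leaves the input intact."""
    if rng.random() < 0.5:
        image = horizontal_flip(image)
    return color_jitter(image, rng)

rng = random.Random(0)
image = [[0.0, 0.5, 1.0], [0.2, 0.4, 0.6]]
noisy = noise_function(image, rng)
```

For segmentation data, a geometric change such as the flip would also have to be applied to the label mask so that image and label stay aligned; the sketch handles the image only.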
	
Regarding claim 7, Zhu teaches The method of claim 4, wherein training the second machine-learning model comprises: 
during the training, applying a noise function to a particular neural network layer of the neural network that is used to implement the second machine-learning model; 
adding noise to the second machine-learning model based on the noise function applied to the particular neural network layer; and 
modifying operations performed by the second machine-learning model as a result of adding the noise to the second machine-learning model (“We employ the SGD optimizer for all the experiments… Using our fast training schedule, the batch size can increase to 64 when the crop size is smaller. The number of training epochs is set to 180 for both Cityscapes and Mapillary, 80 for CamVid and 50 for KITTI. The crop size is set to 800 for both Cityscapes and Mapillary, 640 for CamVid and 368 for KITTI due to different image resolutions. For data augmentation, we perform random spatial scaling (from 0.5 to 2.0), horizontal flipping, Gaussian blur and color jittering (0.1) during training.” [pg. 8, § 4.2 Implementation Details, ¶1; Zhu uses data augmentation to apply noise to the input layer (therefore input data has noise applied to it) which modifies the model (i.e. adds noise to the model) during training]).

Regarding claim 9, Zhu teaches The method of claim 1, wherein modifying attributes of the one or more items in the first combined dataset comprises: modifying attributes of the one or more items in the first combined dataset to inject noise into the first combined dataset concurrent with processing the one or more items through layers of the neural network to train the second machine-learning model, wherein the neural network is used to implement the second machine-learning model (“We employ the SGD optimizer for all the experiments. We set the initial learning rate to 0.02 for training from scratch and 0.002 for finetuning. We use a polynomial learning rate policy [38], where the initial learning rate is multiplied by (1 − epoch/max_epoch)^power with a power of 0.9. Momentum and weight decay are set to 0.9 and 0.0001 respectively. Synchronized batch normalization [76] is used with a default batch size of 16. Using our fast training schedule, the batch size can increase to 64 when the crop size is smaller. The number of training epochs is set to 180 for both Cityscapes and Mapillary, 80 for CamVid and 50 for KITTI. The crop size is set to 800 for both Cityscapes and Mapillary, 640 for CamVid and 368 for KITTI due to different image resolutions. For data augmentation, we perform random spatial scaling (from 0.5 to 2.0), horizontal flipping, Gaussian blur and color jittering (0.1) during training.” [pg. 8, § 4.2, ¶1]).
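
The polynomial learning-rate policy in the passage quoted above can be written out as a worked formula. The sketch below uses the values stated in the quotation (initial rate 0.02, power 0.9, 180 epochs for Cityscapes); the function name `poly_lr` and the choice to evaluate the rate once per epoch are assumptions.

```python
def poly_lr(initial_lr, epoch, max_epoch, power=0.9):
    """Learning rate = initial_lr * (1 - epoch / max_epoch) ** power."""
    return initial_lr * (1.0 - epoch / max_epoch) ** power

# Per-epoch schedule for training from scratch over 180 epochs.
schedule = [poly_lr(0.02, epoch, 180) for epoch in range(180)]
```

The rate starts at the initial value at epoch 0 and decays monotonically toward zero as the epoch count approaches `max_epoch`.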

Regarding claim 10, Zhu teaches The method of claim 4, wherein:
the first machine-learning model is implemented using a teacher neural network model (“We present an overview of our self-training framework in Fig. 1. Given a small quantity of labeled training samples (an image and a human-annotated segmentation mask), we first train a teacher model with standard cross-entropy loss” [pg. 4, § 3.1, ¶1]); 
the second machine-learning model represents a first version of a student neural network model (“Finally, we train a student model using both human-annotated labels (real labels) and teacher-generated labels (pseudo labels).” [pg. 4, § 3.1, ¶4]); and
the third machine-learning model represents a second, different version of a student neural network model (“Teacher-student learning could be iterative which means we can use the student as teacher, generate more accurate pseudo labels and then retrain another student model” [pg. 21, § A.3, ¶1]).

Regarding claim 11, Zhu teaches The method of claim 10, wherein the first neural network and the second neural network have the same neural network architecture (Zhu discloses that previous teacher-student frameworks commonly use the same neural network architectures: “Note that the teacher-student framework is widely studied in the literature of distillation, however, it has been reported in [21] to have the limitation that teacher and student are expected to have a similar architecture to work well. We would like to point out that with our approach, the student model may have a different network architecture which does not have to be the same as that of the teacher’s” [pg. 5, top para]).

Regarding claim 12, Zhu teaches A system comprising: 
one or more processing devices; and 
one or more non-transitory machine-readable storage devices storing instructions that are executable by the one or more processing devices to cause performance of operations comprising (§3.3 discloses use of GPU machines and memory):
obtaining data specifying a trained first machine-learning model that has been trained on labeled data, wherein each of the labeled data and the first machine-learning model are un-noised when the first machine-learning model is trained (“We present an overview of our self-training framework in Fig. 1. Given a small quantity of labeled training samples (an image and a human-annotated segmentation mask), we first train a teacher model with standard cross-entropy loss” [pg. 4, §3.1, ¶2; Teacher model corresponds to the first machine-learning model]) and the first machine-learning model is a neural network (“Table 3. Our self-training method can improve the student models irrespective of backbones and network architectures” [pg. 10, Table 3 discloses “Teacher model is a DeepLab network” which is a CNN model]); 
generating first pseudo labeled data by generating a respective pseudo label for each of a plurality of items of unlabeled data by processing the items of unlabeled data using the trained first machine-learning model (“We then use the teacher model to generate pseudo labels on a large number of unlabeled images. The better the teacher model is, the higher the quality of the generated pseudo labels can be. As can be seen in Fig. 1, our teacher-generated pseudo labels have a good quality that are close to human annotations. More visualizations of pseudo labels can be found in the Appendix.” [pg. 4, §3.1, ¶3]); and 
training a second machine-learning model on a first combined dataset, wherein the first combined dataset comprises the labeled data and the first pseudo labeled data (“Finally, we train a student model using both human-annotated labels (real labels) and teacher-generated labels (pseudo labels).” [pg. 4, §3.1, ¶4, See further: Fig. 1, student model corresponds to a second machine learning model]) and the second machine-learning model is a neural network (“To be specific, we take our well-trained student model (DeepLabV3+ with WideResNet38 backbone), and finetune it on target datasets to report the performance.” [pg. 13, §4.4, ¶1; DeepLab is a CNN model]), the training comprising: 
during the training, adding noise to the second machine-learning model, comprising (i) modifying attributes of one or more items in the first combined data set (“We employ the SGD optimizer for all the experiments. We set the initial learning rate to 0.02 for training from scratch and 0.002 for finetuning. We use a polynomial learning rate policy [38], where the initial learning rate is multiplied by (1 − epoch/max_epoch)^power with a power of 0.9. Momentum and weight decay are set to 0.9 and 0.0001 respectively. Synchronized batch normalization [76] is used with a default batch size of 16. Using our fast training schedule, the batch size can increase to 64 when the crop size is smaller. The number of training epochs is set to 180 for both Cityscapes and Mapillary, 80 for CamVid and 50 for KITTI. The crop size is set to 800 for both Cityscapes and Mapillary, 640 for CamVid and 368 for KITTI due to different image resolutions. For data augmentation, we perform random spatial scaling (from 0.5 to 2.0), horizontal flipping, Gaussian blur and color jittering (0.1) during training.” [pg. 8, § 4.2, ¶1]), (ii) modifying operations performed by the second machine-learning model (“Hence, we propose to warm-up the crop size in the early epochs. We use a large crop size of 800 in the first 20 epochs, and then switch to the fast training schedules. We can see that by using coarse2fine+ with crop size warm-up, our fast training schedule is able to match the performance of baseline with 1.7x speed up. Note that, the speed up is model dependent. We can enjoy 2x speed up when training with a heavier model using the WideResNet38 backbone.” [pg. 11, Fast training schedules, ¶1; Fast training would be equivalent to modifying operations performed by a second model]), or (iii) both (note: The claim under BRI requires only one of the limitations as the limitation recites “or”, however examiner has provided a corresponding citation for both.).

Regarding claim 13, Zhu teaches The system of claim 12, comprising: generating second pseudo labeled data by generating a respective pseudo label for each of the plurality of items of unlabeled data by processing the items of unlabeled data using the trained second machine-learning model (“Teacher-student learning could be iterative which means we can use the student as teacher, generate more accurate pseudo labels” [pg. 21, § A.3, ¶1]); and 
training a third machine-learning model on a second combined dataset that includes the labeled data and the second pseudo labeled data (“and then retrain another student model. Here, we use more loops of self-training to see if helps semantic segmentation. As seen in Table 9, using a single-loop of teacher-student training is able to achieve promising results (80.0%). 2-loop obtains slightly worse results (79.9%), and 3-loop is slightly better (80.2%). In terms of a good trade-off between accuracy and resources, we only perform a single iteration of teacher-student for all experiments.” [pg. 21, § A.3, ¶1-2]).

Regarding claim 14, Zhu teaches The system of claim 13, wherein training the second machine-learning model comprises: 
training a machine-learning model that has a respective model size that is larger than a respective model size of the first machine-learning model that has been trained on the labeled data (“As seen in Table 10, larger models tend to benefit more from the fast training schedule. For example, we achieve 2x speed up when training on a DeepLabV3+ model with WideResNet38 backbone. This is because when the model is bigger, the time spent on network computation dominates the training time. If we reduce the crop size, we save a lot of computation.” [pg. 22, top para]).

Regarding claim 15, Zhu teaches The system of claim 13, wherein training the second machine-learning model comprises: 
training one or more subsequent versions of the second machine-learning model; and increasing a respective size of each subsequent version of the second machine-learning model, relative to a respective size of a corresponding prior version of the second machine-learning model that preceded the subsequent version (“Generalizing to other students Self-training is model-agnostic. It is a way to increase the number of training samples, and improve the accuracy and robustness of model itself. Here we would like to show that the pseudo labels generated by our teacher model (DeepLabV3+ with ResNeXt50 backbone), can improve the performance of (1) a heavier model (DeepLabV3+ with WideResNet38 backbone [59]); (2) a fast model (FastSCNN [49]) and (3) another widely adopted segmentation model (PSPNet with ResNet101 backbone [76])... As shown in Table 3, our self-training method can improve the student model irrespective of the backbones and network architectures, which demonstrates its great generalization capability. We want to emphasize again that for all three students, our results are not only better than their comparing baseline, but also outperforms the models pre-trained on Mapillary labeled data. In addition, our trained FastSCNN model achieves an mIoU score of 72.5% on the Cityscapes validation set, with only 1.1M parameters.” [pg. 11, Generalizing to other students, ¶1-2]).

Regarding claim 16, Zhu teaches The system of claim 15, wherein training the third machine-learning model comprises: training the third machine-learning model based on each of the subsequent versions of the second machine-learning model (“Teacher-student learning could be iterative which means we can use the student as teacher, generate more accurate pseudo labels and then retrain another student model.” [pg. 21, § A.3, ¶1]).

Regarding claim 17, Zhu teaches The system of claim 15, wherein training the third machine-learning model comprises: adding noise to the third machine-learning model by modifying attributes of one or more items in the second combined data set using a noise function (“We employ the SGD optimizer for all the experiments… Using our fast training schedule, the batch size can increase to 64 when the crop size is smaller. The number of training epochs is set to 180 for both Cityscapes and Mapillary, 80 for CamVid and 50 for KITTI. The crop size is set to 800 for both Cityscapes and Mapillary, 640 for CamVid and 368 for KITTI due to different image resolutions. For data augmentation, we perform random spatial scaling (from 0.5 to 2.0), horizontal flipping, Gaussian blur and color jittering (0.1) during training.” [pg. 8, § 4.2 Implementation Details, ¶1; data augmentation corresponds to “adding noise” by using a noise “function” as it modifies an image by spatial scaling/blurring/jittering. Note: Zhu discloses on pg. 21, § A.3 retraining a second student model (i.e. third machine learning model), therefore adding noise would be inherent.]).
	
Regarding claim 18, Zhu teaches The system of claim 15, wherein training the second machine-learning model comprises: 
during the training, applying a noise function to a particular neural network layer of the neural network that is used to implement the second machine-learning model; 
adding noise to the second machine-learning model based on the noise function applied to the particular neural network layer; and 
modifying operations performed by the second machine-learning model as a result of adding the noise to the second machine-learning model (“We employ the SGD optimizer for all the experiments… Using our fast training schedule, the batch size can increase to 64 when the crop size is smaller. The number of training epochs is set to 180 for both Cityscapes and Mapillary, 80 for CamVid and 50 for KITTI. The crop size is set to 800 for both Cityscapes and Mapillary, 640 for CamVid and 368 for KITTI due to different image resolutions. For data augmentation, we perform random spatial scaling (from 0.5 to 2.0), horizontal flipping, Gaussian blur and color jittering (0.1) during training.” [pg. 8, § 4.2 Implementation Details, ¶1; Zhu uses data augmentation to apply noise to the input layer (therefore input data has noise applied to it) which modifies the model (i.e. adds noise to the model) during training]).


Regarding claim 20, Zhu teaches One or more non-transitory machine-readable storage devices storing instructions that are executable by one or more processing devices to cause performance of operations comprising (§3.3 discloses use of GPU machines and memory):
obtaining data specifying a trained first machine-learning model that has been trained on labeled data, wherein each of the labeled data and the first machine-learning model are un-noised when the first machine-learning model is trained (“We present an overview of our self-training framework in Fig. 1. Given a small quantity of labeled training samples (an image and a human-annotated segmentation mask), we first train a teacher model with standard cross-entropy loss” [pg. 4, §3.1, ¶2; Teacher model corresponds to the first machine-learning model]) and the first machine-learning model is a neural network (“Table 3. Our self-training method can improve the student models irrespective of backbones and network architectures” [pg. 10, Table 3 discloses “Teacher model is a DeepLab network” which is a CNN model]); 
generating first pseudo labeled data by generating a respective pseudo label for each of a plurality of items of unlabeled data by processing the items of unlabeled data using the trained first machine-learning model (“We then use the teacher model to generate pseudo labels on a large number of unlabeled images. The better the teacher model is, the higher the quality of the generated pseudo labels can be. As can be seen in Fig. 1, our teacher-generated pseudo labels have a good quality that are close to human annotations. More visualizations of pseudo labels can be found in the Appendix.” [pg. 4, §3.1, ¶3]); and 
training a second machine-learning model on a first combined dataset, wherein the first combined dataset comprises the labeled data and the first pseudo labeled data (“Finally, we train a student model using both human-annotated labels (real labels) and teacher-generated labels (pseudo labels).” [pg. 4, §3.1, ¶4, See further: Fig. 1, student model corresponds to a second machine learning model]) and the second machine-learning model is a neural network (“To be specific, we take our well-trained student model (DeepLabV3+ with WideResNet38 backbone), and finetune it on target datasets to report the performance.” [pg. 13, §4.4, ¶1; DeepLab is a CNN model]), the training comprising: 
during the training, adding noise to the second machine-learning model, comprising (i) modifying attributes of one or more items in the first combined data set (“We employ the SGD optimizer for all the experiments. We set the initial learning rate to 0.02 for training from scratch and 0.002 for finetuning. We use a polynomial learning rate policy [38], where the initial learning rate is multiplied by (1 − epoch/max_epoch)^power with a power of 0.9. Momentum and weight decay are set to 0.9 and 0.0001 respectively. Synchronized batch normalization [76] is used with a default batch size of 16. Using our fast training schedule, the batch size can increase to 64 when the crop size is smaller. The number of training epochs is set to 180 for both Cityscapes and Mapillary, 80 for CamVid and 50 for KITTI. The crop size is set to 800 for both Cityscapes and Mapillary, 640 for CamVid and 368 for KITTI due to different image resolutions. For data augmentation, we perform random spatial scaling (from 0.5 to 2.0), horizontal flipping, Gaussian blur and color jittering (0.1) during training.” [pg. 8, § 4.2, ¶1]), (ii) modifying operations performed by the second machine-learning model (“Hence, we propose to warm-up the crop size in the early epochs. We use a large crop size of 800 in the first 20 epochs, and then switch to the fast training schedules. We can see that by using coarse2fine+ with crop size warm-up, our fast training schedule is able to match the performance of baseline with 1.7x speed up. Note that, the speed up is model dependent. We can enjoy 2x speed up when training with a heavier model using the WideResNet38 backbone.” [pg. 11, Fast training schedules, ¶1; Fast training would be equivalent to modifying operations performed by a second model]), or (iii) both (note: The claim under BRI requires only one of the limitations as the limitation recites “or”, however examiner has provided a corresponding citation for both.).


Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 8 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Zhu in view of Ouali et al. ("An Overview of Deep Semi-Supervised Learning", hereinafter "Ouali").

Regarding claim 8, Zhu teaches the method of claim 1, but fails to explicitly disclose wherein generating the respective pseudo label for each of the plurality of items of unlabeled data comprises: 
generating the respective pseudo label based on a maximum predicted probability for a class that corresponds to a particular item of unlabeled data in response to processing the particular item of unlabeled data using the trained first machine-learning model.
Ouali teaches generating the respective pseudo label based on a maximum predicted probability for a class that corresponds to a particular item of unlabeled data in response to processing the particular item of unlabeled data using the trained first machine-learning model (“Given an output fθ(x) for an unlabeled data point x in the form of a probability distribution over the classes, the pair (x, argmaxfθ(x)) is added to the labeled set if the probability assigned to its most likely class is higher than a predetermined threshold τ. The process of training the model using the augmented labeled set, and then set using it to label the remaining of Du is repeated until the model is incapable of producing confident predictions. Other heuristics can be used to decide which proxy labeled examples to retain, such as using the relative confidence instead of the absolute confidence, where the top n unlabeled samples predicted with the highest confidence after every epoch are added to the labeled training dataset Dl” [pg. 16, § 4.1, ¶1]).
Zhu and Ouali are both in the same field of endeavor, namely deep semi-supervised learning. Zhu discloses a teacher-student self-training method for image segmentation. Ouali discloses an overview of deep semi-supervised learning. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Zhu’s teachings to inject noise in a neural network layer using a noise function as taught by Ouali. The method could be easily adapted to the CNNs of Zhu by replacing the fully connected layers with convolutional layers, as disclosed by Ouali [pg. 7, bottom para, Ouali].
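For reference, the confidence-thresholded pseudo-labeling described in the cited passage of Ouali [pg. 16, § 4.1, ¶1] can be sketched as follows. This is a minimal illustration only, assuming the model output is a list of class probabilities; the names `pseudo_label`, `augment_labeled_set`, `predict`, and `tau` are hypothetical and are not taken from Ouali or Zhu.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def pseudo_label(probs: Sequence[float], tau: float) -> Optional[int]:
    """Return the argmax class if its probability exceeds tau, else None."""
    best = max(range(len(probs)), key=lambda k: probs[k])
    return best if probs[best] > tau else None


def augment_labeled_set(
    unlabeled: Sequence[str],
    predict: Callable[[str], Sequence[float]],  # stand-in for the trained first model
    tau: float,
) -> Tuple[List[Tuple[str, int]], List[str]]:
    """Split unlabeled items into accepted (item, pseudo label) pairs and the rest."""
    accepted: List[Tuple[str, int]] = []
    remaining: List[str] = []
    for x in unlabeled:
        label = pseudo_label(predict(x), tau)
        if label is None:
            remaining.append(x)  # prediction not confident enough; stays unlabeled
        else:
            accepted.append((x, label))  # the pair (x, argmax f(x)) joins the labeled set
    return accepted, remaining


# Toy predictions standing in for a trained model's output distributions.
toy_outputs = {"a": [0.1, 0.7, 0.2], "b": [0.4, 0.35, 0.25]}
accepted, remaining = augment_labeled_set(["a", "b"], toy_outputs.__getitem__, tau=0.6)
```

In this sketch, item "a" receives pseudo label 1 because its maximum predicted probability (0.7) exceeds the threshold, while item "b" remains unlabeled.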

Regarding claim 19, Zhu teaches the system of claim 12, but fails to explicitly disclose wherein generating the respective pseudo label for each of the plurality of items of unlabeled data comprises: 
generating the respective pseudo label based on a maximum predicted probability for a class that corresponds to a particular item of unlabeled data in response to processing the particular item of unlabeled data using the trained first machine-learning model.
Ouali teaches generating the respective pseudo label based on a maximum predicted probability for a class that corresponds to a particular item of unlabeled data in response to processing the particular item of unlabeled data using the trained first machine-learning model (“Given an output fθ(x) for an unlabeled data point x in the form of a probability distribution over the classes, the pair (x, argmaxfθ(x)) is added to the labeled set if the probability assigned to its most likely class is higher than a predetermined threshold τ. The process of training the model using the augmented labeled set, and then set using it to label the remaining of Du is repeated until the model is incapable of producing confident predictions. Other heuristics can be used to decide which proxy labeled examples to retain, such as using the relative confidence instead of the absolute confidence, where the top n unlabeled samples predicted with the highest confidence after every epoch are added to the labeled training dataset Dl” [pg. 16, § 4.1, ¶1]).
Zhu and Ouali are both in the same field of endeavor, namely deep semi-supervised learning. Zhu discloses a teacher-student self-training method for image segmentation. Ouali discloses an overview of deep semi-supervised learning. It would have been obvious to one of ordinary skill in the art before the effective filing date to modify Zhu’s teachings to inject noise in a neural network layer using a noise function as taught by Ouali. The method could be easily adapted to the CNNs of Zhu by replacing the fully connected layers with convolutional layers, as disclosed by Ouali [pg. 7, bottom para, Ouali].

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Kahn et al. (“SELF-TRAINING FOR END-TO-END SPEECH RECOGNITION”) discloses pseudo labeling in self-training for end-to-end speech recognition (see Abstract and §3, Semi-supervised self-training).
Ghamdi et al. (“SEMI-SUPERVISED TRANSFER LEARNING FOR CONVOLUTIONAL NEURAL NETWORKS FOR GLAUCOMA DETECTION”) discloses semi-supervised learning for CNNs (Abstract).
Radosavovic et al. (“Data Distillation: Towards Omni-Supervised Learning”) teaches an omni-supervised learning method with a teacher-student framework (Abstract; pg. 3, left col, bottom para, “student model”).
Liu et al. (“US 20200410388 A1”) discloses model training using a teacher-student framework with self-training.
Yalniz et al. ("Billion-scale semi-supervised learning for image classification", hereinafter "Yalniz") discloses a self-training teacher-student framework, but does not use a combined dataset with pseudo-labeled data.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MICHAEL H HOANG whose telephone number is (571)272-8491. The examiner can normally be reached Mon-Fri 8:30AM-4:30PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kakali Chaki can be reached on (571) 272-3719. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/M.H.H./           Examiner, Art Unit 2122                                                                                                                                                                                             

/KAKALI CHAKI/           Supervisory Patent Examiner, Art Unit 2122