DETAILED ACTION
1.	This office action is in response to the Application No.  filed on 04/11/2019. Claims 1-20 are presented for examination and are currently pending.

Notice of Pre-AIA or AIA Status
2.	The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.



5.	Claims 1-4, 7-9, 15-17, 19 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Lee et al. (US20190311202 filed on 4/10/2018) in view of Song (Deep Neural Network for Learning to Rank Query-Text Pairs, arXiv:1802.08988v1, 25 Feb 2018).

	Regarding claim 1, Lee teaches a method (In some embodiments, the Siamese neural network is trained using a two-stage training process [0031]), comprising:
(…receiving by one or more processing devices [0086]) and 
	a plurality of signals representing a first plurality of ANN output pairs uniquely associated with a first label and not associated with a second label different from the first label (a reference frame Fi and the corresponding ground-truth mask Mi are used as a reference frame and mask 1120…Similarly, reference frame and mask 1120 is processed by encoder subnetwork 1122 [0081] (as first ANN) Figure 11; encoder neural networks (e.g., “encoders”) [0038]; segmentation data includes a set of labels, such as pairwise labels (e.g., labels having a value indicating “yes” or “no”) indicating whether a given pixel in the image is part of an image region depicting a human figure. In some cases, labels have multiple available values, such as a set of labels indicating whether a given pixel depicts, for example, a human figure, an animal figure, or a background region [0035])
	receiving, at the processor (…receiving by one or more processing devices [0086]) 
	substantially concurrently with the first plurality of ANN output pairs, a signal representing a second plurality of ANN output pairs uniquely associated with the second label and not associated with the first label (target video frame Fj and the mask Mj−1 for target video frame Fj−1 are used as a target frame and guidance mask 1110…Target frame and guidance mask 1110 is processed by encoder subnetwork 1112 (as second ANN) [0081] Figure 11; encoder neural networks (e.g., “encoders”) [0038]; segmentation data includes a set of labels, such as pairwise labels (e.g., labels having a value indicating “yes” or “no”) indicating whether a given pixel in the image is part of an image region depicting a human figure. In some cases, labels have multiple available values, such as a set of labels indicating whether a given pixel depicts, for example, a human figure, an animal figure, or a background region [0035])
	solving, at the processor (The execution of such instructions configures the computer system to perform the specific operations shown in the figures and described herein [0104]), 
	a first activation function based on the first plurality of ANN output pairs to produce a first solved activation function (An element-wise ReLU function y=max(0, x) is applied to the feature maps. A max-pooling with, for example, a 2×2 window and stride 2, is then performed on the outputs of the ReLU function. [0066]);
	solving, at the processor (As stored, the instructions represent programmable modules that include code or data executable by a processor(s) of the computer system. The execution of such instructions configures the computer system to perform the specific operations shown in the figures and described herein [0104]), 
	a second activation function based on the second plurality of ANN output pairs to produce a second solved activation function (An element-wise ReLU function y=max(0, x) is applied to the feature maps. A max-pooling with, for example, a 2×2 window and stride 2, is then performed on the outputs of the ReLU function. [0066]);
	calculating, at the processor (As stored, the instructions represent programmable modules that include code or data executable by a processor(s) of the computer system. The execution of such instructions configures the computer system to perform the specific operations shown in the figures and described herein [0104]), 
	loss values based on the first solved activation function and the second solved activation function (loss function 1124 [0081] calculates loss values);
	generating, at the processor (As stored, the instructions represent programmable modules that include code or data executable by a processor(s) of the computer system. The execution of such instructions configures the computer system to perform the specific operations shown in the figures and described herein [0104]),
	a mask based on at least one ground truth label; and transmitting a signal, including a representation of the mask, from the processor to each of the first plurality of ANNs and the second plurality of ANNs (As described above, in some embodiments, decoder network 1114 includes a Softmax layer for generating estimated mask Mj 1116… Estimated mask Mj 1116 for video frame Fj can be compared with the ground-truth mask for video frame Fj to determine a loss function 1124, which is back-propagated through the neural network to fine tune the parameters of the neural network [0081]; The parameters of neural network 210 can be determined by, for example, back propagation of errors or loss values between pixel values of a training segmentation mask and pixel values of a segmentation mask generated by neural network 210 for a same training video frame or training image [0048]) 
	such that the first plurality of ANNs and the second plurality of ANNs collectively refine a ranking model hosted by the first plurality of ANNs and the second plurality of ANNs (Process 1100 can simulate such error accumulation. In process 1100, during each recursion, estimated mask (or Softmax output) for a previous video frame is used as the guidance mask for the current video frame. Thus, the uncertainty of the estimation is preserved and the errors can be accumulated as in the real inference scenario. This allows the use of back-propagation-through-time (BPTT) for training the recurrently-connected network. [0080])
	Lee does not explicitly teach a first plurality of artificial neural networks (ANNs), a second plurality of ANNs different from the first plurality of ANNs,
	Song teaches a first plurality of artificial neural networks (ANNs), a second plurality of ANNs different from the first plurality of ANNs, (the encoder consists of three sub-networks sharing the same weights, pg. 4, 3.1, Fig. 1)  
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Lee to incorporate the teachings of Song for the benefit of directly fitting raw query-document pairs to an existing feature-based ranking model (Song, pg. 2, first para.)
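For illustration only, the element-wise ReLU and the 2×2, stride-2 max-pooling quoted from Lee [0066] above can be sketched as follows. This is a minimal sketch and not Lee's implementation: the function names `relu` and `max_pool_2x2`, the plain nested-list representation, and the 4×4 example values are all assumptions made for this example.

```python
def relu(feature_map):
    """Element-wise ReLU, y = max(0, x), over a 2-D list of numbers."""
    return [[max(0.0, x) for x in row] for row in feature_map]


def max_pool_2x2(feature_map):
    """Max-pooling with a 2x2 window and stride 2.

    Assumes the height and width of the feature map are even.
    """
    pooled = []
    for r in range(0, len(feature_map) - 1, 2):
        row = []
        for c in range(0, len(feature_map[r]) - 1, 2):
            row.append(max(feature_map[r][c], feature_map[r][c + 1],
                           feature_map[r + 1][c], feature_map[r + 1][c + 1]))
        pooled.append(row)
    return pooled


# Hypothetical 4x4 feature map, chosen only for illustration.
example_map = [
    [-1.0, 2.0, 0.5, -3.0],
    [4.0, -2.0, -0.5, 1.0],
    [0.0, -1.0, -2.0, -4.0],
    [-3.0, -5.0, 6.0, -1.0],
]
pooled_map = max_pool_2x2(relu(example_map))
```

In this sketch the pooling is applied to the outputs of the ReLU, in the order stated in the quoted passage.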

	Regarding claim 2, Modified Lee teaches the method of claim 1, Lee teaches wherein at least one of the first activation function or the second activation function includes a softmax function (decoder network 1114 includes a Softmax layer (which applies softmax activation function) [0081]; decoder neural networks (e.g., “decoders”) [0038]);
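For illustration only, the softmax activation function referred to above can be sketched as follows. This is the standard textbook definition, not code taken from Lee; the function name `softmax` and the example scores are assumptions made for this example.

```python
import math


def softmax(scores):
    """Map a list of real-valued scores to probabilities that sum to 1."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical per-pixel class scores, chosen only for illustration.
probs = softmax([2.0, 1.0, 0.1])
```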

	Regarding claim 3, Modified Lee teaches the method of claim 1, Lee teaches wherein the calculating the loss values is performed using cross-entropy (In some implementations, the fully connected layers at the end of the network is removed, and a pixel-wise sigmoid balanced cross entropy is inserted to classify each pixel into foreground or background [0052])
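For illustration only, a pixel-wise sigmoid cross entropy with class balancing can be sketched as follows. The cited paragraph of Lee does not set out the formula, so the weighting used here (the foreground term weighted by the fraction of background pixels, and the background term by the fraction of foreground pixels) is an assumption based on the commonly used class-balanced form; the function names and inputs are likewise hypothetical.

```python
import math


def sigmoid(x):
    """Logistic sigmoid; adequate for the moderate logits used here."""
    return 1.0 / (1.0 + math.exp(-x))


def balanced_cross_entropy(logits, labels):
    """Class-balanced binary cross entropy over flattened pixels.

    logits: per-pixel scores; labels: 1 for foreground, 0 for background.
    The balancing weight beta is an assumption, not taken from Lee.
    """
    n = len(labels)
    beta = (n - sum(labels)) / n  # fraction of background pixels
    loss = 0.0
    for z, y in zip(logits, labels):
        p = sigmoid(z)
        if y == 1:
            loss -= beta * math.log(p)
        else:
            loss -= (1.0 - beta) * math.log(1.0 - p)
    return loss
```

With one foreground and one background pixel and both logits at zero, the sketch reduces to the ordinary cross entropy value log 2.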

	Regarding claim 4, Modified Lee teaches the method of claim 1, Lee teaches wherein calculating the loss values is further based on a ground truth associated with at least one of the first label or the second label (Estimated mask Mj 1116 for video frame Fj can be compared with the ground-truth mask for video frame Fj to determine a loss function 1124, which is back-propagated through the neural network to fine tune the parameters of the neural network [0081])

	Regarding claim 7, Modified Lee teaches the method of claim 1, Lee teaches wherein at least one of the first plurality of ANNs or the second plurality of ANNs includes a feed-forward ANN (encoder subnetwork 1122 [0081] (as first ANN) Figure 11; encoder neural networks (e.g., “encoders”) [0038]; At block 1420, a first encoder of the neural network encodes the target frame and the prior segmentation mask into a first feature map. As described above with respect to, for example, FIGS. 6 and 7, the first encoder includes multiple layers, such as multiple convolution layers (feed-forward neural network), activation layers, and pooling layers. In one example, the encoder is a part of a Siamese encoder network [0099])

	Regarding claim 8, Modified Lee teaches the method of claim 1, Lee teaches 
(Residual block 850 is a feedforward neural network that includes a residual mapping path [0069] (multilayer perceptron, MLP, is a feedforward neural network))

	Regarding claim 9, Modified Lee teaches the method of claim 1, Lee teaches wherein at least one of the first plurality of ANNs or the second plurality of ANNs includes a convolution network (CN) (encoder subnetwork 1122 [0081] (as first ANN) Figure 11; encoder neural networks (e.g., “encoders”) [0038]; At block 1420, a first encoder of the neural network encodes the target frame and the prior segmentation mask into a first feature map. As described above with respect to, for example, FIGS. 6 and 7, the first encoder includes multiple layers, such as multiple convolution layers, activation layers, and pooling layers. In one example, the encoder is a part of a Siamese encoder network [0099])

	Regarding claim 15, Lee teaches a method, (In some embodiments, the Siamese neural network is trained using a two-stage training process [0031]) comprising:
	receiving, at a processor (…receiving by one or more processing devices [0086]) and 
	from a plurality of artificial neural networks (ANNs), a plurality of signals representing an associated plurality of ANN output pairs associated with a label; (a reference frame Fi and the corresponding ground-truth mask Mi are used as a reference frame and mask 1120…Similarly, reference frame and mask 1120 is processed by encoder subnetwork 1122 [0081] (as first ANN) Figure 11; encoder neural networks (e.g., “encoders”) [0038]; segmentation data includes a set of labels, such as pairwise labels (e.g., labels having a value indicating “yes” or “no”) indicating whether a given pixel in the image is part of an image region depicting a human figure. In some cases, labels have multiple available values, such as a set of labels indicating whether a given pixel depicts, for example, a human figure, an animal figure, or a background region [0035])
	calculating, at the processor, (As stored, the instructions represent programmable modules that include code or data executable by a processor(s) of the computer system. The execution of such instructions configures the computer system to perform the specific operations shown in the figures and described herein [0104]), 
	loss values based on the plurality of ANN output pairs; (loss function 1124 [0081] calculates loss values);
	defining, at the processor, a mask based on at least one ground truth label; and
transmitting a signal, including a representation of the mask, from the processor to each ANN from the plurality of ANNs, (As described above, in some embodiments, decoder network 1114 includes a Softmax layer for generating estimated mask Mj 1116… Estimated mask Mj 1116 for video frame Fj can be compared with the ground-truth mask for video frame Fj to determine a loss function 1124, which is back-propagated through the neural network to fine tune the parameters of the neural network [0081]; The parameters of neural network 210 can be determined by, for example, back propagation of errors or loss values between pixel values of a training segmentation mask and pixel values of a segmentation mask generated by neural network 210 for a same training video frame or training image [0048]) 
	to update a ranking model of the plurality of ANNs. (Process 1100 can simulate such error accumulation. In process 1100, during each recursion, estimated mask (or Softmax output) for a previous video frame is used as the guidance mask for the current video frame. Thus, the uncertainty of the estimation is preserved and the errors can be accumulated as in the real inference scenario. This allows the use of back-propagation-through-time (BPTT) for training the recurrently-connected network. [0080]; Updates to neural network 210 can be pushed or pulled from server computer 205 [0049])
	Lee does not explicitly teach a plurality of artificial neural networks (ANNs).
	Song teaches a plurality of artificial neural networks (ANNs) (the encoder consists of three sub-networks sharing the same weights, pg. 4, 3.1, Fig. 1)
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Lee to incorporate the teachings of Song for the benefit of directly fitting raw query-document pairs to an existing feature-based ranking model (Song, pg. 2, first para.)

	Regarding claim 16, Modified Lee teaches the method of claim 15, Lee teaches wherein the plurality of signals representing the associated plurality of ANN output pairs associated with the label is a first plurality of signals, the mask is a first mask, and the signal is a first signal, (a reference frame Fi and the corresponding ground-truth mask Mi are used as a reference frame and mask 1120…Similarly, reference frame and mask 1120 is processed by encoder subnetwork 1122 [0081])
	the method further comprising: receiving, at the processor, (…receiving by one or more processing devices [0086]) 
	from the plurality of ANNs, and after the first plurality of signals, a second plurality of signals representing an associated plurality of ANN output pairs associated with the label; defining, at the processor and after the first mask, a second mask based on the second plurality of signals; (target video frame Fj and the mask Mj−1 for target video frame Fj−1 are used as a target frame and guidance mask 1110…Target frame and guidance mask 1110 is processed by encoder subnetwork 1112 [0081]) and
	transmitting a second signal, including the second mask, from the processor to each ANN from the plurality of ANNs, for further refinement of the ranking model of the plurality of ANNs (As described above, in some embodiments, decoder network 1114 includes a Softmax layer for generating estimated mask Mj 1116… Estimated mask Mj 1116 for video frame Fj can be compared with the ground-truth mask for video frame Fj to determine a loss function 1124, which is back-propagated through the neural network to fine tune the parameters of the neural network [0081]; The parameters of neural network 210 can be determined by, for example, back propagation of errors or loss values between pixel values of a training segmentation mask and pixel values of a segmentation mask generated by neural network 210 for a same training video frame or training image [0048])

(Residual block 850 is a feedforward neural network that includes a residual mapping path [0069] (multilayer perceptron, MLP, is a feedforward neural network))

	Regarding claim 19, Modified Lee teaches the method of claim 15, Lee teaches wherein each ANN output pair from the plurality of ANN output pairs is generated by an associated multilayer perceptron (MLP) (Residual block 850 is a feedforward neural network that includes a residual mapping path [0069] (multilayer perceptron, MLP, is a feedforward neural network))
	
	Regarding claim 20, Modified Lee teaches the method of claim 15, Lee teaches wherein the plurality of ANN output pairs is associated with at least two different labels (segmentation data includes a set of labels, such as pairwise labels (e.g., labels having a value indicating “yes” or “no”) .., labels have multiple available values, such as a set of labels indicating whether a given pixel depicts, for example, a human figure, an animal figure, or a background region [0035])

6.	Claims 5, 6, 10-14 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Lee et al. (US20190311202 filed on 4/10/2018) in view of Song (Deep Neural Network for Learning to Rank Query-Text Pairs, arXiv:1802.08988v1, 25 Feb 2018) and further in view of Varior et al. ("Gated siamese convolutional neural network architecture for human re-identification." European conference on computer vision. Springer, Cham, 2016.)

	Regarding claim 5, Modified Lee teaches the method of claim 1, Lee teaches wherein the generating the mask (As described above, in some embodiments, decoder network 1114 includes a Softmax layer for generating estimated mask Mj 1116… Estimated mask Mj 1116 for video frame Fj can be compared with the ground-truth mask for video frame Fj to determine a loss function 1124, which is back-propagated through the neural network to fine tune the parameters of the neural network [0081]);
	Song teaches the setting the indication being in response to detecting, at the processor, a lack of preference between outputs of at least one of an output pair from the first plurality of ANN output pairs or an output pair from the second plurality of ANN output pairs (we denote Q the set of queries and D the set of documents. Given q ∈ D, note Dq ⊂ D the set of documents which “match” q. For di, dj ∈ Dq, we write di ≻ dj if di is more relevance than dj (di ≺ dj is defined similarly) and di ∼ dj if there is a tie (as no preference). Note further p : Dq × Dq → {−1, 0, 1} the pairwise preference such that
p(di, dj) = −1, if di ≺ dj
             0, if di ∼ dj
            +1, if di ≻ dj
pg. 3, 3. ConvRankNet)
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Lee to incorporate the teachings of Song for the benefit of directly fitting raw query-document pairs to an existing feature-based ranking model (Song, pg. 2, first para.)
	Modified Lee does not explicitly teach includes setting an indication that a portion of the mask will not cause an adjustment to a label weighting for at least one of an output pair from the first plurality of ANN output pairs or an output pair from the second plurality of ANN output pairs, 
	Varior teaches includes setting an indication that a portion of the mask will not cause an adjustment to a label weighting for at least one of an output pair from the first plurality of ANN output pairs or an output pair from the second plurality of ANN output pairs, (The proposed matching gate (MG) receives input activations from the previous convolutional block, compares the local features along a horizontal stripe and outputs a gating mask indicating how much more emphasis should be paid to each of the local patterns, pg. 796, Matching Gate; Computing the distance between each dimension is important as the gating function must have the flexibility to smoothly turn ‘on’ or turn ‘off’ each of the extracted patterns in the feature map, pg. 797, last para.; The network learns to identify the optimal p for each dimension from the training data which results in a matching gate that is flexible in its functioning. Alongside learning an optimal p, the network also learns the parameter w and the bias in Eq. (1) to summarize the features along a horizontal stripe, pg. 798, last para.)
(Varior, pg. 798, last para.)
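For illustration only, the pairwise preference p quoted from Song above, together with a simple indicator that is zero for tied pairs, can be sketched as follows. This is a minimal sketch under stated assumptions and is not the implementation of Song, of Varior, or of the claimed method: the use of numeric relevance labels, the function names `pairwise_preference` and `preference_mask`, and the example labels are all hypothetical.

```python
def pairwise_preference(rel_i, rel_j):
    """Return +1 if d_i is more relevant than d_j, -1 if less, 0 on a tie.

    rel_i and rel_j are assumed to be numeric relevance labels.
    """
    if rel_i > rel_j:
        return 1
    if rel_i < rel_j:
        return -1
    return 0


def preference_mask(relevances):
    """Build a mask with 1 where a pair has a preference and 0 where it is tied.

    Hypothetical helper: a zero entry marks a pair that would contribute
    no adjustment; a one entry marks a pair that would.
    """
    n = len(relevances)
    return [[1 if pairwise_preference(relevances[i], relevances[j]) != 0 else 0
             for j in range(n)]
            for i in range(n)]


# Hypothetical relevance labels for three documents matching one query.
labels = [2, 0, 2]
mask = preference_mask(labels)
```

In this example the first and third documents are tied, so the corresponding mask entries are zero, while every pair involving the second document has a preference.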

	Regarding claim 6, Modified Lee teaches the method of claim 1, Lee teaches wherein the generating the mask (As described above, in some embodiments, decoder network 1114 includes a Softmax layer for generating estimated mask Mj 1116… Estimated mask Mj 1116 for video frame Fj can be compared with the ground-truth mask for video frame Fj to determine a loss function 1124, which is back-propagated through the neural network to fine tune the parameters of the neural network [0081]); 
	Song teaches the setting the indication being in response to detecting, at the processor, a preference between outputs of at least one of an output pair from the first plurality of ANN output pairs or an output pair from the second plurality of ANN output pairs (we denote Q the set of queries and D the set of documents. Given q ∈ D, note Dq ⊂ D the set of documents which “match” q. For di, dj ∈ Dq, we write di ≻ dj if di is more relevance than dj (di ≺ dj is defined similarly) and di ∼ dj if there is a tie (as no preference). Note further p : Dq × Dq → {−1, 0, 1} the pairwise preference such that
p(di, dj) = −1, if di ≺ dj
             0, if di ∼ dj
            +1, if di ≻ dj
pg. 3, 3. ConvRankNet)
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Lee to incorporate the teachings of Song for the benefit of directly fitting raw query-document pairs to an existing feature-based ranking model (Song, pg. 2, first para.)
	Modified Lee does not explicitly teach includes setting an indication that a portion of the mask will cause an adjustment to a label weighting for at least one of an output pair from the first plurality of ANN output pairs or an output pair from the second plurality of ANN output pairs,
	Varior teaches includes setting an indication that a portion of the mask will cause an adjustment to a label weighting for at least one of an output pair from the first plurality of ANN output pairs or an output pair from the second plurality of ANN output pairs, (The proposed matching gate (MG) receives input activations from the previous convolutional block, compares the local features along a horizontal stripe and outputs a gating mask indicating how much more emphasis should be paid to each of the local patterns, pg. 796, Matching Gate; Computing the distance between each dimension is important as the gating function must have the flexibility to smoothly turn ‘on’ or turn ‘off’ each of the extracted patterns in the feature map, pg. 797, last para.; The network learns to identify the optimal p for each dimension from the training data which results in a matching gate that is flexible in its functioning. Alongside learning an optimal p, the network also learns the parameter w and the bias in Eq. (1) to summarize the features along a horizontal stripe, pg. 798, last para.)
	The same motivation to combine as dependent claim 5 applies here.

	Regarding claim 10, Lee teaches an apparatus, comprising: a processor; and
a memory operably coupled to the processor and storing processor-executable instructions to: (Computer system 1900 includes at least a processor 1902, a memory 1904, a storage device 1906 [0124], Fig. 19)
	receive, at the processor, (…receiving by one or more processing devices [0086]) 
	a plurality of artificial neural network (ANN) output pairs, each ANN output pair from the plurality of ANN output pairs associated with a different label from a plurality of labels; (target video frame Fj and the mask Mj−1 for target video frame Fj−1 are used as a target frame and guidance mask 1110…Target frame and guidance mask 1110 is processed by encoder subnetwork 1112 (as second ANN) [0081] Figure 11; encoder neural networks (e.g., “encoders”) [0038]; segmentation data includes a set of labels, such as pairwise labels (e.g., labels having a value indicating “yes” or “no”) indicating whether a given pixel in the image is part of an image region depicting a human figure. In some cases, labels have multiple available values, such as a set of labels indicating whether a given pixel depicts, for example, a human figure, an animal figure, or a background region [0035])
	generate, at the processor, (As stored, the instructions represent programmable modules that include code or data executable by a processor(s) of the computer system. The execution of such instructions configures the computer system to perform the specific operations shown in the figures and described herein [0104]),
	a mask based on the plurality of ANN output pairs, the generating including:
for each ANN output pair from the plurality of ANN output pairs: (As described above, in some embodiments, decoder network 1114 includes a Softmax layer for generating estimated mask Mj 1116… Estimated mask Mj 1116 for video frame Fj can be compared with the ground-truth mask for video frame Fj to determine a loss function 1124, which is back-propagated through the neural network to fine tune the parameters of the neural network [0081]; The parameters of neural network 210 can be determined by, for example, back propagation of errors or loss values between pixel values of a training segmentation mask and pixel values of a segmentation mask generated by neural network 210 for a same training video frame or training image [0048]) 
	transmit a signal, including the mask, from the processor to each of the first ANN and the second ANN, such that the first ANN and the second ANN collectively (As described above, in some embodiments, decoder network 1114 includes a Softmax layer for generating estimated mask Mj 1116… Estimated mask Mj 1116 for video frame Fj can be compared with the ground-truth mask for video frame Fj to determine a loss function 1124, which is back-propagated through the neural network to fine tune the parameters of the neural network [0081]; The parameters of neural network 210 can be determined by, for example, back propagation of errors or loss values between pixel values of a training segmentation mask and pixel values of a segmentation mask generated by neural network 210 for a same training video frame or training image [0048]) 
	update a ranking model hosted by the first ANN and the second ANN (Process 1100 can simulate such error accumulation. In process 1100, during each recursion, estimated mask (or Softmax output) for a previous video frame is used as the guidance mask for the current video frame. Thus, the uncertainty of the estimation is preserved and the errors can be accumulated as in the real inference scenario. This allows the use of back-propagation-through-time (BPTT) for training the recurrently-connected network. [0080]; Updates to neural network 210 can be pushed or pulled from server computer 205 [0049])
	Lee does not explicitly teach detecting whether a first ANN output of that ANN output pair is preferred over a second ANN output of that ANN output pair; in response to detecting a lack of preference between the first ANN output and the second ANN output, setting an indication that a portion of the mask will not cause an adjustment to a label weighting for that ANN output pair, and in response to detecting a preference between the first ANN output and the second ANN output, setting an indication that the portion of the mask will cause an adjustment to the label weighting for that ANN output pair; and
	Song teaches detecting whether a first ANN output of that ANN output pair is preferred over a second ANN output of that ANN output pair in response to detecting a lack of preference between the first ANN output and the second ANN output pair (we denote Q the set of queries and D the set of documents. Given q ∈ D, note Dq ⊂ D the set of documents which “match” q. For di, dj ∈ Dq, we write di ≻ dj if di is more relevance than dj (di ≺ dj is defined similarly) and di ∼ dj if there is a tie (as no preference). Note further p : Dq × Dq → {−1, 0, 1} the pairwise preference such that
p(di, dj) = −1, if di ≺ dj
             0, if di ∼ dj
            +1, if di ≻ dj
pg. 3, 3. ConvRankNet)
	in response to detecting a preference between the first ANN output and the second ANN output (we denote Q the set of queries and D the set of documents. Given q ∈ D, note Dq ⊂ D the set of documents which “match” q. For di, dj ∈ Dq, we write di ≻ dj if di is more relevance than dj (di ≺ dj is defined similarly) and di ∼ dj if there is a tie (as no preference). Note further p : Dq × Dq → {−1, 0, 1} the pairwise preference such that
p(di, dj) = −1, if di ≺ dj
             0, if di ∼ dj
            +1, if di ≻ dj
pg. 3, 3. ConvRankNet)
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Lee to incorporate the teachings of Song for the benefit of directly fitting raw query-document pairs to an existing feature-based ranking model (Song, pg. 2, first para.)
	Varior teaches setting an indication that a portion of the mask will not cause an adjustment to a label weighting for that ANN output pair (The proposed matching gate (MG) receives input activations from the previous convolutional block, compares the local features along a horizontal stripe and outputs a gating mask indicating how much more emphasis should be paid to each of the local patterns, pg. 796, Matching Gate; Computing the distance between each dimension is important as the gating function must have the flexibility to smoothly turn ‘on’ or turn ‘off’ each of the extracted patterns in the feature map, pg. 797, last para.; The network learns to identify the optimal p for each dimension from the training data which results in a matching gate that is flexible in its functioning. Alongside learning an optimal p, the network also learns the parameter w and the bias in Eq. (1) to summarize the features along a horizontal stripe, pg. 798, last para.)
	setting an indication that the portion of the mask will cause an adjustment to the label weighting for that ANN output pair; (The proposed matching gate (MG) receives input activations from the previous convolutional block, compares the local features along a horizontal stripe and outputs a gating mask indicating how much more emphasis should be paid to each of the local patterns, pg. 796, Matching Gate; Computing the distance between each dimension is important as the gating function must have the flexibility to smoothly turn ‘on’ or turn ‘off’ each of the extracted patterns in the feature map, pg. 797, last para.; The network learns to identify the optimal p for each dimension from the training data which results in a matching gate that is flexible in its functioning. Alongside learning an optimal p, the network also learns the parameter w and the bias in Eq. (1) to summarize the features along a horizontal stripe, pg. 798, last para.)
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Lee to incorporate the teachings of Varior (Varior, pg. 798, last para.)

	Regarding claim 11, Modified Lee teaches the apparatus of claim 10, Lee teaches wherein the generating the mask includes solving activation functions for each ANN output pair from the plurality of ANN output pairs. (As described above, in some embodiments, decoder network 1114 includes a Softmax layer for generating estimated mask Mj 1116… Estimated mask Mj 1116 for video frame Fj can be compared with the ground-truth mask for video frame Fj to determine a loss function 1124, which is back-propagated through the neural network to fine tune the parameters of the neural network [0081]; An element-wise ReLU function y=max(0, x) is applied to the feature maps. A max-pooling with, for example, a 2×2 window and stride 2, is then performed on the outputs of the ReLU function. [0066])

	Regarding claim 12, Modified Lee teaches the apparatus of claim 10, Lee teaches wherein the generating the mask is based on a ground truth associated with a label from the plurality of labels (As described above, in some embodiments, decoder network 1114 includes a Softmax layer for generating estimated mask Mj 1116… Estimated mask Mj 1116 for video frame Fj can be compared with the ground-truth mask for video frame Fj to determine a loss function 1124, which is back-propagated through the neural network to fine tune the parameters of the neural network [0081])

(Feature maps 730 extracted by the reference frame encoder subnetwork from the reference frame and the ground-truth mask, and feature maps 732 extracted by the target frame encoder subnetwork from the target frame and the estimated mask for the previous mask are combined, such as concatenated along the channel axis or by pixel-wise summation, and fed to global convolution block 740. Global convolution block 740 performs global feature matching between the reference frame and the target frame to localize the target object in the target frame [0067]; the first encoder and the second encoder form a Siamese encoder network [0100])
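	For illustration only, the two ways of combining feature maps quoted from Lee [0067] (concatenation along the channel axis, or pixel-wise summation) may be sketched as follows. The function names, the channel-first nested-list layout, and the sample values are assumptions made for this sketch and do not appear in Lee.

```python
def concat_channels(ref_maps, tgt_maps):
    """Concatenate along the channel axis.

    Assumption: each argument is a list of channels, and each channel
    is an H x W nested list with the same H and W.
    """
    return ref_maps + tgt_maps


def pixelwise_sum(ref_maps, tgt_maps):
    """Pixel-wise summation; both inputs must share the same shape."""
    return [
        [
            [a + b for a, b in zip(ref_row, tgt_row)]
            for ref_row, tgt_row in zip(ref_ch, tgt_ch)
        ]
        for ref_ch, tgt_ch in zip(ref_maps, tgt_maps)
    ]


# Hypothetical single-channel 2x2 feature maps for the two encoder outputs.
ref = [[[1.0, 2.0], [3.0, 4.0]]]
tgt = [[[0.5, 0.5], [1.0, 1.0]]]

concatenated = concat_channels(ref, tgt)  # channel count doubles: 2 channels
summed = pixelwise_sum(ref, tgt)          # channel count unchanged: 1 channel
print(len(concatenated), summed)  # 2 [[[1.5, 2.5], [4.0, 5.0]]]
```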

	Regarding claim 14, Modified Lee teaches the apparatus of claim 10, Lee teaches wherein each label from the plurality of labels (segmentation data includes a set of labels, such as pairwise labels (e.g., labels having a value indicating “yes” or “no”) … labels have multiple available values, such as a set of labels indicating whether a given pixel depicts, for example, a human figure, an animal figure, or a background region [0035])
	Song teaches is associated with a portion of a contract (Siamese Convolutional Neural Network (CNN) encoder, a module designed to, given query q and two documents di, dj, extract automatically feature vectors Φ(q, di) and Φ(q, dj ) and (2) RankNet, a successful three-layer neural network-based pairwise ranking Model, pg. 2, last para.)
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Lee to incorporate the teachings of Song for the benefit of directly fitting raw query-document pairs to an existing feature-based ranking model (Song, pg. 2, first para.)

	Regarding claim 18, Modified Lee teaches the method of claim 15, Lee teaches wherein the generating the mask (As described above, in some embodiments, decoder network 1114 includes a Softmax layer for generating estimated mask Mj 1116… Estimated mask Mj 1116 for video frame Fj can be compared with the ground-truth mask for video frame Fj to determine a loss function 1124, which is back-propagated through the neural network to fine tune the parameters of the neural network [0081])
	Modified Lee does not explicitly teach includes defining a portion of the mask such that no adjustment to a weighting of the label is applied
	Varior teaches includes defining a portion of the mask such that no adjustment to a weighting of the label is applied (The proposed matching gate (MG) receives input activations from the previous convolutional block, compares the local features along a horizontal stripe and outputs a gating mask indicating how much more emphasis should be paid to each of the local patterns, pg. 796, Matching Gate; Computing the distance between each dimension is important as the gating function must have the flexibility to smoothly turn ‘on’ or turn ‘off’ each of the extracted patterns in the feature map, pg. 797, last para.; The network learns to identify the optimal p for each dimension from the training data which results in a matching gate that is flexible in its functioning. Alongside learning an optimal p, the network also learns the parameter w and the bias in Eq. (1) to summarize the features along a horizontal stripe, pg. 798, last para.)
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Lee to incorporate the teachings of Varior for the benefit of facilitating end-to-end learning strategy in deep networks (Varior, pg. 798, last para.)

Conclusion
	Any inquiry concerning this communication or earlier communications from the examiner should be directed to MORIAM MOSUNMOLA GODO whose telephone number is (571)272-8670. The examiner can normally be reached Monday-Friday 7:30am-5:30pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Li B. Zhen can be reached on (571)272-3768. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, 


/M.G./Examiner, Art Unit 2121                     



/Li B. Zhen/Supervisory Patent Examiner, Art Unit 2121