DETAILED ACTION
1.	This office action is in response to the Application No. 15884125 filed on 07/28/2022. Claims 1-20 are presented for examination and are currently pending.

Notice of Pre-AIA or AIA Status
2.	The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments
3.	Upon further review, the previous Office action dated 04/09/2022 is hereby withdrawn in view of the Applicant's remarks.


Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

4.	Claims 1-3, 6 and 9-17 are rejected under 35 U.S.C. 103 as being unpatentable over Gehring et al. (Convolutional sequence to sequence learning. International Conference on Machine Learning 2017 Jul 17 (pp. 1243-1252). PMLR.) in view of Wang et al. (US10552968) and further in view of Madhavaraj et al. (US10490182 filed 12/29/2016).

	Regarding claim 1, Gehring teaches a method for sequence-to-sequence prediction using a neural network model (In this paper we propose an architecture for sequence to sequence modeling that is entirely convolutional (pg. 1, right col, second to the last para.); We add the conditional inputs computed by the attention (center right) to the decoder states which then predict the target words (bottom right), Fig.  1, pg. 3), comprising: 
	generating an encoded representation based on an input sequence using an encoder of the neural network model and predicting an output sequence based on the encoded representation using a decoder of the neural network model (Both encoder and decoder networks share a simple block structure that computes intermediate states based on a fixed number of input elements. We denote the output of the l-th block as hl = (hl1,..., hln) for the decoder network, and zl = (zl1,...,zlm) for the encoder network, pg. 2, right col, 3.2. Convolutional Block Structure; Position embeddings are useful in our architecture since they give our model a sense of which portion of the sequence in the input or output it is currently dealing with, pg. 2, right col, 3.1. Position Embeddings; We add the conditional inputs computed by the attention (center right) to the decoder states which then predict the target words (bottom right), pg. 3, right col, Fig. 1); 
	wherein the neural network model includes a plurality of model parameters learned according to a machine learning process (We also use weight normalization for all layers except for lookup tables, pg. 5, left col., second para.)
	wherein at least one of the encoder or the decoder includes a branched attention layer (We introduce a separate attention mechanism for each decoder layer. To compute the attention, we combine the current decoder state hli with an embedding of the previous target element gi: … For decoder layer l the attention alij of state i and source element j is computed as a dot-product between the decoder state summary dl i and each output zuj of the last encoder block u:; pg. 3, left col., 3.3. Multi-step Attention) 
	Gehring does not explicitly teach comprising a plurality of branches arranged in parallel, each branch of the branched attention layer including an interdependent scaling node configured to scale an intermediate representation of the branch by a learned scaling parameter, the learned scaling parameter, depending on one or more other learned scaling parameters of one or more other interdependent scaling nodes of one or more other branches of the branched attention layer.
	Wang teaches a branched layer (functional architecture 700, Fig. 7)
	comprising a plurality of branches arranged in parallel, (705, 715, 725, Fig. 7 as first branch and 710, 720, 730, Fig. 7 as second branch)
	each branch of the branched attention layer including an interdependent scaling node (scale attention net 715 (for the first branch) and scale attention net 720 (for the second branch), Fig. 7)
	configured to scale an intermediate representation of the branch (scale attention net 715 multiplies various intermediate representations (Fig. 8); for example, the feature data x1 840 is combined with the attention map x1 835 via a multiplication operation 84)
	by a learned scaling parameter, the learned scaling parameter (all combinations of feature and attention data is combined in a weighted sum via addition operation 843 to produce dense feature 847, which is a collection of all dense features for all pixels, col 13, lines 13-16, Fig. 8. The Examiner notes that weight is a learned scaling parameter)
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Gehring to incorporate the teachings of Wang for the benefit of tracking one or more objects (e.g., car), moving within images captured in sequence (e.g., video) implemented using one or more convolutional neural networks trained on images of different sizes, where each image of a given size has texture data emphasized via an attention map (Wang, col 2, lines 45-50)
	Madhavaraj teaches depending on one or more other learned scaling parameters of one or more other interdependent scaling nodes of one or more other branches of the branched attention layer (ANN weights 160 (Block 350) depends on ANN weights 160 (Block 310, Fig. 4); Essentially, the weights in the scaled ANN are equivalent to products k1wji 1, col 6, lines 15-16; For example, if the weights at the first layer are multiplied by a scale factor k1 and the weights at the second layer are multiplied by k2, the standard learning rate that would have been used for the unmodified weights is essentially modified to yield equivalent update steps according to the scale factors k1 and k2, col 6, lines 7-12;  Each layer includes a number of nodes, denoted Nn for the number of nodes at layer n, col 2, lines 12-13)
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Gehring to incorporate the teachings of Madhavaraj for the benefit of initialization and learning rate adjustment for artificial neural networks (ANNs) (Madhavaraj, col 1, lines 8-9).
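	For illustration only, the multi-step attention passage of Gehring quoted in the claim 1 analysis above can be sketched as follows. This is a minimal sketch under stated assumptions, not the reference's implementation: the function names (`state_summary`, `attend`) are hypothetical, the learned linear projection of the decoder state is omitted, and the softmax normalization of the dot-product scores is not part of the quoted excerpt and is included as an assumption.

```python
import math


def softmax(scores):
    # Numerically stable normalization of raw scores into weights that sum to 1.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def state_summary(h_i, g_i):
    # Hypothetical simplification: combine the decoder state h_i with the
    # embedding g_i of the previous target element by plain addition
    # (no learned projection is modeled here).
    return [h + g for h, g in zip(h_i, g_i)]


def attend(h_i, g_i, encoder_outputs):
    # Dot product between the decoder state summary d_i and each output z_j
    # of the last encoder block, followed by a weighted sum of the z_j.
    d_i = state_summary(h_i, g_i)
    scores = [sum(d * z for d, z in zip(d_i, z_j)) for z_j in encoder_outputs]
    weights = softmax(scores)
    dim = len(encoder_outputs[0])
    c_i = [
        sum(w * z_j[k] for w, z_j in zip(weights, encoder_outputs))
        for k in range(dim)
    ]
    return weights, c_i
```

	Calling `attend([1.0, 0.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])` returns one attention weight per source element and a conditional input of the same dimension as the encoder outputs.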

	Regarding claim 2, Modified Gehring teaches the method of claim 1, Gehring teaches wherein the at least one of the encoder or the decoder includes one or more additional branched attention layers arranged sequentially with the branched attention layer (Multi-step attention in all five decoder layers, pg. 7, Table 5. “Examiner note: Attn Layers 1,2,3,4,5 are arranged sequentially”)

	Regarding claim 3, Modified Gehring teaches the method of claim 1, Gehring teaches wherein the branched attention layer further includes an aggregation node configured to aggregate a plurality of branch output representations corresponding to each branch of the branched attention layer (The conditional input cli generated by the attention is a weighted sum of m vectors (2) and we counteract a change in variance through scaling by m√(1/m); we multiply by m to scale up the inputs to their original size, assuming the attention scores are uniformly distributed, … For convolutional decoders with multiple attention, we scale the gradients for the encoder layers by the number of attention mechanisms we use; pg. 4, left col., second para.)
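	For illustration only, the scaling described in the passage quoted for claim 3 can be sketched as below. The function name `scaled_conditional_input` is hypothetical. The sketch assumes uniform attention weights of 1/m, as the quoted passage does. Under that assumption, an average of m independent unit-variance vectors has variance 1/m, and multiplying by m√(1/m) = √m restores unit variance.

```python
import math


def scaled_conditional_input(weights, vectors):
    # Weighted sum of m vectors, followed by the variance-restoring factor
    # m * sqrt(1/m) described in the quoted passage.
    m = len(vectors)
    dim = len(vectors[0])
    summed = [sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(dim)]
    scale = m * math.sqrt(1.0 / m)
    return [scale * x for x in summed]


# Four identical one-dimensional vectors with uniform weights 1/4:
# the weighted sum is 1.0 and the scale factor is 4 * sqrt(1/4) = 2.0.
result = scaled_conditional_input([0.25] * 4, [[1.0]] * 4)
```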

	Regarding claim 5, Modified Gehring teaches the method of claim 1, Gehring teaches wherein the learned scaling parameter and the one or more other learned scaling parameters are subject to at least one joint constraint (The conditional input cl i generated by the attention is a weighted sum of m vectors (2) and we counteract a change in variance through scaling by m√(1/m); we multiply by m to scale up the inputs to their original size, assuming the attention scores are uniformly distributed, pg. 4, left col, second para.)

	Regarding claim 6, Modified Gehring teaches the method of claim 1, Gehring teaches wherein the learned scaling parameter and the one or more other learned scaling parameters are values between zero and one and add up to one (Specifically, we pad the input by k-1 elements on both the left and right side by zero vectors, and then remove k elements from the end of the convolution output, pg. 3, left col, first para. and col. 2, lines 19-25 of Madhavaraj)
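	For illustration only, the claim 6 limitation (scaling parameters that lie between zero and one and add up to one) can be sketched with a softmax over unconstrained per-branch values. This parameterization is an assumption made for the sketch. It is not drawn from the claims or from any cited reference, and the names `normalized_scales` and `combine_branches` are hypothetical.

```python
import math


def normalized_scales(raw_params):
    # Assumed parameterization: softmax maps unconstrained values to scales
    # that are each in (0, 1) and sum to 1, so changing one raw value
    # changes every scale.
    peak = max(raw_params)
    exps = [math.exp(r - peak) for r in raw_params]
    total = sum(exps)
    return [x / total for x in exps]


def combine_branches(raw_params, branch_outputs):
    # Scale each branch output by its scale and sum across branches.
    scales = normalized_scales(raw_params)
    dim = len(branch_outputs[0])
    return [
        sum(s * out[k] for s, out in zip(scales, branch_outputs))
        for k in range(dim)
    ]
```

	With two branches and equal raw values, each scale is 0.5. Raising the first raw value lowers the second scale, which shows the interdependence between the scales in this assumed form.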

	Regarding claim 9, Modified Gehring teaches the method of claim 1, Gehring teaches wherein the machine learning process includes projecting the plurality of model parameters onto a constraint set at each training step of the machine learning process (We train our convolutional models with Nesterov’s accelerated gradient method, using a momentum value of 0.99 and renormalize gradients if their norm exceeds 0.1. We use a learning rate of 0.25 and once the validation perplexity stops improving, we reduce the learning rate by an order of magnitude after each epoch until it falls below 10^-4, … If the threshold is exceeded, we simply split the batch until the threshold is met and process the parts separately, pg. 5, left col, first para.)
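	For illustration only, the gradient renormalization described in the passage quoted for claim 9 ("renormalize gradients if their norm exceeds 0.1") can be sketched as follows. The function name `renormalize` and the choice of the Euclidean norm are assumptions, since the quoted passage does not name the norm.

```python
import math


def renormalize(grad, max_norm=0.1):
    # Rescale the gradient so its Euclidean norm does not exceed max_norm.
    # Gradients already within the bound are returned unchanged.
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    factor = max_norm / norm
    return [g * factor for g in grad]
```

	For example, a gradient of [3.0, 4.0] has norm 5.0 and is rescaled to a vector of norm 0.1 pointing in the same direction.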

	Regarding claim 10, Modified Gehring teaches the method of claim 1, Gehring teaches wherein the machine learning process includes training the learned scaling parameter and the one or more other learned scaling parameters at a higher learning rate than other model parameters among the plurality of model parameters during a warm-up stage of the machine learning process (We use a learning rate of 0.25 and once the validation perplexity stops improving, we reduce the learning rate by an order of magnitude after each epoch until it falls below 10^-4, pg. 5, left col, first para.)
	
	Regarding claim 11, Modified Gehring teaches the method of claim 1, Gehring teaches wherein the machine learning process includes fixing the learned scaling parameter and the one or more other learned scaling parameters during a wind-down stage of the machine learning process (We use a learning rate of 0.25 and once the validation perplexity stops improving, we reduce the learning rate by an order of magnitude after each epoch until it falls below 10^-4, pg. 5, left col, first para.)
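	For illustration only, the learning rate schedule quoted for claims 10 and 11 can be sketched as below. The function name `lr_schedule` is hypothetical. Two assumptions go beyond the quoted text: "stops improving" is read as the first epoch whose validation perplexity is not lower than the best seen so far, and training is assumed to end once the rate falls below the threshold.

```python
def lr_schedule(val_perplexities, lr=0.25, min_lr=1e-4):
    # Returns the learning rate used in each epoch.
    # Once validation perplexity stops improving, the rate is divided by 10
    # after every epoch until it falls below min_lr.
    history = []
    best = float("inf")
    decaying = False
    for ppl in val_perplexities:
        history.append(lr)
        if ppl < best:
            best = ppl
        else:
            decaying = True  # once triggered, decay continues every epoch
        if decaying:
            lr /= 10.0
            if lr < min_lr:
                break  # assumed stopping point
    return history


rates = lr_schedule([10.0, 9.0, 9.5, 9.4, 9.3, 9.2, 9.1])
```

	In this run the rate stays at 0.25 for the first three epochs, then drops by a factor of ten per epoch, and the last epoch trained uses a rate of about 0.00025.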

	Regarding claim 12, Gehring teaches a system for sequence-to-sequence prediction (In this paper we propose an architecture for sequence to sequence modeling that is entirely convolutional (pg. 1, right col, second to the last para.); We add the conditional inputs computed by the attention (center right) to the decoder states which then predict the target words (bottom right), Fig.  1, pg. 3), comprising:
	an encoder stage that generates an encoded representation based on an input sequence; and a decoder stage that predicts an output sequence based on the encoded representation, (Both encoder and decoder networks share a simple block structure that computes intermediate states based on a fixed number of input elements. We denote the output of the l-th block as hl = (hl1,..., hln) for the decoder network, and zl = (zl1,...,zlm) for the encoder network, pg. 2, right col, 3.2. Convolutional Block Structure; Position embeddings are useful in our architecture since they give our model a sense of which portion of the sequence in the input or output it is currently dealing with, pg. 2, right col, 3.1. Position Embeddings; We add the conditional inputs computed by the attention (center right) to the decoder states which then predict the target words (bottom right), Fig. 1, pg. 3)
	wherein at least one of the encoder stage and the decoder stage includes a branched attention layer, (We also use attention in every decoder layer and demonstrate that each attention layer only adds a negligible amount of overhead, pg. 1, right col., second to the last para.) 
	an aggregation node that aggregates a plurality of branch output representations generated by each of the plurality of branches (then concatenate their outputs as an input to a feed-forward neural network (FNN), pg. 3, left col, first para.)
	Gehring does not explicitly teach a memory storing a plurality of processor-executable instructions; and a processor reading and executing the processor-executable instructions from the memory to perform operations and comprising: a plurality of branches arranged in parallel, each branch of the branched attention layer including an interdependent scaling node configured to scale an intermediate representation of the branch by a learned scaling parameter, the learned scaling parameter, depending on one or more other learned scaling parameters of one or more other interdependent scaling nodes of one or more other branches among the plurality of branches;
	Wang teaches a memory storing a plurality of processor-executable instructions; and a processor reading and executing the processor-executable instructions from the memory to perform operations (The software architecture 1606 may execute on hardware such as a machine 1600 of FIG. 16 that includes, among other things, processors, memory, … The executable instructions 1604 represent the executable instructions of the software architecture 1606, including implementation of the methods, components, and so forth described herein. The hardware layer 1652 also includes a memory/storage 1656, which also has the executable instructions 1604, col 15, lines 10-22)
	a branched layer (functional architecture 700, Fig. 7)
	including a plurality of branches arranged in parallel, (705, 715, 725, Fig. 7 as first branch and 710, 720, 730, Fig. 7 as second branch)
	each branch of the branched attention layer including an interdependent scaling node (scale attention net 715 (for the first branch) and scale attention net 720 (for the second branch), Fig. 7)
	that scales an intermediate representation of the branch (scale attention net 715 multiplies various intermediate representations (Fig. 8); for example, the feature data x1 840 is combined with the attention map x1 835 via a multiplication operation 84)
	by a learned scaling parameter, the learned scaling parameter (all combinations of feature and attention data is combined in a weighted sum via addition operation 843 to produce dense feature 847, which is a collection of all dense features for all pixels, col 13, lines 13-16, Fig. 8. The Examiner notes that weight is a learned scaling parameter)
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Gehring to incorporate the teachings of Wang for the benefit of tracking one or more objects (e.g., car), moving within images captured in sequence (e.g., video) implemented using one or more convolutional neural networks trained on images of different sizes, where each image of a given size has texture data emphasized via an attention map (Wang, col 2, lines 45-50)
	Madhavaraj teaches depending on one or more other learned scaling parameters of one or more other interdependent scaling nodes of one or more other branches among the plurality of branches (ANN weights 160 (Block 350) depends on ANN weights 160 (Block 310, Fig. 4); Essentially, the weights in the scaled ANN are equivalent to products k1wji 1, col 6, lines 15-16; For example, if the weights at the first layer are multiplied by a scale factor k1 and the weights at the second layer are multiplied by k2, the standard learning rate that would have been used for the unmodified weights is essentially modified to yield equivalent update steps according to the scale factors k1 and k2, col 6, lines 7-12;  Each layer includes a number of nodes, denoted Nn for the number of nodes at layer n, col 2, lines 12-13)
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Gehring to incorporate the teachings of Madhavaraj for the benefit of initialization and learning rate adjustment for artificial neural networks (ANNs) (Madhavaraj, col 1, lines 8-9).

	Regarding claim 13, Modified Gehring teaches the system of claim 12, Gehring teaches wherein the input sequence corresponds to a first text sequence in a first language (The English source sentence is encoded (top), pg. 3, right col, Fig. 1, “Examiner note: English source sentence as first text sequence in a first language”) and
	the output sequence corresponds to a second text sequence in a second language (we compute all attention values for the four German target words (center) simultaneously, … then predict the target words (bottom right), pg. 3, right col, Fig. 1. “Examiner note: German target words as second text sequence in a second language”)

	Regarding claim 14, Modified Gehring teaches the system of claim 12, Gehring teaches wherein each branch among the plurality of branches further includes a parameterized attention network, the parameterized attention network evaluating a scaled dot-product attention based on a layer input representation (Our attentions are just dot products between decoder context representations (bottom left) and encoder representations. We add the conditional inputs computed by the attention (center right) to the decoder states which then predict the target words (bottom right), pg. 3, right col, Fig. 1)
	
	Regarding claim 15, Modified Gehring teaches the system of claim 14, Wang teaches wherein each branch among the plurality of branches further includes a parameterized transformation network, the parameterized transformation network including a feed-forward neural network (FIG. 10 illustrates an architecture for an attention net 1000, according to some example embodiments. In the example of FIG. 10, the attention net 1000 is implemented as a number of layers of a convolutional neural network. In particular, the attention net comprises an input layer 1005, a shared resnet block layer 1010, resnet block layer 1015, resnet block layer 1020, resnet block layer 1025, a SoftMax norm block layer 1030, and an output layer 1035, col 13, lines 31-38. The Examiner notes that the convolutional neural network is a feedforward neural network)

	Regarding claim 16, Modified Gehring teaches the system of claim 12, Gehring teaches wherein the decoder stage predicts the output sequence iteratively (Our attentions are just dot products between decoder context representations (bottom left) and encoder representations. We add the conditional inputs computed by the attention (center right) to the decoder states which then predict the target words (bottom right), pg. 3, right col, Fig. 1)

	Regarding claim 17, Modified Gehring teaches the system of claim 12, Gehring teaches wherein the learned scaling parameter and the one or more other learned scaling parameters are values between zero and one and add up to one. (Specifically, we pad the input by k-1 elements on both the left and right side by zero vectors, and then remove k elements from the end of the convolution output, pg. 3, left col, first para.)

5.	Claims 7 and 8 are rejected under 35 U.S.C. 103 as being unpatentable over Gehring et al. (Convolutional sequence to sequence learning. International Conference on Machine Learning 2017 Jul 17 (pp. 1243-1252). PMLR.) in view of Wang et al. (US10552968) in view of Madhavaraj et al. (US10490182 filed 12/29/2016) and further in view of Min et al. (US20180124331 filed on 10/26/2017).

	Regarding claim 7, Modified Gehring teaches the method of claim 1, Gehring teaches wherein each branch of the branched attention layer further includes a second interdependent scaling node configured to scale (For convolutional decoders with multiple attention, we scale the gradients for the encoder layers (as nodes) by the number of attention mechanisms we use; we exclude source word embeddings. We found this to stabilize learning since the encoder received too much gradient otherwise (pg. 4 left col., fourth para.); The conditional input generated by the attention is a weighted sum of m vectors (2) and we counteract a change in variance through scaling by m√(1/m); we multiply by m to scale up the inputs to their original size, assuming the attention scores are uniformly distributed, pg. 4, left col., third para.)
	Modified Gehring does not explicitly teach a second intermediate representation of the branch by a second learned scaling parameter
	Min teaches a second intermediate representation of the branch by a second learned scaling parameter (At step 515, process the sampled continuous frames, by a pre-trained (or jointly learned) 3D convolutional neural network, to get intermediate feature representations across L convolutional layers and top-layer features [0052]; At step 530, dynamically perform spatiotemporal attention and layer attention to form a context vector [0055]; The attention weights αti and βtl and context vector zt are computed by the following: … [0088])
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Gehring to incorporate the teachings of Min for the benefit of generating a sequence of words dynamically emphasizing different levels (CNN layers) of 3D convolutional features [0018]  using a deep three-dimensional Convolutional Neural Network (C3D) as an encoder for videos and a recurrent neural network (RNN) as a decoder for the captions [0065] and in this way, videos resident on the servers can be translated thereby into a textual representation for indexing, searching, retrieval, analysis (Min, [0026])

	Regarding claim 8, Modified Gehring teaches the method of claim 7, Modified Gehring does not explicitly teach wherein each branch of the branched attention layer further includes a parameterized attention network and a parameterized transformation network, and wherein the intermediate representation corresponds to an output representation of the parameterized attention network and the second intermediate representation corresponds to an output representation of the parameterized transformation network, and wherein the parameterized transformation network receives a scaled representation generated by the interdependent scaling node.
	Min teaches wherein each branch of the branched attention layer further includes a parameterized attention network (For each feature vector ai,l, the attention mechanism 1000 generates two positive weights at time t, with ati=fatt(ai, ht−1) and βtl=fatt(al, ht−1), which measure the relative importance to location i and layer l for producing the next word based on the history word information) and
	 a parameterized transformation network, (we apply a convolutional transformation 1030 to embed each ai,l into the same semantic space, defined as follows: … [0087]) and 
	wherein the intermediate representation corresponds to an output representation of the parameterized attention network (The attention mechanism ϕ(ht−1, a1, . . . , aL) at time step t is now developed. Let ai,l ∈ ℝ^(nk l) correspond to the feature vector extracted from the l-th layer at location i, where i ∈ [1, . . . , nf l]×[1, . . . , nx l]×[1, . . . , ny l] indicates a certain cuboid in the input video, and nk l is the number of convolutional filters in the l-th layer of C3D. For each feature vector ai,l, the attention mechanism 1000 generates two positive weights at time t, with ati=fatt(ai, ht−1) and βtl=fatt(al, ht−1), which measure the relative importance to location i and layer l for producing the next word based on the history word information [0085]) and
	the second intermediate representation corresponds to an output representation of the parameterized transformation network, (we apply a convolutional transformation 1030 to embed each ai,l into the same semantic space, defined as follows: â_l = Σ_{k=1}^{n_k^l} f(a_l * U_k^l) (4), where l=1, . . . , L−1, and âL=aL; symbol * represents the 3-dimensional convolution operator, and f(·) is an element-wise nonlinear activation function with pooling. Uk l of size Of l×Ox l×Oy l×nk L is the learned semantic embedding parameters. In addition, Of l, Ox l and Oy l are chosen such that each âl (for all l) will have the same dimensions of nk L×nf L×nx L×ny L and induce spatiotemporal alignment across features from different layers (indexed by i ∈ [1, . . . , nf L]×[1, . . . , nx L]×[1, . . . , ny L])) and
	wherein the parameterized transformation network receives a scaled representation generated by the interdependent scaling node (For each feature vector ai,l, the attention mechanism 1000 generates two positive weights at time t, with ati=fatt(ai, ht−1) and βtl=fatt(al, ht−1), which measure the relative importance to location i and layer l for producing the next word based on the history word information [0085])
	The same motivation to combine dependent claim 7 applies here.

6.	Claims 4, 18 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Gehring et al. (Convolutional sequence to sequence learning. International Conference on Machine Learning 2017 Jul 17 (pp. 1243-1252). PMLR.) in view of Wang et al. (US10552968) in view of Madhavaraj et al. (US10490182 filed 12/29/2016) and further in view of Hsu et al. (Recurrent Neural Network Encoder with Attention for Community Question Answering, arXiv:1603.07044v1 [cs.CL] 23 Mar 2016).

	Regarding claim 4, Modified Gehring teaches the method of claim 3, Hsu teaches wherein the aggregation node is configured to aggregate the branch output representations by summation. (concatenate their outputs as an input to a feed-forward neural network (FNN), pg. 3, left col, first para.)
	The same motivation to combine as independent claim 1 applies here.

	Regarding claim 18, Gehring teaches a non-transitory machine-readable medium having stored thereon a machine translation model, the machine translation model (We measure generation speed both on GPU and CPU hardware. Specifically, we measure GPU speed on three generations of Nvidia cards: a GTX-1080ti, an M40 as well as an older K40 card. CPU timings are measured on one host with 48 hyper-threaded cores (Intel Xeon E5-2680 @ 2.50GHz), pg. 6, right col, 5.3, Generation Speed) comprising:
	a decoder stage that predicts an output sequence based on the layer encoded representations generated by each of the one or more branched attention encoder layers (We add the conditional inputs computed by the attention (center right) to the decoder states which then predict the target words (bottom right), pg. 3, right col, Fig. 1)
	Gehring does not explicitly teach at least one interdependent scaling node that scales an intermediate representation of the branch by a learned scaling parameter, the learned scaling parameter depending on one or more other learned scaling parameters of one or more other interdependent scaling nodes of one or more other branches among the plurality of branches; an encoder stage including one or more branched attention encoder layers arranged sequentially, each branched attention encoder layer, a plurality of branches arranged in parallel, each branch, a parameterized attention network that performs an attention operation based on a layer encoded representation of a preceding branched attention encoder layer or, when the branched attention encoder layer is first among the one or more branched attention encoder layers, an input representation of an input sequence; a parameterized transformation network that performs a parameterized transformation operation based on an output representation of the parameterized attention network; an aggregation node that aggregates a plurality of branch output representations generated by the plurality of branches to generate a layer encoded representation of the branched attention encoder layer
	Wang teaches at least one interdependent scaling node (scale attention net 715 (for the first branch) and scale attention net 720 (for the second branch), Fig. 7)
	that scales an intermediate representation of the branch (scale attention net 715 multiplies various intermediate representations (Fig. 8); for example, the feature data x1 840 is combined with the attention map x1 835 via a multiplication operation 84)
	by a learned scaling parameter, the learned scaling parameter (all combinations of feature and attention data is combined in a weighted sum via addition operation 843 to produce dense feature 847, which is a collection of all dense features for all pixels, col 13, lines 13-16, Fig. 8. The Examiner notes that weight is a learned scaling parameter) 
	a plurality of branches arranged in parallel, each branch (705, 715, 725, Fig. 7 as first branch and 710, 720, 730, Fig. 7 as second branch) comprising:
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Gehring to incorporate the teachings of Wang for the benefit of tracking one or more objects (e.g., car), moving within images captured in sequence (e.g., video) implemented using one or more convolutional neural networks trained on images of different sizes, where each image of a given size has texture data emphasized via an attention map (Wang, col 2, lines 45-50)
	Madhavaraj teaches depending on one or more other learned scaling parameters of one or more other interdependent scaling nodes of one or more other branches among the plurality of branches; and (ANN weights 160 (Block 350) depends on ANN weights 160 (Block 310, Fig. 4); Essentially, the weights in the scaled ANN are equivalent to products k1wji 1, col 6, lines 15-16; For example, if the weights at the first layer are multiplied by a scale factor k1 and the weights at the second layer are multiplied by k2, the standard learning rate that would have been used for the unmodified weights is essentially modified to yield equivalent update steps according to the scale factors k1 and k2, col 6, lines 7-12;  Each layer includes a number of nodes, denoted Nn for the number of nodes at layer n, col 2, lines 12-13)
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Modified Gehring to incorporate the teachings of Madhavaraj for the benefit of initialization and learning rate adjustment for artificial neural networks (ANNs) (Madhavaraj, col 1, lines 8-9).
	Hsu teaches an encoder stage including one or more branched attention encoder layers arranged sequentially, each branched attention encoder layer (By adding an attention mechanism to the encoder, we allow the second LSTM to attend to the sequence of output vectors from the first LSTM, pg. 3, right col, last para.) comprising:
	a parameterized attention network that performs an attention operation based on a layer encoded representation of a preceding branched attention encoder layer or, when the branched attention encoder layer is first among the one or more branched attention encoder layers, an input representation of an input sequence; (By adding an attention mechanism to the encoder, we allow the second LSTM to attend to the sequence of output vectors from the first LSTM, and hence generate a weighted representation of first object according to both objects. Let hN be the last output of second LSTM and M = [h1, h2, · · ·, hL] be the sequence of output vectors of the first object. The weighted representation of the first object is … We parametrize this model using another FNN. ... So the final input to the classifier will be hN, hI, as well as augmented features, pg. 3-4, left col, right and left col)
	a parameterized transformation network that performs a parameterized transformation operation based on an output representation of the parameterized attention network; (Figure 3 shows our framework: the three lower models are separate serialized LSTM-encoders for the three respective object pairs, whereas the upper model is an FNN that takes as input the concatenation of the outputs of three encoders, and predicts the relationships for all three pairs. More specifically, the output layer consists of three softmax layers where each one is intended to predict the relationship of one particular pair, pg. 4, left col, third para. “Examiner note: the three softmax layers constitute a non-linear transformation function, which is interpreted as the transformation network”) and
	an aggregation node that aggregates a plurality of branch output representations generated by the plurality of branches to generate a layer encoded representation of the branched attention encoder layer (concatenate their outputs as an input to a feed-forward neural network (FNN), pg. 3, left col, first para.)
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Gehring to incorporate the teachings of Hsu for the benefit of augmenting encoders with the ability to attend to past outputs directly (Hsu, pg. 1, right col, last para.) 

	Regarding claim 20, Modified Gehring teaches the non-transitory machine-readable medium of claim 18. Gehring teaches wherein the decoder stage includes one or more branched attention decoder layers, each branched attention decoder layer receiving the layer encoded representation generated by a corresponding branched attention encoder layer among the one or more branched attention encoder layers (In particular, the attention of the first layer determines a useful source context which is then fed to the second layer that takes this information into account when computing attention etc. The decoder also has immediate access to the attention history of the k-1 previous time steps because the conditional inputs c^{l-1}_{i-k}, ..., c^{l-1}_i are part of h^{l-1}_{i-k}, ..., h^{l-1}_i which are input to h^l_i, … In Appendix §C, we plot attention scores for a deep decoder and show that at different layers, different portions of the source are attended to, pg. 4, left col, first para.)

7.	Claim 19 is rejected under 35 U.S.C. 103 as being unpatentable over Gehring et al. (Convolutional sequence to sequence learning. International Conference on Machine Learning 2017 Jul 17 (pp. 1243-1252). PMLR.) in view of Wang et al. (US10552968) in view of Madhavaraj et al. (US10490182 filed 12/29/2016) in view of Hsu et al. (Recurrent Neural Network Encoder with Attention for Community Question Answering, arXiv:1603.07044v1 [cs.CL] 23 Mar 2016) and further in view of Min et al. (US20180124331 filed on 10/26/2017).

	Regarding claim 19, Modified Gehring teaches the non-transitory machine-readable medium of claim 18. Modified Gehring does not explicitly teach wherein the at least one interdependent scaling node includes a first interdependent scaling node between the parameterized attention network and the parameterized transformation network and a second interdependent scaling node between the parameterized transformation network and the aggregation node.
	Hsu teaches a second interdependent scaling node between the parameterized transformation network and the aggregation node (Figure 3 shows our framework: the three lower models are separate serialized LSTM-encoders for the three respective object pairs, whereas the upper model is an FNN that takes as input the concatenation of the outputs of three encoders, and predicts the relationships for all three pairs, pg. 4, left col, second para. “Examiner note: model 1 of the three lower LSTM-encoder models is the second interdependent scaling node, which is between the FNN (as the aggregation node) and the relationships for all three pairs (as the parameterized transformation network)”)
	It would have been obvious to a person having ordinary skill in the art before the effective filing date of the claimed invention to have modified the method of Gehring to incorporate the teachings of Hsu for the benefit of augmenting encoders with the ability to attend to past outputs directly (Hsu, pg. 1, right col, last para.)
	Min teaches wherein the at least one interdependent scaling node includes a first interdependent scaling node between the parameterized attention network and the parameterized transformation network (The attention mechanism involves layers 1 through L (collectively denoted by figure reference numeral 1010), feature extraction 1020, convolutional transformation 1030, spatial-temporal attention 1040, and abstraction attention 1050 [0084], Fig. 10 “Examiner notes: spatial-temporal attention 1040 is the first interdependent scaling node between abstraction attention 1050 and convolutional transformation 1030”)
	The same motivation to combine as set forth for dependent claim 7 applies here.
	
Conclusion
	Any inquiry concerning this communication or earlier communications from the examiner should be directed to MORIAM MOSUNMOLA GODO whose telephone number is (571)272-8670. The examiner can normally be reached Monday-Friday 7:30am-5:30pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Li B. Zhen can be reached on (571)272-3768. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/M.G./Examiner, Art Unit 2121                                    


 	
/Li B. Zhen/Supervisory Patent Examiner, Art Unit 2121