DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
In response to the office action of 12/23/2020, the applicant submitted an amendment, filed 3/16/2021, traversing the prior art rejections. Applicant’s arguments have been fully considered but are not found persuasive, for the reasons explained in the response to arguments.
Response to Arguments
In what follows, applicant’s arguments and comments are addressed in the order presented: each argument is summarized in one ¶, followed by one or more ¶’s giving the examiner’s response.
Following a broad overview of the last office action on page 2 ¶ 1, in ¶’s 2-3 it is argued that: “the definite article” “the” “in the term” “the first target context vector” “in claims 2 and 11 indicates a sequential order of the target context vector …”
Respectfully, nowhere in the claim or in the disclosure is any “sequential order” associated with the “target context vector” defined. The “target context vector” is simply defined as “Si” in an equation following specification ¶ 0028. Nowhere in the disclosure either expressly and/or implicitly (e.g. by way of a summation using the 
From the end of page 2 to the next-to-last ¶ of page 4, only broad overviews are provided, together with copies of selected parts of the office action, of claim 1, and of specific teachings of the primary reference “Graves1”. Then, in that ¶, it is concluded: “It can be seen that, in the Office Action, the hidden vector sequence in Graves1 was firstly asserted as corresponding to the source context vector in claim 1, and then the same hidden vector sequence in Graves1 was asserted again as corresponding to the weight vector in claim 1. Because of the conflicting assertions, Graves 1 cannot be expected to teach the above feature b) and feature c1) at the same time”.
This statement is fundamentally flawed because it offers no proof as to why an entity in the prior art (i.e., the function “h”) cannot serve as both the claim’s “source context vector” and its “weight vector” in the limitations for which it was cited.
As an initial matter, the claim limitation “obtaining a weight vector according to the source context vector and the reference context vector” is NOT inconsistent with the “weight vector” simply being the “source context vector”, when the “source context vector” is in turn also determined using the “reference context vector”.
Next, applicant’s disclosure also does not forbid this mapping. According to specification ¶ 0019 last sentence and ¶ 0020: “obtaining the weight vector by using the following equation: Z_i = σ(W_z e(y_{i-1}) + U_z S_{i-1} + C_z C_i)”, “where Zi is the weight vector”, and σ “may be a sigmoid function”. Nothing in this equation is specifically identified as the “reference context vector”; only “e(y_{i-1})” and “S_{i-1}” are vaguely asserted to possess an implicit, undefined relation to it. In conclusion, since no specific “activation function” “σ” is defined, “Z_i” (the “weight vector”) could here simply reduce to “C_z C_i”, which is simply the “source context vector”.
Granted, this mapping would break down for the limitation “weighting the source context vector and the reference context vector by using the weight vector”. For this limitation, though, the office action relied on “Who” in Graves2, equation 3, as the claim’s “weight vector”, which was used for weighting “hu” (source context vector) and “b0” (reference context vector). Here two different entities in the prior art are mapped to the claim’s “weight vector” and “source context vector” respectively, which makes the argument above moot. In response to applicant's arguments against the references individually, one cannot show nonobviousness by attacking references individually where the rejections are based on combinations of references. See In re Keller, 642 F.2d 413, 208 USPQ 871 (CCPA 1981); In re Merck & Co., 800 F.2d 1091, 231 USPQ 375 (Fed. Cir. 1986).
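For illustration only, the specification equation quoted above from ¶ 0019-0020 can be evaluated numerically. The following is a minimal Python sketch, assuming that σ is the logistic sigmoid (which the specification says it may be) and that the sum inside σ is applied elementwise. The helper names (`sigmoid`, `matvec`, `weight_vector`), the 2-dimensional sizes and all numeric values are hypothetical and do not come from the application or from the cited references.

```python
import math


def sigmoid(x):
    """Logistic sigmoid 1 / (1 + exp(-x)), assumed here for sigma."""
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def weight_vector(W_z, U_z, C_z, e_prev, s_prev, c_i):
    """Z_i = sigma(W_z e(y_{i-1}) + U_z S_{i-1} + C_z C_i), elementwise."""
    a = matvec(W_z, e_prev)   # term built from the word vector e(y_{i-1})
    b = matvec(U_z, s_prev)   # term built from the intermediate state S_{i-1}
    c = matvec(C_z, c_i)      # term built from the source context vector C_i
    return [sigmoid(x + y + z) for x, y, z in zip(a, b, c)]


# Hypothetical toy values, chosen only so the arithmetic is easy to follow.
W_z = [[0.5, 0.0], [0.0, 0.5]]
U_z = [[1.0, 0.0], [0.0, 1.0]]
C_z = [[0.0, 1.0], [1.0, 0.0]]
e_prev = [2.0, -2.0]
s_prev = [0.0, 1.0]
c_i = [-1.0, 0.0]

# The pre-activation sum is [1.0, -1.0], so z_i is [sigmoid(1), sigmoid(-1)].
z_i = weight_vector(W_z, U_z, C_z, e_prev, s_prev, c_i)
print(z_i)
```

The sketch shows only how the three terms of the quoted equation are summed before σ is applied; it takes no position on which quantity corresponds to the claimed “reference context vector”.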
On page 5 the 2nd ¶, it is asserted: “As is the case with Graves1, Graves2 also fails to disclose or to suggest obtaining a weight vector according to the source context vector and the reference context vector”.
Graves2 § 2.1 Eq. 3 recites “gu=Who hu+b0”, where “Who is the hidden-output weight matrix”. Here “Who” maps to the claim’s weight vector, “hu” to the claim’s source context vector and “b0” to the claim’s reference context vector, and clearly “Who” can be obtained using “hu” and “b0”.
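As a concrete reading of the quoted Eq. 3, the expression is a matrix-vector product followed by the addition of a second vector. The following minimal Python sketch evaluates it under assumed shapes (three outputs, two hidden units); the function name `prediction_output`, the variable names and every numeric value are hypothetical and are not taken from Graves2.

```python
def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def prediction_output(W_ho, h_u, b0):
    """g_u = W_ho h_u + b0, following the form of the quoted Eq. 3."""
    return [wh + b for wh, b in zip(matvec(W_ho, h_u), b0)]


# Hypothetical values: W_ho has one row per element of g_u and one
# column per element of h_u.
W_ho = [[1.0, 2.0], [0.0, -1.0], [0.5, 0.5]]
h_u = [1.0, 3.0]
b0 = [0.1, 0.2, 0.3]

g_u = prediction_output(W_ho, h_u, b0)
print(g_u)  # approximately [7.1, -2.8, 2.3]
```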
On page 5, ¶ 5, it is asserted that “claims 2-9, 11-18 and 20” “depend from the above-noted independent claims” “Accordingly” they “are patentably distinct over the cited art of record for at least those reasons stated above with respect to claims 1, 10 and 19”.
Since applicants have not argued the merits of these dependent claims, but assert patentability solely through their dependence on the allegedly patentable parent claims, they stand or fall with said parent claims and hence no further response to applicant’s arguments is necessary.
Claim Objections
Claims 2 and 11 are objected to because of the following informalities: “the first target context vector” lacks proper antecedent basis. Appropriate correction is required.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the 

Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Graves (US Patent 9,263,036) (Graves1), and further in view of Graves (“Sequence Transduction with Recurrent Neural Networks”, International Conference on Machine Learning (ICML 2012)) (Graves2).
Regarding claim 1, Graves1 does teach a method for converting a source sequence to a target sequence, performed by a computing device, wherein the source sequence and the target sequence are representations of natural language contents (Col. 1 lines 15-17: “The present invention relates generally to speech recognition” (a method of converting “speech” (a source natural language sequence) to “transcription” (a target sequence) using a speech recognizer computing device) “by neural networks”), 
the method comprising:
obtaining the source sequence from an input signal (Col. 2 lines 56-57: “Given an input sequence x=(x1, ..., xT)” (obtaining a source sequence) “may compute the hidden vector sequence h=(h1,….,hT)”); 
converting the source sequence into one or more source context vectors (Col. 2 lines 56-57: “Given an input sequence x=(x1, …., xT)” (source sequence) “may compute” (converted into) “the hidden vector sequence h=(h1,….,hT)” (one or more source context vectors)); 

obtaining a weight vector according to the source context vector and the reference context vector (Col. 3 lines 49-54: “the hidden vector sequence” “h_t^n = H(W_{h^{n-1}h^n} h_t^{n-1} + W_{h^n h^n} h_{t-1}^n + b_h^n)” (a weight vector for when n is not 0) “where h^0 = x” (and depends on “h” (source context vector) and “b_h” (reference context vector))).
Graves1 does not specifically disclose:
obtaining a target context vector corresponding to each source context vector; 
combining the target context vectors to obtain the target sequence; and outputting the target sequence;
wherein obtaining a target context vector corresponding to each source context vector comprises:
weighting the source context vector and the reference context vector by using the weight vector, to obtain a weighted source context vector and a weighted reference context vector; and
predicting the target context vector corresponding to the source context vector by using the weighted source context vector and the weighted reference context vector.
Graves2 does teach:
obtaining a target context vector corresponding to each source context vector (§2.1 2nd ¶: “Given y” (using the source context vector) “computes the hidden vector sequence” “and the prediction sequence (g0, …, gu)” (obtaining a target context vector e.g. “g0” … “gu”)); 
combining the target context vectors to obtain the target sequence (§2.1 2nd ¶: “the prediction sequence (g0,…,gu)” “gu=Wh0 hu +b0” (the combination of “g0” to “gu” forms a target sequence)); 
and outputting the target sequence (§ 2 last sentence: “the prediction network” “outputs the prediction vector sequence g=(g0, g1, …, gu)” (outputting the target sequence));
wherein obtaining a target context vector corresponding to each source context vector comprises:
weighting the source context vector and the reference context vector by using the weight vector, to obtain a weighted source context vector and a weighted reference context vector (§ 2.1 Eq. 3: “gu=Wh0 hu +b0” “W h0 is the hidden-output weight matrix” (this equation in “gu” is a weighted sum of “hu” (source context vector) and “b0” (reference context vector) and serves as weighted source and/or reference context vector)); and
predicting the target context vector corresponding to the source context vector by using the weighted source context vector and the weighted reference context vector (§2.1 2nd ¶: “the prediction sequence” (predicting the target context vector) “gu=Wh0 hu+b0” (using weighted source and reference context vector)).
It would therefore have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the methods of the “Prediction Network” of Graves2 into the RNN implementation of Graves1, because doing so would enable the combined systems and their associated methods to perform in combination as they do separately, and would further enable Graves1 to have as its “output sequence” a “transcription sequence”, as disclosed in lines 1-4 of the caption to Fig. 1 in Graves2, and thus put the model to practical application.

Regarding claim 2, Graves1 does not specifically disclose the method according to claim 1, wherein the target context vectors are predicted sequentially, and wherein in obtaining the reference context vector,
when the target context vector to be predicted is the first target context vector of the target sequence, the reference context vector is an initial target context vector; and
when the target context vector to be predicted is not the first target context vector of the target sequence, the reference context vector comprises a previous target context vector corresponding to a previous source context vector.

and wherein in obtaining the reference context vector,
when the target context vector to be predicted is the first target context vector of the target sequence, the reference context vector is an initial target context vector (§ 2.1 Eq. 3: “gu=Wh0 hu +b0” (“b0” (the reference context vector used in obtaining “gu” (target context vector) is an initial target context vector)); and
when the target context vector to be predicted is not the first target context vector of the target sequence, the reference context vector comprises a previous target context vector corresponding to a previous source context vector (§ 2.1 Eq. 3: “gu=Wh0 hu +b0” (if the index “u” is nonzero which implies “gu” (the target context vector) is not the first target context vector, according to this equation it depends on “b0” (a previous reference context vector which is associated with “h0” which is a previous source context vector) and “W h0” (which depends on “g0” which is a previous target context vector associated with “h0”)).
For obviousness to combine Graves1 and Graves2 see claim 1.

Regarding claim 3, Graves1 does teach:

obtaining the weight vector z_i by using the following equation: z_i = σ(W_z e(y_{i-1}) + U_z S_{i-1} + C_z c_i)
where z_i is the weight vector, σ is an activation function, e(y_{i-1}) is a word vector in the ith reference context vector, S_{i-1} is an intermediate state in the ith reference context vector, c_i is the ith source context vector, W_z, U_z, and C_z are module parameters of the activation function σ, and i represents a sequence number of a vector (Col. 3 lines 49-54: “the hidden vector sequence” “h_t^n = H(W_{h^{n-1}h^n} h_t^{n-1} + W_{h^n h^n} h_{t-1}^n + b_h^n)” (“h_t^n”, or the weight vector, depends on the function “H” (σ), which according to Col. 2 lines 66-67 (“H is usually an elementwise application of a sigmoid function”) is an activation function; “h_{t-1}^n” is an intermediate state in the RNN; “b_h^n” is a reference context vector with the same index as “h_t^n”; “h_t^{n-1}” reduces for n=1 to “x”, e.g. a word or word vector, as it is associated with the original “input sequence”, which also makes it a source context vector; and finally “W_{h^{n-1}h^n}” and “W_{h^n h^n}” serve as module parameters of the function “H” (activation function))).
Graves1 does not specifically disclose the method according to claim 1, wherein the target context vectors are predicted sequentially.

For obviousness to combine Graves1 and Graves2 see claim 1.
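To make the structure of the Graves1 recurrence cited for claim 3 easier to follow, the following minimal Python sketch iterates it over layers n and time steps t. It assumes that H is the elementwise logistic sigmoid, that the hidden state before the first time step is zero, and that the lowest layer receives the input sequence x; the function names, the one-unit layers and all numeric values are hypothetical and are not taken from Graves1.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, assumed here as the elementwise function H."""
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def deep_rnn_hidden(xs, layers):
    """Compute h_t^n for every layer n and time step t.

    xs     : input sequence, one vector per time step (plays the role of h^0)
    layers : one (W_below, W_recurrent, b) triple per layer, lowest first
    """
    below = xs
    hidden = []
    for W_below, W_recurrent, b in layers:
        h_prev = [0.0] * len(b)  # assumption: zero state before t = 1
        sequence = []
        for h_below in below:
            pre = [p + q + r for p, q, r in
                   zip(matvec(W_below, h_below), matvec(W_recurrent, h_prev), b)]
            h_prev = [sigmoid(v) for v in pre]
            sequence.append(h_prev)
        hidden.append(sequence)
        below = sequence  # the output of layer n feeds layer n + 1
    return hidden


# Hypothetical toy network: scalar input, two layers with one unit each.
xs = [[1.0], [0.0], [-1.0]]
layers = [
    ([[1.0]], [[0.5]], [0.0]),
    ([[2.0]], [[0.0]], [-1.0]),
]
hidden = deep_rnn_hidden(xs, layers)
print(hidden)
```

Each layer's state at step t is computed from the layer below at the same step, from the same layer at the previous step, and from an additive term, which is the three-term form of the cited equation.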

Regarding claim 4, Graves1 does teach the method according to claim 3, wherein the module parameters Wz, Uz, and Cz are obtained by learning from training data (Col. 4 lines 15-17: “The training module trains” (learning) “the networks by using their activations to define a normalised, differentiable distribution Pr(y|x) over output sequences” (from training data) “and optimising the network weights” (e.g., “W_{h^{n-1}h^n}” and “W_{h^n h^n}” in Eq. 9, which serve as module parameters of the activation function)).

Regarding claim 5, Graves1 does teach the method according to claim 3, wherein the activation function is a sigmoid function f(z)=1/(1+exp(-z)) (Col. 2 lines 66-67: “H is usually an elementwise application of a sigmoid function” (i.e., as shown in Graves2 § 2.1 “sigmoid σ(x)=1/(1+exp(-x))”)).

Regarding claim 6, Graves1 does teach the method according to claim 3, wherein the module parameters Wz, Uz, and Cz are obtained by way of maximizing a likelihood 
argmax_P ∑_{n=1}^{N} log p(Y_n | X_n)
where N is a quantity of training sequence pairs in a training sequence set, X_n is a source sequence in an nth training sequence pair, Y_n is a target sequence in the nth training sequence pair, P is a parameter of the computing device (Col. 4 lines 15-17: “The training module trains” (training) “the networks by using their activations to define a normalised, differentiable distribution Pr(y|x)” (involving a probability or likelihood “Pr” (a parameter of the computing device) from “x” (a source sequence) to “y” (a target sequence)) “output sequences” (from a training sequence) “and optimising the network weights” (e.g., “W_{h^{n-1}h^n}” and “W_{h^n h^n}” in Eq. 9, which serve as module parameters of the activation function) “for example by applying gradient descent to maximize log Pr(z|x)” (by maximizing the likelihood of the said source target training sequence); according to SJOLUND (US 2017/0177812) ¶ 0057 page 7 lines 2+ “maximum likelihood” (“ML”) calculation involves “setting θML=argmaxθ p(d|x,θ)”, where the probability as shown in ¶ 0066 involves a summation over all possible “source” and “target” sequences).
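The form of the claim 6 objective, a sum of log-probabilities over the N training pairs that is maximized over a parameter, can be shown with a toy calculation. The sketch below is a minimal Python illustration, assuming that two candidate parameter settings have already assigned a probability p(Y_n | X_n) to each of three training pairs. The candidate names, the probabilities and the exhaustive comparison are hypothetical; neither the application nor the cited references selects parameters this way, and the Graves1 passage quoted above mentions gradient descent instead.

```python
import math


def log_likelihood(pair_probabilities):
    """Sum of log p(Y_n | X_n) over the training sequence pairs."""
    return sum(math.log(p) for p in pair_probabilities)


# Hypothetical probabilities that two candidate parameter settings assign
# to the same three (X_n, Y_n) training pairs.
candidates = {
    "P_a": [0.2, 0.5, 0.1],
    "P_b": [0.4, 0.5, 0.3],
}

# argmax over the candidate parameters of the summed log-probability
best = max(candidates, key=lambda name: log_likelihood(candidates[name]))
print(best, log_likelihood(candidates[best]))
```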


Graves2 does teach the method according to claim 1, wherein dimensions of the weight vector is the same as dimensions of the target context vector (§ 2.1 Eq. 3+: “gu=Wh0 hu +b0” “Wh0” (the weight vector) “is the hidden-output weight matrix” (i.e. it must have the same number of columns (dimensions) as “hu” and/or “gu” (target context vector)).
For obviousness to combine Graves1 and Graves2 see claim 1.

Regarding claim 8, Graves1 does teach the method according to claim 1, wherein each element in the weight vector is a real number greater than 0 and less than 1 (Col. 3 lines 49-52: “the hidden vector sequence” “h_t^n = H(W_{h^{n-1}h^n} h_t^{n-1} + W_{h^n h^n} h_{t-1}^n + b_h^n)” (the weight vector), where “H is usually an elementwise application of a sigmoid function” (Col. 2 lines 66-67), where “sigmoid σ(x)=1/(1+exp(-x))” (it is always less than one and greater than 0)).
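The range property relied on for claim 8 follows from the quoted sigmoid formula: exp(-x) is positive for every real x, so 1/(1+exp(-x)) lies strictly between 0 and 1. The following minimal Python sketch checks this on a few sample inputs. The sample values are hypothetical, and they are kept moderate because floating-point arithmetic, unlike the exact formula, can round the result for large inputs to exactly 1.0.

```python
import math


def sigmoid(x):
    """Logistic sigmoid 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical sample inputs spanning negative, zero and positive values.
samples = [-10.0, -1.0, 0.0, 1.0, 10.0]
values = [sigmoid(x) for x in samples]
print(values)

# Every sampled value lies strictly between 0 and 1.
in_open_interval = all(0.0 < v < 1.0 for v in values)

# Floating-point caveat: exp(-40) is below double-precision resolution
# relative to 1, so the computed result rounds to exactly 1.0.
rounded = sigmoid(40.0)
```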

Regarding claim 9, Graves1 does not specifically disclose the method according to claim 1, wherein the target context vectors are predicted sequentially, and wherein predicting the ith target context vector corresponding to the ith source context vector 
obtaining the ith target context vector s_i by using the following equation:
s_i = f((1 - z_i){W e(y_{i-1}) + U s_{i-1}} + z_i C c_i)
where s_i is the ith target context vector corresponding to the ith source context vector, f is an activation function, e(y_{i-1}) is the word vector in the ith reference context vector, s_{i-1} is an intermediate state in the ith reference context vector, z_i is the weight vector, c_i is the source context vector, W, U, and C are module parameters of the activation function f, and i represents a sequence number of a vector.
Graves2 does teach the method according to claim 1, wherein the target context vectors are predicted sequentially (§ 2.1 second ¶ lines 1-3: “computes” “the prediction sequence (g0, …, gu)” “from u=0 to u” (the target context vectors are predicted sequentially)), 
and wherein predicting the ith target context vector corresponding to the ith source context vector by using the ith weighted source context vector and the ith weighted reference context vector comprises:
obtaining the ith target context vector s_i by using the following equation:
s_i = f((1 - z_i){W e(y_{i-1}) + U s_{i-1}} + z_i C c_i)
where s_i is the ith target context vector corresponding to the ith source context vector, f is an activation function, e(y_{i-1}) is the word vector in the ith reference context vector, s_{i-1} is an intermediate state in the ith reference context vector, z_i is the weight vector, c_i is the source context vector, W, U, and C are module parameters of the activation function f, and i represents a sequence number of a vector (§ 2.1: “gu=Wh0 H (W ih yu +W hh h u-1 +bh) +b0” (the ith target context vector corresponding to the ith source context vector) “where W ih is the input-hidden weight matrix, W hh is the hidden-hidden weight matrix” (e.g. W or U or C, module parameters) “Wh0 is the hidden-output weight matrix” (is the weight vector) “H is the hidden layer function” “is an elementwise application of” “logistic sigmoid” “functions” (f, an activation function), “yu” corresponds to the “input sequence” (is a word vector, § 2.1 lines 3-4), “h u-1” (is an intermediate state), “b0” (is the reference context vector), and “u” signifies the sequence number of the vectors).
For obviousness to combine Graves1 and Graves2 see claim 1.
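The equation recited in claim 9 weights a reference term by (1 - z_i) and a source term by z_i before applying f. The following minimal Python sketch evaluates it under stated assumptions: the products with z_i and (1 - z_i) are taken elementwise, f is taken to be tanh (the claim only calls f an activation function), and W, U and C are identity matrices. The function name `target_context` and all numeric values are hypothetical and do not come from the application or from the cited references.

```python
import math


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def target_context(z_i, W, U, C, e_prev, s_prev, c_i, f=math.tanh):
    """s_i = f((1 - z_i){W e(y_{i-1}) + U s_{i-1}} + z_i C c_i), elementwise."""
    reference = [a + b for a, b in zip(matvec(W, e_prev), matvec(U, s_prev))]
    source = matvec(C, c_i)
    return [f((1.0 - z) * r + z * s) for z, r, s in zip(z_i, reference, source)]


# Hypothetical toy values: identity parameters keep the arithmetic visible.
identity = [[1.0, 0.0], [0.0, 1.0]]
e_prev = [1.0, 0.0]
s_prev = [0.0, 1.0]    # reference term becomes [1.0, 1.0]
c_i = [-1.0, -1.0]     # source term becomes [-1.0, -1.0]

# z_i = 0 keeps only the reference term, z_i = 1 keeps only the source
# term, and z_i = 0.5 mixes the two equally.
only_reference = target_context([0.0, 0.0], identity, identity, identity, e_prev, s_prev, c_i)
only_source = target_context([1.0, 1.0], identity, identity, identity, e_prev, s_prev, c_i)
mixed = target_context([0.5, 0.5], identity, identity, identity, e_prev, s_prev, c_i)
print(only_reference, only_source, mixed)
```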

Regarding claim 10, Graves1 does teach an apparatus, comprising a processor and a non-transitory storage medium storing program instructions for execution by the processor (Col. 2 lines 6-10: “It will also be appreciated that any module, unit, component, server, computer, terminal or device exemplified herein that executes instructions may include or otherwise have access to computer readable media such as storage media, computer storage media, or data storage devices”; Col. 2 lines 49-52: “Referring now to FIG. 1, a deep RNN is shown. As will be appreciated by a person of 
wherein the program instructions, when executed by the processor, cause the apparatus to perform a process of converting a source sequence to a target sequence, wherein the source sequence and the target sequence are representations of natural language contents (Col. 1 lines 15-17: “The present invention relates generally to speech recognition” (a method of converting “speech” (a source natural language sequence) to “transcription” (a target sequence) using a speech recognizer computing device) “by neural networks”), 
and the process comprises:
obtaining the source sequence from an input signal (Col. 2 lines 56-57: “Given an input sequence x=(x1, ..., xT)” (obtaining a source sequence) “may compute the hidden vector sequence h=(h1,….,hT)”); 
converting the source sequence into one or more source context vectors (Col. 2 lines 56-57: “Given an input sequence x=(x1, …., xT)” (source sequence) “may compute” (converted into) “the hidden vector sequence h=(h1,….,hT)” (one or more source context vectors)); 
obtaining a reference context vector corresponding to 
obtaining a weight vector according to the source context vector and the corresponding reference context vector (Col. 3 lines 49-54: “the hidden vector sequence” “h_t^n = H(W_{h^{n-1}h^n} h_t^{n-1} + W_{h^n h^n} h_{t-1}^n + b_h^n)” (a weight vector for when n is not 0) “where h^0 = x” (and depends on “h” (source context vector) and “b_h” (reference context vector))).
Graves1 does not specifically disclose:
obtaining a target context vector corresponding to each source context vector; 
combining the target context vectors to obtain the target sequence; and outputting the target sequence;
wherein obtaining a target context vector corresponding to each source context vector comprises:
weighting the source context vector and the reference context vector by using the weight vector, to obtain a weighted source context vector and a weighted reference context vector; and
predicting the target context vector corresponding to the source context vector by using the weighted source context vector and the weighted reference context vector.
Graves2 does teach:
obtaining a target context vector corresponding to each source context vector (§2.1 2nd ¶: “Given y” (using the source context vector) “computes the hidden vector sequence” “and the prediction sequence (g0, …, gu)” (obtaining a target context vector e.g. “g0” … “gu”)); 
combining the target context vectors to obtain the target sequence (§2.1 2nd ¶: “the prediction sequence (g0,…,gu)” “gu=Wh0 hu +b0” (the combination of “g0” to “gu” forms a target sequence)); 
and outputting the target sequence (§ 2 last sentence: “the prediction network” “outputs the prediction vector sequence g=(g0, g1, …, gu)” (outputting the target sequence));
wherein obtaining a target context vector corresponding to each source context vector comprises:
weighting the source context vector and the reference context vector by using the weight vector, to obtain a weighted source context vector and a weighted reference context vector (§ 2.1 Eq. 3: “gu=Wh0 hu +b0” “W h0 is the hidden-output weight matrix” (this equation in “gu” is a weighted sum of “hu” (source context vector) and “b0” (reference context vector) and serves as weighted source and/or reference context vector)); and
predicting the target context vector corresponding to the source context vector by using the weighted source context vector and the weighted reference context vector (§2.1 2nd ¶: “the prediction sequence” (predicting the target context vector) “gu=Wh0 hu+b0” (using weighted source and reference context vector)).
It would therefore have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate the methods of the “Prediction Network” of Graves2 into the RNN implementation of Graves1, because doing so would enable the combined systems and their associated methods to perform in combination as they do separately, and would further enable Graves1 to have as its “output sequence” a “transcription sequence”, as disclosed in lines 1-4 of the caption to Fig. 1 in Graves2, and thus put the model to practical application.

Regarding claim 11, Graves1 does not specifically disclose the apparatus according to claim 10, wherein the target context vectors are predicted sequentially, and wherein in obtaining the reference context vector,
when the target context vector to be predicted is the first target context vector of the target sequence, the reference context vector is an initial target context vector; and
when the target context vector to be predicted is not the first target context vector of the target sequence, the reference context vector comprises a previous target context vector corresponding to a previous source context vector.

and wherein in obtaining the reference context vector,
when the target context vector to be predicted is the first target context vector of the target sequence, the reference context vector is an initial target context vector (§ 2.1 Eq. 3: “gu=Wh0 hu +b0” (“b0” (the reference context vector used in obtaining “gu” (target context vector) is an initial target context vector)); and
when the target context vector to be predicted is not the first target context vector of the target sequence, the reference context vector comprises a previous target context vector corresponding to a previous source context vector (§ 2.1 Eq. 3: “gu=Wh0 hu +b0” (if the index “u” is nonzero which implies “gu” (the target context vector) is not the first target context vector, according to this equation it depends on “b0” (a previous reference context vector which is associated with “h0” which is a previous source context vector) and “W h0” (which depends on “g0” which is a previous target context vector associated with “h0”)).
For obviousness to combine Graves1 and Graves2 see claim 10.

Regarding claim 12, Graves1 does teach:

obtaining the weight vector z_i by using the following equation: z_i = σ(W_z e(y_{i-1}) + U_z S_{i-1} + C_z c_i)
where z_i is the weight vector, σ is an activation function, e(y_{i-1}) is a word vector in the ith reference context vector, S_{i-1} is an intermediate state in the ith reference context vector, c_i is the ith source context vector, W_z, U_z, and C_z are module parameters of the activation function σ, and i represents a sequence number of a vector (Col. 3 lines 49-54: “the hidden vector sequence” “h_t^n = H(W_{h^{n-1}h^n} h_t^{n-1} + W_{h^n h^n} h_{t-1}^n + b_h^n)” (“h_t^n”, or the weight vector, depends on the function “H”, which according to Col. 2 lines 66-67 (“H is usually an elementwise application of a sigmoid function”) is an activation function; “h_{t-1}^n” is an intermediate state in the RNN; “b_h^n” is a reference context vector with the same index as “h_t^n”; “h_t^{n-1}” reduces for n=1 to “x”, e.g. a word or word vector, as it is associated with the original “input sequence”, which also makes it a source context vector; and finally “W_{h^{n-1}h^n}” and “W_{h^n h^n}” serve as module parameters of the function “H” (activation function))).
Graves1 does not specifically disclose the apparatus according to claim 10, wherein the target context vectors are predicted sequentially.

For obviousness to combine Graves1 and Graves2 see claim 10.

Regarding claim 13, Graves1 does teach the apparatus according to claim 12, wherein the module parameters Wz, Uz, and Cz are obtained by learning from training data (Col. 4 lines 15-17: “The training module trains” (learning) “the networks by using their activations to define a normalised, differentiable distribution Pr(y|x) over output sequences” (from training data) “and optimising the network weights” (e.g., “W_{h^{n-1}h^n}” and “W_{h^n h^n}” in Eq. 9, which serve as module parameters of the activation function)).

Regarding claim 14, Graves1 does teach the apparatus according to claim 12, wherein the activation function is a sigmoid function f(z)=1/(1+exp(-z)) (Col. 2 lines 66-67: “H is usually an elementwise application of a sigmoid function” (i.e., as shown in Graves2 § 2.1 “sigmoid σ(x)=1/(1+exp(-x))”)).

Regarding claim 15, Graves1 does teach the apparatus according to claim 12, wherein the module parameters Wz, Uz, and Cz are obtained by way of maximizing a 
argmax_P ∑_{n=1}^{N} log p(Y_n | X_n)
where N is a quantity of training sequence pairs in a training sequence set, X_n is a source sequence in an nth training sequence pair, Y_n is a target sequence in the nth training sequence pair, P is a parameter of the apparatus (Col. 4 lines 15-17: “The training module trains” (training) “the networks by using their activations to define a normalised, differentiable distribution Pr(y|x)” (involving a probability or likelihood “Pr” (a parameter of the computing device) from “x” (a source sequence) to “y” (a target sequence)) “output sequences” (from a training sequence) “and optimising the network weights” (e.g., “W_{h^{n-1}h^n}” and “W_{h^n h^n}” in Eq. 9, which serve as module parameters of the activation function) “for example by applying gradient descent to maximize log Pr(z|x)” (by maximizing the likelihood of the said source target training sequence); according to SJOLUND (US 2017/0177812) ¶ 0057 page 7 lines 2+ “maximum likelihood” (“ML”) calculation involves “setting θML=argmaxθ p(d|x,θ)”, where the probability as shown in ¶ 0066 involves a summation over all possible “source” and “target” sequences).


Graves2 does teach the apparatus according to claim 10, wherein dimensions of the weight vector is the same as dimensions of the target context vector (§ 2.1 Eq. 3+: “gu=Wh0 hu +b0” “Wh0” (the weight vector) “is the hidden-output weight matrix” (i.e. it must have the same number of columns (dimensions) as “hu” and/or “gu” (target context vector)).
For obviousness to combine Graves1 and Graves2 see claim 10.

Regarding claim 17, Graves1 does teach the apparatus according to claim 10, wherein each element in the weight vector is a real number greater than 0 and less than 1 (Col. 3 lines 49-52: “the hidden vector sequence” “h_t^n = H(W_{h^{n-1}h^n} h_t^{n-1} + W_{h^n h^n} h_{t-1}^n + b_h^n)” (the weight vector), where “H is usually an elementwise application of a sigmoid function” (Col. 2 lines 66-67), where “sigmoid σ(x)=1/(1+exp(-x))” (it is always less than one and greater than 0)).

Regarding claim 18, Graves1 does not specifically disclose the apparatus according to claim 10, wherein the target context vectors are predicted sequentially, and wherein predicting the ith target context vector corresponding to the ith source context vector 
obtaining the ith target context vector s_i by using the following equation:
s_i = f((1 - z_i){W e(y_{i-1}) + U s_{i-1}} + z_i C c_i)
where s_i is the ith target context vector corresponding to the ith source context vector, f is an activation function, e(y_{i-1}) is the word vector in the ith reference context vector, s_{i-1} is an intermediate state in the ith reference context vector, z_i is the weight vector, c_i is the source context vector, W, U, and C are module parameters of the activation function f, and i represents a sequence number of a vector.
Graves2 does teach the apparatus according to claim 10, wherein the target context vectors are predicted sequentially (§ 2.1 second ¶ lines 1-3: “computes” “the prediction sequence (g0, …, gu)” “from u=0 to u” (the target context vectors are predicted sequentially)), 
and wherein predicting the ith target context vector corresponding to the ith source context vector by using the ith weighted source context vector and the ith weighted reference context vector comprises:
obtaining the ith target context vector s_i by using the following equation:
s_i = f((1 - z_i){W e(y_{i-1}) + U s_{i-1}} + z_i C c_i)
where s_i is the ith target context vector corresponding to the ith source context vector, f is an activation function, e(y_{i-1}) is the word vector in the ith reference context vector, s_{i-1} is an intermediate state in the ith reference context vector, z_i is the weight vector, c_i is the source context vector, W, U, and C are module parameters of the activation function f, and i represents a sequence number of a vector (§ 2.1: “gu=Wh0 H (W ih yu +W hh h u-1 +bh) +b0” (the ith target context vector corresponding to the ith source context vector) “where W ih is the input-hidden weight matrix, W hh is the hidden-hidden weight matrix” (e.g. W or U or C, module parameters) “Wh0 is the hidden-output weight matrix” (is the weight vector) “H is the hidden layer function” “is an elementwise application of” “logistic sigmoid” “functions” (f, an activation function), “yu” corresponds to the “input sequence” (is a word vector, § 2.1 lines 3-4), “h u-1” (is an intermediate state), “b0” (is the reference context vector), and “u” signifies the sequence number of the vectors).
For obviousness to combine Graves1 and Graves2 see claim 10.

Regarding claim 19, Graves1 does teach a non-transitory storage medium storing computer instructions that, when executed by one or more processors in a computing device (Col. 2 lines 6-10: “It will also be appreciated that any module, unit, component, server, computer, terminal or device exemplified herein that executes instructions may include or otherwise have access to computer readable media such as storage media, computer storage media, or data storage devices”; Col. 2 lines 49-52: “Referring now to FIG. 1, a deep RNN is shown. As will be appreciated by a person of skill in the art, any 
cause the computing device to perform a process of converting a source sequence to a target sequence, wherein the source sequence and the target sequence are representations of natural language contents (Col. 1 lines 15-17: “The present invention relates generally to speech recognition” (a method of converting “speech” (a source natural language sequence) to “transcription” (a target sequence) using a speech recognizer computing device) “by neural networks”), 
and wherein the process comprises the steps of:
obtaining the source sequence from an input signal (Col. 2 lines 56-57: “Given an input sequence x=(x1, …, xT)” (obtaining a source sequence) “may compute the hidden vector sequence h=(h1, …, hT)”);
converting the source sequence into one or more source context vectors (Col. 2 lines 56-57: “Given an input sequence x=(x1, …, xT)” (source sequence) “may compute” (converted into) “the hidden vector sequence h=(h1, …, hT)” (one or more source context vectors));
obtaining a reference context vector corresponding to 
n = H(W_{h^{n-1}h^n} h_t^{n-1} + W_{h^n h^n} h_{t-1}^n + b_h^n)” (a weight vector for when n is not 0) “where h^0 = x” (and depends on “h” (source context vector) and “bh” (reference context vector)).
Graves1 does not specifically disclose:
obtaining a target context vector corresponding to each source context vector; 
combining the target context vectors to obtain the target sequence; and outputting the target sequence;
wherein obtaining a target context vector corresponding to each source context vector comprises:
weighting the source context vector and the reference context vector by using the weight vector, to obtain a weighted source context vector and a weighted reference context vector; and
predicting the target context vector corresponding to the source context vector by using the weighted source context vector and the weighted reference context vector.
Graves2 does teach:
obtaining a target context vector corresponding to each source context vector (§2.1 2nd ¶: “Given y” (using the source context vector) “computes the hidden vector 
combining the target context vectors to obtain the target sequence (§2.1 2nd ¶: “the prediction sequence (g0,…,gu)” “gu=Wh0 hu+b0” (the combination of “g0” to “gu” forms a target sequence));
and outputting the target sequence (§ 2 last sentence: “the prediction network” “outputs the prediction vector sequence g=(g0, g1, …, gu)” (outputting the target sequence));
wherein obtaining a target context vector corresponding to each source context vector comprises:
weighting the source context vector and the reference context vector by using the weight vector, to obtain a weighted source context vector and a weighted reference context vector (§ 2.1 Eq. 3: “gu=Wh0 hu +b0” “W h0 is the hidden-output weight matrix” (this equation in “gu” is a weighted sum of “hu” (source context vector) and “b0” (reference context vector) and serves as weighted source and/or reference context vector)); and
predicting the target context vector corresponding to the source context vector by using the weighted source context vector and the weighted reference context vector (§2.1 2nd ¶: “the prediction sequence”  (predicting the target context vector) “gu=Wh0 hu+b0” (using weighted source and reference context vector)).
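For illustration only, the recurrence quoted from Graves2 § 2.1 (a hidden vector computed from the current input and the previous hidden vector, followed by the output equation “gu=Wh0 hu+b0”) can be sketched as follows. This is a minimal sketch under stated assumptions: a logistic sigmoid stands in for the hidden layer function H, the initial hidden vector is zero, and the names `matvec`, `sigmoid`, and `prediction_sequence` are hypothetical. The parameters `W_ho` and `b_o` correspond to the quantities written “Wh0” and “b0” above.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def prediction_sequence(ys, W_ih, W_hh, W_ho, b_h, b_o):
    """Return [g_0, ..., g_U] for an input sequence ys of vectors y_u.

    h_u = H(W_ih y_u + W_hh h_{u-1} + b_h)   (H assumed to be a sigmoid)
    g_u = W_ho h_u + b_o
    """
    h = [0.0] * len(b_h)  # assumed zero initial hidden state
    gs = []
    for y in ys:  # vectors are produced one position at a time
        pre = [a + b + c for a, b, c in zip(matvec(W_ih, y), matvec(W_hh, h), b_h)]
        h = [sigmoid(p) for p in pre]
        gs.append([a + b for a, b in zip(matvec(W_ho, h), b_o)])
    return gs
```

In this sketch each g_u is an affine function of h_u, that is, a product with the hidden-output weight matrix plus the bias term, which is the relationship relied upon in the mapping above.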


Regarding claim 20, Graves1 does teach:
and wherein when a current source context vector is the ith source context vector of the source sequence, obtaining a weight vector zi according to the ith source context vector and the corresponding ith reference context vector comprises:
obtaining the weight vector z_i by using the following equation: z_i = σ(W_z e(y_{i-1}) + U_z s_{i-1} + C_z c_i)
where z_i is the weight vector, σ is an activation function, e(y_{i-1}) is a word vector in the ith reference context vector, s_{i-1} is an intermediate state in the ith reference context vector, c_i is the ith source context vector, W_z, U_z, and C_z are module parameters of the activation function σ, and i represents a sequence number of a vector (Col. 3 lines 49-54: “the hidden vector sequence” “h_t^n = H(W_{h^{n-1}h^n} h_t^{n-1} + W_{h^n h^n} h_{t-1}^n + b_h^n)” (“h_t^n”, or the weight vector, depends on the function “H”, which is according to “h_{t-1}^n” (an intermediate state in the RNN), “b_h^n” (a reference context vector with the same index as “h_t^n”), and “h_t^{n-1}” (for n=1 it reduces to “x”, e.g. a word or word vector, as it is associated with the original “input sequence”, which also makes it a source context vector); finally, “W_{h^{n-1}h^n}” and “W_{h^n h^n}” serve as module parameters of the function “H” (activation function))).
Graves1 does not specifically disclose the non-transitory storage medium according to claim 19, wherein the target context vectors are predicted sequentially.
Graves2 does teach the non-transitory storage medium according to claim 19, wherein the target context vectors are predicted sequentially (§ 2.1 second ¶ lines 1-3: “computes” “the prediction sequence (g0, …, gu)” “from u=0 to u” (the target context vectors are predicted sequentially)).
For obviousness to combine Graves1 and Graves2 see claim 19.
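For illustration only, the weight vector equation recited in claim 20 can be sketched as follows. This is a minimal sketch under stated assumptions: the activation function σ is taken to be a logistic sigmoid applied element-wise, and the names `matvec` and `weight_vector` are hypothetical. Neither assumption is taken from the application or from the cited references.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def weight_vector(W_z, U_z, C_z, e_prev, s_prev, c):
    """Sketch of z_i = sigma(W_z e(y_{i-1}) + U_z s_{i-1} + C_z c_i).

    Assumption: sigma is the logistic sigmoid, so each entry lies in (0, 1).
    """
    pre = [a + b + d for a, b, d in zip(matvec(W_z, e_prev),
                                        matvec(U_z, s_prev),
                                        matvec(C_z, c))]
    return [1.0 / (1.0 + math.exp(-p)) for p in pre]
```

Under the sigmoid assumption every entry of z_i lies strictly between 0 and 1, so it can serve as a per-dimension mixing weight between the source context vector and the reference context vector.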

Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to FARZAD KAZEMINEZHAD whose telephone number is (571)270-5860.  The examiner can normally be reached on 10:30 am to 11:30 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, DANIEL C WASHBURN can be reached on (571)272-5551.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For 






/Farzad Kazeminezhad/
Art Unit 2657
June 2nd 2021.