Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.  The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function;
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. 
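Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.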
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier.  Such claim limitation(s) is/are: 
a “subsystem configured to” in claims 1 and 13.
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may:  (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph.
 


Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1, 2, 4, 8, 9, 10, 11, 13, 14, 16, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Song (“A Unified Query-based Generative Model for Question Generation and Question Answering”, 2017) in view of Li (US 20170109355 A1).

Regarding claim 1, Song teaches an encoder neural network configured to: receive an input question sequence comprising a respective question token ([p. 1 Col. 2] "in the QA task, the input query is a question, and the decoder generates the corresponding answer" [p. 2 Col. 1] "The model takes two components as input: a passage P = (p1, ..., pj , ..., pN ) of length N, and a query Q = (q1, ..., qi , ..., qM) of length M, then generates the output sequence X = (x1, ..., xL) word by word" See also Encoder in FIG. 1.  Query vector interpreted as synonymous with question token.).
at each of a plurality of encoder time steps, and ([p. 2 Col. 1] "The encoder matches each time-step of the passage against all time-steps of the query from multiple perspectives").
for each of the encoder time steps, process the question token at the encoder time step to generate an encoded representation of the question token; ([p. 2 Col. 1] "The encoder matches each time-step of the passage against all time-steps of the query from multiple perspectives, and encodes the matching result into a “Multi-perspective Memory”").
a decoder recurrent neural network configured to, at each of a plurality of decoder time steps: receive a decoder input at the decoder time step, and ([p. 2 Col. 1] "In addition, the decoder generates the output sequence one word at a time based on the “Multi-perspective Memory”." [p. 3 Col. 2] "The decoder takes the “Multi-perspective Memory” as the attention memory, and generates the output one word at a time.").
process the decoder input and a preceding decoder hidden state to generate an updated decoder hidden state for the decoder time step; and ([p. 3 Col. 2] "For each time-step t, the decoder first feeds the concatenation of the previous word embedding xt−1 and context vector ct−1 into the LSTM model to update the hidden state").
a subsystem configured to: at each of the encoder time steps: determine whether the question token at the encoder time step satisfies one or more criteria for adding a variable representing the question token to a vocabulary of possible outputs; and ([p. 3 Col. 1] See FIG. 2.  "The inputs include the contextual vector of one time-step of the passage (left orange block) and the contextual vectors of all time steps of the query (right blue blocks). The output is a vector of matching values (top green block) calculated via fm").
when the question token at the encoder time step satisfies the one or more criteria, add the variable to the vocabulary of possible outputs ([p. 3 Col. 1] See FIG. 2.  "The inputs include the contextual vector of one time-step of the passage (left orange block) and the contextual vectors of all time steps of the query (right blue blocks). The output is a vector of matching values (top green block) calculated via fm").
and associate the encoded representation of the question token as an encoded representation for the variable; and (See FIG. 2.  The encoded representation of the question token at that time step are associated with the output variable.).
at each of the decoder time steps: determine, from the updated decoder hidden state at the decoder time step and from respective encoded representations for possible outputs in the vocabulary of possible outputs, ([p. 4 Col. 1] "gt is the switch for controlling generating a word from the vocabulary or directly copying it from the passage. Pvocab is the generating probability distribution as defined above, and Pattn is calculated based on the attention distribution αt by merging probabilities of duplicated words. Intuitively, gt is relevant to the current decoder state").
a respective output score for each possible output in the vocabulary of possible outputs, and ([p. 4 Col. 2] "Ys = ys0, ..., ysT is the sampled sequence, Yˆ is the sequence generated from a baseline, and the function r(Y ) is the reward calculated based on the evaluation metric...for the QA task, we use the ROUGE score (Lin, 2004) as the reward." Song explicitly teaches that output probability is based on a scoring metric propagated from the decoder.).
select, using the output scores, an output from the vocabulary of possible outputs as a decoder output at the decoder time step. ([p. 4 Col. 1] "gt is the switch for controlling generating a word from the vocabulary or directly copying it from the passage. Pvocab is the generating probability distribution as defined above, and Pattn is calculated based on the attention distribution αt by merging probabilities of duplicated words. Intuitively, gt is relevant to the current decoder state" [p. 3 Col. 2] "The decoder takes the “Multi-perspective Memory” as the attention memory, and generates the output one word at a time").
However, Song does not explicitly teach A system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to implement: 

Li, who teaches a related art of a neural network based question answering system, teaches A system comprising one or more computers and one or more storage devices storing instructions that when executed by the one or more computers cause the one or more computers to implement: ([¶0018] “Furthermore, one skilled in the art will recognize that embodiments of the present invention, described below, may be implemented in a variety of ways, such as a process, an apparatus, a system, a device, or a method on a tangible computer-readable medium.” [¶0019] “It should be noted that functions or operations discussed herein may be implemented as components. Components may be implemented in software, hardware, or a combination thereof.” Implemented in software interpreted as synonymous with instructions that when executed by a computer cause the one or more computers to implement. [¶0100] “For example, a computing system may be a personal computer (e.g., laptop), tablet computer, phablet, personal digital assistant (PDA), smart phone, smart watch, smart package, server (e.g., blade server or rack server), a network storage device, or any other suitable device and may vary in size, shape, performance, functionality, and price”).

Song and Li are both directed towards neural network based question answering systems.  It would have been obvious to one of ordinary skill in the art before the effective filing date to combine the teachings of Song with the teachings of Li by implementing the method taught in Song on a generic computer system comprising one or more computers and one or more storage devices storing instructions that when executed cause the computer to implement the method.  Li explicitly teaches ([¶0018] “Furthermore, one skilled in the art will recognize that embodiments of the present invention, described below, may be implemented in a variety of ways, such as a process, an apparatus, a system, a device, or a method on a tangible computer-readable medium.”).

Regarding claim 2, Song teaches The system of claim 1, wherein the possible outputs in the vocabulary of possible outputs are tokens from computer program expressions, and wherein the tokens include, for each of a plurality of functions, a function identifier for the function and possible arguments to the function. ([p. 3 Col. 2] "Concretely, while generating the t-th word xt, the decoder considers five factors as the input: (1) the “Multi-perspective Memory” H = {h0, ..., hi, ..., hN}, where each vector hi ∈ H aligns to the i-th word in the passage; (2) the previous hidden state of the LSTM model st−1; (3) the embedding of previously generated word xt−1; (4) the previous context vector ct−1, which is calculated from the attention mechanism with H being the attentional memory; and (5) the previous coverage vector ut−1, which is the accumulation of all attention distributions so far. When t = 0, we initialize s−1, c−1 and u−1 as zero vectors, and fix x−1 to be the embedding of the sentence start token “<s>”." Sentence start token interpreted as computer program function identifier.).

Regarding claim 4, Song teaches The system of claim 2, wherein selecting the output from the vocabulary of possible outputs comprises: identifying as a valid output for the decoder time step any output from the vocabulary of possible outputs that would not cause a semantic error or a syntax error when following an output at the preceding decoder time step; ([p. 3 Col. 2] "Intuitively, each perspective calculates the cosine similarity between two input vectors, and it is associated with a weight vector trained to highlight different dimensions of the input vectors. This can be regarded as considering different part of the semantics captured in the vector" Considering parts of semantics in the matching algorithm interpreted as synonymous with selecting possible outputs that would not cause a semantic error when following output at the preceding decoder time step.).
and selecting the output only from the valid outputs for the decoder time step. ([p. 3 Col. 2] "The final matching vector mj for each time-step of the passage is the concatenation of the matching results of all four strategies. We also employ another BiLSTM layer on top of the matching layer to smooth the matching results.  We concatenate the contextual vectors, hpj, of the passage and matching vectors to be the Multi-perspective Memory H, which contains both the passage information and the matching information." Matching vectors interpreted as synonymous with valid outputs.).

Regarding claim 8, Song teaches The system of claim 1, wherein the subsystem is further configured to, at each of the decoder time steps: generate, using the updated decoder hidden state at the decoder time step, a context vector that corresponds to a weighted combination of the encoded representations of the question tokens; and ([p. 3 Col. 2] "Second, the attention distribution αt,i for each time-step of the “Multi-perspective Memory” hi ∈ H is calculated with the following equations...Wh, Ws, Wv, be and ve are learnable parameters. The coverage vector ut is then updated by ut = ut−1 + αt. And the new context vector ct is calculated via: [See Eqn. for ct]").
generate, using the updated decoder hidden state at the decoder time step and the context vector that corresponds to the weighted sum over the encoded representation of the question tokens, an initial output vector at the decoder time step. (See Eqns p. 3 Col. 2  [p. 3 Col. 2] "The decoder takes the “Multi-perspective Memory” as the attention memory, and generates the output one word at a time...For each time-step t, the decoder first feeds the concatenation of the previous word embedding xt-1 and context vector ct-1 into the LSTM model to update the hidden state" Context equation in Song is a sum of weighted sums representing question tokens.). 
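
For illustration only, the attention step cited from Song above (a context vector formed as a weighted combination of memory vectors, with the coverage vector updated by ut = ut−1 + αt) can be sketched as follows. This is a minimal sketch under stated assumptions, not Song's implementation: the function names are hypothetical, and the unnormalized attention scores are taken as given inputs because the scoring equation is not reproduced in this Office action.

```python
import math


def softmax(scores):
    """Normalize raw scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_step(memory, scores, coverage):
    """One decoder time step of attention (hypothetical helper).

    memory   -- list of memory vectors h_i (one per encoder time step)
    scores   -- unnormalized attention scores, one per memory vector
                (assumed to be computed elsewhere)
    coverage -- previous coverage vector u_{t-1}, one entry per memory vector
    """
    alpha = softmax(scores)  # attention distribution alpha_t
    dim = len(memory[0])
    # Context vector c_t: weighted combination of the memory vectors.
    context = [sum(a * h[d] for a, h in zip(alpha, memory)) for d in range(dim)]
    # Coverage update: u_t = u_{t-1} + alpha_t.
    new_coverage = [u + a for u, a in zip(coverage, alpha)]
    return alpha, context, new_coverage


alpha, context, coverage = attention_step(
    memory=[[1.0, 0.0], [0.0, 1.0]],
    scores=[0.0, 0.0],
    coverage=[0.0, 0.0],
)
```

With equal scores, the two memory vectors receive equal weight, so the context vector is their average.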

Regarding claim 9, Song teaches The system of claim 8, wherein the subsystem is further configured to, at each of the decoder time steps: calculate, for at least a plurality of the encoded representations, a similarity measure between the initial output vector at the decoder time step and the respective encoded representations for the possible outputs in the vocabulary of possible outputs; and ([p. 3 Col. 1] "we first calculate the cosine similarities between each forward (or backward) contextual vector of the passage and every forward (or backward) contextual vectors of the question.  Then, we take the cosine similarities as the weights, and calculate an attentive vector for the entire query by weighted summing all the contextual vectors of the query" See LSTM decoder implementation for relation to time step.).
generate, using the calculated similarity measure between the initial output vector at the decoder time step and the respective encoded representations for the possible outputs in the vocabulary of possible outputs, a respective logit for each possible output in the vocabulary of possible outputs. ([p. 3 Col. 1] "Finally, we match each forward (or backward) contextual vector of the passage with its corresponding attentive vector." [p. 3 Col. 2] "(1) the “Multi-perspective Memory” H = {h0, ..., hi, ..., hN}, where each vector hi ∈ H aligns to the i-th word in the passage; (2) the previous hidden state of the LSTM model st-1; (3) the embedding of previously generated word xt-1; (4) the previous context vector ct-1, which is calculated from the attention mechanism with H being the attentional memory; and (5) the previous coverage vector ut-1, which is the accumulation of all attention distributions so far" Attention distribution interpreted as synonymous with logit.).
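
For illustration only, the similarity computation quoted from Song at p. 3 Col. 1 (cosine similarities used as weights over the contextual vectors of the query) can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the weighted sum is left unnormalized because the quoted passage does not state whether a normalization is applied.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def attentive_vector(passage_vec, query_vecs):
    """Weighted sum of query vectors, weighted by cosine similarity
    to one passage vector (hypothetical helper)."""
    weights = [cosine(passage_vec, q) for q in query_vecs]
    dim = len(query_vecs[0])
    return [sum(w * q[d] for w, q in zip(weights, query_vecs)) for d in range(dim)]


# A passage vector aligned with the first query vector and orthogonal
# to the second picks out the first query vector.
result = attentive_vector([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```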

Regarding claim 10, Song teaches The system of claim 9, wherein the subsystem is configured to, at each of the decoder time steps: select, using the respective output score for each possible output in the vocabulary of possible outputs and the logits for each possible output in the vocabulary of possible outputs, an output from the vocabulary of possible outputs as a decoder output at the decoder time step. ([p. 4 Col. 1] "The probability distribution is defined as the interpolation between two probability distributions: Pfinal = gtPvocab + (1 - gt)Pattn, where gt is the switch for controlling generating a word from the vocabulary or directly copying it from the passage. Pvocab is the generating probability distribution as defined above, and Pattn is calculated based on the attention distribution αt by merging probabilities of duplicated words. Intuitively, gt is relevant to the current decoder state, the attention results and the input." Song explicitly teaches that gt is used to select an output word and that gt is relevant to the current decoder state which is interpreted as synonymous with the decoder time step.).
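
For illustration only, the interpolation quoted from Song, Pfinal = gtPvocab + (1 - gt)Pattn, with Pattn obtained by merging the attention mass of duplicated passage words, can be sketched as follows. This is a minimal sketch, not Song's implementation: the function names, the dictionary representation, and the example values are assumptions, and the switch gt is passed in as a plain number rather than computed from the decoder state.

```python
def copy_distribution(passage_tokens, attention):
    """Merge attention probabilities of duplicated passage words (P_attn)."""
    merged = {}
    for token, weight in zip(passage_tokens, attention):
        merged[token] = merged.get(token, 0.0) + weight
    return merged


def final_distribution(p_vocab, p_attn, g):
    """P_final = g * P_vocab + (1 - g) * P_attn over the union of words."""
    words = set(p_vocab) | set(p_attn)
    return {
        w: g * p_vocab.get(w, 0.0) + (1.0 - g) * p_attn.get(w, 0.0)
        for w in words
    }


# Hypothetical values: "a" occurs twice in the passage, so its attention
# mass (0.2 + 0.3) is merged before interpolation.
p_attn = copy_distribution(["a", "b", "a"], [0.2, 0.5, 0.3])
p_final = final_distribution({"a": 0.1, "c": 0.9}, p_attn, g=0.5)
selected = max(p_final, key=p_final.get)  # highest-probability output
```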

Regarding claim 11, Song teaches The system of claim 9, wherein the subsystem is configured to determine the respective output score for each possible output in the vocabulary of possible outputs by applying a softmax over the respective logit for each possible output in the vocabulary of possible outputs. ([p. 4 Col. 1] "the output probability distribution over a vocabulary of words at the current state is calculated by: Pvocab = softmax(V2(V1[st; ct] + b1) + b2), where V1, V2, b1 and b2 are learnable parameters. The number of rows in V2 represents the number of words in the vocabulary"). 
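
For illustration only, the formula quoted from Song, Pvocab = softmax(V2(V1[st; ct] + b1) + b2), can be sketched as follows. This is a minimal sketch under stated assumptions: the helper names and the toy parameter values are hypothetical, and the parameters are plain Python lists rather than learned weights.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def vec_add(a, b):
    return [x + y for x, y in zip(a, b)]


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def vocab_distribution(s_t, c_t, V1, b1, V2, b2):
    """P_vocab = softmax(V2 (V1 [s_t; c_t] + b1) + b2).

    The number of rows in V2 equals the vocabulary size, so the result
    has one probability per vocabulary word.
    """
    concat = s_t + c_t  # [s_t; c_t], list concatenation
    hidden = vec_add(matvec(V1, concat), b1)
    logits = vec_add(matvec(V2, hidden), b2)
    return softmax(logits)


# Toy parameters (assumed): identity V1, zero V2, and a bias b2 that
# favors the third of three vocabulary words.
probs = vocab_distribution(
    s_t=[1.0], c_t=[0.0],
    V1=[[1.0, 0.0], [0.0, 1.0]], b1=[0.0, 0.0],
    V2=[[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]], b2=[0.0, 0.0, 1.0],
)
```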

Regarding claim 13, claim 13 effectively mirrors claim 1 and is therefore rejected under a similar interpretation.

Regarding claim 14, claim 14 effectively mirrors claim 2 and is therefore rejected under a similar interpretation.

Regarding claim 16, claim 16 effectively mirrors claim 4 and is therefore rejected under a similar interpretation.

Regarding claim 20, claim 20 effectively mirrors claim 8 and is therefore rejected under a similar interpretation.

Claims 3 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Song and Li, in view of Yin (“Neural Generative Question Answering”, 2015). 

Regarding claim 3, Song teaches The system of claim 2.  
However, Song does not explicitly teach wherein determining whether the one or more criteria are satisfied comprises: determining whether the question token at the encoder time step identifies an entity that is represented in a knowledge base; and wherein the subsystem is further configured to: 
in response to determining that the question token at the encoder time step identifies an entity that is represented in the knowledge base, linking the variable representing the question token to the entity that is represented in the knowledge base.  

Yin, who teaches a related art of a sequence to sequence encoder-decoder framework for generating responses to questions, teaches wherein determining whether the one or more criteria are satisfied comprises: determining whether the question token at the encoder time step identifies an entity that is represented in a knowledge base; and wherein the subsystem is further configured to: ([p. 4 §3] "Enquirer takes HQ as input to interact with the knowledge-base in the long-term memory, retrieves relevant facts (triples) from the knowledge-base, and summarizes the result in a vector rQ.").
in response to determining that the question token at the encoder time step identifies an entity that is represented in the knowledge base, linking the variable representing the question token to the entity that is represented in the knowledge base. ([p. 5 §3.1] "Enquirer “fetches” relevant facts from the knowledge-base with HQ (as illustrated in Figure 2). Enquirer first performs term-level matching (similar to the method of associating question-answer pairs with triples described in Section 2) to retrieve a list of relevant candidate triples" term-level matching between knowledge base and questions interpreted as synonymous with linking the variable representing the question token to the entity represented in the knowledge base.). 

Both Song and Yin are directed to using a sequence to sequence encoder-decoder framework for generating responses to questions.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to substitute the unstructured text used in Song with a knowledge base to generate answers as taught by Yin.  Song explicitly references the use of a knowledge base in Yin [p. 7 Col. 1] as a key difference in the taught model.  Yin teaches as a motivation for substitution ([p. 2 §1] “The model is trained on a dataset composed of real world question-answer pairs associated with triples in the knowledge-base, in which all components of the model are jointly tuned. Empirical study shows the proposed model can effectively capture the variation of language and generate right and natural answers to the questions by referring to the facts in the knowledge-base.”).  

Regarding claim 15, claim 15 effectively mirrors claim 3 and is therefore rejected under a similar interpretation.

Claims 5-7 and 17-19 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Song and Li, in view of Jiwei Li (“A Hierarchical Neural Autoencoder for Paragraphs and Documents”, 2015). 

Regarding claim 5, Song teaches The system of claim 2, when the selected decoder output at the decoder time step is a final token in a computer program expression that identifies a function and one or more arguments to the function: ([p. 3 Col. 2] "we initialize s-1, c-1 and u-1 as zero vectors, and fix x-1 to be the embedding of the sentence start token “<s>”" Token embedding interpreted as synonymous with token function argument.). 
However, Song does not explicitly teach wherein the subsystem is further configured to, at each of the decoder time steps: determine whether the selected decoder output at the decoder time step is a final token in a computer program expression that identifies a function and one or more arguments to the function; and 
execute the function with the one or more arguments as inputs to determine a function output.  

Jiwei Li, in the same field of endeavor, teaches wherein the subsystem is further configured to, at each of the decoder time steps: determine whether the selected decoder output at the decoder time step is a final token in a computer program expression that identifies a function and one or more arguments to the function; and ([p. 2 §2 Col. 2] "ht−1 is the representation outputted from the LSTM at time t − 1. Note that each sentence ends up with a special end-of-sentence symbol <end>. Commonly, the input and output use two different LSTMs with different sets of convolutional parameters for capturing different compositional patterns. In the decoding procedure, the algorithm terminates when an <end> token is predicted" sentence symbol <end> interpreted as synonymous with final token that identifies a function and one or more arguments to the function.).
execute the function with the one or more arguments as inputs to determine a function output. ([p. 2 §3.1] "An additional “endD” token is appended to each document...each sentence ending with an “ends” token. The word w is associated with a K-dimensional embedding ew, ew = {e1w, e2w, ..., eKw}" Appending function interpreted as function identified by <end> symbol.  Appending interpreted as executing said function.  Embeddings interpreted as synonymous with arguments.). 

Both Song and Jiwei Li are directed to a sequence to sequence LSTM based encoder-decoder framework for generative question answering.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the teachings of Song with the teachings of Jiwei Li by applying the tokens described by Jiwei Li to the decoder token outputs already described in Song.  While the use of tokens is briefly outlined in Song, Jiwei Li gives a much more detailed explanation of the usage of tokens in a generative question answering system and offers further motivation for combination with question answering systems ([p. 8 §5] “the autoencoder described in this work, where input sequence X is identical to output Y, is only the most basic instance of the family of document (paragraph)-to-document (paragraph) generation tasks. We hope the ideas proposed in this paper can play some role in enabling such more sophisticated generation tasks like summarization, where the inputs are original documents and outputs are summaries or question answering, where inputs are questions and outputs are the actual wording of answers. Sophisticated generation tasks like summarization or dialogue systems could extend this paradigm, and could themselves benefit from task-specific adaptations.”).

Regarding claim 6, the combination of Song and Jiwei Li teaches The system of claim 5, wherein the subsystem is further configured to, when the selected decoder output at the decoder time step is a final token in a computer program expression that identifies a function (Song [p. 3 Col. 2] "When t = 0, we initialize s-1, c-1 and u-1 as zero vectors, and fix x-1 to be the embedding of the sentence start token “<s>”." Sentence start token interpreted as synonymous with final token that identifies a function.).
and one or more arguments to the function: (Song [p. 3 Col. 2] "When t = 0, we initialize s-1, c-1 and u-1 as zero vectors, and fix x-1 to be the embedding of the sentence start token “<s>”." Embedding interpreted as synonymous with adding arguments to the function.).
add a variable representing the function output to the vocabulary of possible outputs and associate the decoder hidden state at the decoder time step as an encoded representation for the variable. (Song [p. 3 Col. 2] "When t = 0, we initialize s-1, c-1 and u-1 as zero vectors, and fix x-1 to be the embedding of the sentence start token “<s>”. For each time-step t, the decoder first feeds the concatenation of the previous word embedding xt-1 and context vector ct-1 into the LSTM model to update the hidden state" "<s>" interpreted as variable representing the start token function.). 

Regarding claim 7, the combination of Song and Jiwei Li teaches (Song [p. 3 Col. 2] "we initialize s-1, c-1 and u-1 as zero vectors, and fix x-1 to be the embedding of the sentence start token “<s>”" Sentence start token interpreted as special final output token.).
determine whether the selected decoder output at the decoder time step is the special final output token; and (Jiwei Li [p. 2 §2 Col. 2] "ht−1 is the representation outputted from the LSTM at time t − 1. Note that each sentence ends up with a special end-of-sentence symbol <end>. Commonly, the input and output use two different LSTMs with different sets of convolutional parameters for capturing different compositional patterns. In the decoding procedure, the algorithm terminates when an <end> token is predicted" sentence symbol <end> interpreted as synonymous with special final output token that identifies a function and one or more arguments to the function.).
when the selected decoder output at the decoder time step is the special final output token: select a most recently generated function output as a system output for the input sequence. (Jiwei Li [p. 3 Col. 1] "The vector output at the ending time-step is used to represent the entire sentence as [See Eqn]...To build representation eD for the current document/paragraph D, another layer of LSTM (denoted as LSTMsentence/encode) is placed on top of all sentences, computing representations sequentially" [p. 3 Col. 2] "For each timestep t, LSTMsentence decode has to first decide whether decoding should proceed or come to a full stop" document/paragraph D interpreted as synonymous with system output.  Determining that decoding should come to a full stop interpreted as synonymous with generating system output for a most recently generated function output.). 

Regarding claim 17, claim 17 effectively mirrors claim 5 and is therefore rejected under a similar interpretation.

Regarding claim 18, claim 18 effectively mirrors claim 6 and is therefore rejected under a similar interpretation.

Regarding claim 19, claim 19 effectively mirrors claim 7 and is therefore rejected under a similar interpretation.


Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Song and Li, in view of Swait (“Probabilistic choice (models) as a result of balancing multiple goals”, 2013). 

Regarding claim 12, Song teaches The system of claim 11.  
However, Song does not explicitly teach wherein the subsystem is configured to, prior to determining the respective output score for each possible output, set the logit for outputs from the vocabulary of possible outputs that would not be valid outputs for the decoder time step to a value that is mapped to zero by the softmax.  

Swait, in the same field of endeavor, teaches wherein the subsystem is configured to, prior to determining the respective output score for each possible output, set the logit for outputs from the vocabulary of possible outputs that would not be valid outputs for the decoder time step to a value that is mapped to zero by the softmax. ([p. 12 Col. 2] "We term our fifth, and final, example of a goal formulation as “goal satisficing”. Assume that this goal is achieved if an alternative’s value surpasses some goal-specific threshold, in which case the result is coded as unity; if the value does not surpass the threshold, the result is coded as negative infinity. That is [Eqn. 31]...A solution to this goal-driven choice can be obtained in different ways. For example, as we did with prior examples, one might assume a weighted combination of the ES and the CSE (choice set entropy— expression (2)) goals to derive the MNL-like formulation below (given by expression (5), with the weight  now associated with goal ES): [Eqn. 33]").

Swait teaches mathematical behavioral conditioning and Song teaches a neural network method.  As neural networks are directed towards mathematical representations of biological systems, the two arts are directed towards relevant and overlapping subject matter.  It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the teachings of Song with the teachings of Swait by setting invalid logits to negative infinity so that, when passed through a softmax filter, they output zero. Passing negative infinity through a softmax filter to return a limit approaching zero is a known mathematical concept and its usage is reinforced in Swait.  Swait summarizes and provides motivation for combination ([p. 4 §1] “Section 2 develops and demonstrates a multiple goal driven representation of individual choice within which high-level goals (specifically, motivations) are given an explicit role in generating probabilistic choice outcomes from deterministic valuations. Specific formulations of goal-driven decision making are shown to underlie the well-known MNL model. Section 3 presents extensions of the framework to include both the nested logit and a MNL variant that explicitly accounts for prior choices or social influences. Section 4 begins with an in-depth discussion of certain aspects of the framework and presents a number of open issues and directions for future research.”).
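For illustration only, the mathematical concept referenced above (a logit of negative infinity is mapped to zero by the softmax) can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Swait, Song, or the application: the function name `masked_softmax`, the `valid` flags, and the example values are hypothetical.

```python
import math


def masked_softmax(logits, valid):
    """Softmax in which every entry flagged invalid has its logit set to
    negative infinity first, so it receives a probability of exactly zero."""
    neg_inf = float("-inf")
    masked = [x if ok else neg_inf for x, ok in zip(logits, valid)]
    peak = max(masked)
    if peak == neg_inf:
        raise ValueError("at least one output must be valid")
    # Subtracting the maximum is for numerical stability; math.exp(-inf) is 0.0.
    exps = [math.exp(x - peak) for x in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical vocabulary of three outputs; the second is invalid at this step.
probs = masked_softmax([2.0, 1.0, 0.5], [True, False, True])
```

The masked entry contributes nothing to the normalizing sum, so the remaining probabilities still sum to one.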

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: Kaiser (“LEARNING TO REMEMBER RARE EVENTS”, 2017).
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SIDNEY VINCENT BOSTWICK whose telephone number is (571)272-4720.  The examiner can normally be reached on M-F 7:30am-5:00pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang can be reached on (571)270-7092.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/SB/Examiner, Art Unit 2124


/MIRANDA M HUANG/Supervisory Patent Examiner, Art Unit 2124