DETAILED ACTION

Introduction
This office action is in response to Applicant’s submission filed on 07/22/2020. Claims
1-20 are pending in the application. As such, claims 1-20 have been examined.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority
This application makes reference to or appears to claim subject matter disclosed in Application No. 62877076, filed on 7/22/2019. If applicant desires to claim the benefit of a prior-filed application under 35 U.S.C. 119(e), 120, 121, 365(c)  or 386(c), the instant application must contain, or be amended to contain, a specific reference to the prior-filed application in compliance with  37 CFR 1.78. If the application was filed before September 16, 2012, the specific reference must be included in the first sentence(s) of the specification following the title or in an application data sheet (ADS) in compliance with pre-AIA  37 CFR 1.76; if the application was filed on or after September 16, 2012, the specific reference must be included in an ADS in compliance with 37 CFR 1.76. For benefit claims under 35 U.S.C. 120, 121, 365(c), or 386(c), the reference must include the relationship (i.e., continuation, divisional, or continuation-in-part) of the applications.
If the instant application is a utility or plant application filed under  35 U.S.C. 111(a), the specific reference must be submitted during the pendency of the application and within the later of four months from the actual filing date of the application or sixteen months from the filing date of the prior application. If the application is a national stage application under 35 U.S.C. 371, the specific reference must be submitted during the pendency of the application and within the later of four months from the date on which the national stage commenced under 35 U.S.C. 371(b) or (f), four months from the date of the initial submission under 35 U.S.C. 371 to enter the national stage, or sixteen months from the filing date of the prior application. See 37 CFR 1.78(a)(4) for benefit claims under 35 U.S.C. 119(e) and 37 CFR 1.78(d)(3) for benefit claims under 35 U.S.C. 120, 121, 365(c), or 386(c). This time period is not extendable and a failure to submit the reference required by 35 U.S.C. 119(e) and/or 120, where applicable, within this time period is considered a waiver of any benefit of such prior application(s) under 35 U.S.C. 119(e), 120, 121, 365(c), and 386(c). A benefit claim filed after the required time period may be accepted if it is accompanied by a grantable petition to accept an unintentionally delayed benefit claim under 35 U.S.C. 119(e)  (see 37 CFR 1.78(c)) or under  35 U.S.C. 120, 121, 365(c), or 386(c) (see 37 CFR 1.78(e)). The petition must be accompanied by (1) the reference required by 35 U.S.C. 120 or 119(e) and by 37 CFR 1.78 to the prior application (unless previously submitted), (2) the petition fee under 37 CFR 1.17(m), and (3) a statement that the entire delay between the date the benefit claim was due under 37 CFR 1.78 and the date the claim was filed was unintentional. The Director may require additional information where there is a question whether the delay was unintentional. 
The petition should be addressed to: Mail Stop Petition, Commissioner for Patents, P.O. Box 1450, Alexandria, Virginia 22313-1450.
If the reference to the prior application was previously submitted within the time period set forth in  37 CFR 1.78  but was not included in the location in the application required by the rule (e.g., if the reference was submitted in an oath or declaration or the application transmittal letter), and the information concerning the benefit claim was recognized by the Office as shown by its inclusion on the first filing receipt, the petition under  37 CFR 1.78 and the petition fee under  37 CFR 1.17(m)  are not required. Applicant is still required to submit the reference in compliance with  37 CFR 1.78  by filing an ADS in compliance with 37 CFR 1.76 with the reference (or, if the application was filed before September 16, 2012, by filing either an amendment to the first sentence(s) of the specification or an ADS in compliance with pre-AIA  37 CFR 1.76). See MPEP § 211.02.

Information Disclosure Statement
The information disclosure statements (IDS) submitted on 7/22/2020, 5/4/2022, 5/23/2022, 5/25/2022, and 6/9/2022 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.

Drawings
The drawings filed on 07/22/2020 have been accepted and considered by the Examiner.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:

A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 2, 5, 6, and 9 are rejected under 35 U.S.C. 103 as being unpatentable over Sordoni, Alessandro, et al. (“A hierarchical recurrent encoder-decoder for generative context-aware query suggestion.” Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. 2015.), hereinafter Sordoni, in view of Radford, Alec, et al. (“Improving language understanding by generative pre-training.” (2018).), hereinafter Radford.

Regarding claim 1, Sordoni discloses: A computer-implemented method, comprising: initializing a model having a sequence to sequence network architecture, wherein the sequence to sequence network architecture comprises: an encoder; and a decoder ([sect 2] The contribution of our hierarchical recurrent encoder decoder is two-fold.);
training the model based on a training set comprising a plurality of encoder sequences and a plurality of decoder sequences, wherein training the model comprises: generating an encoding of each encoder sequence and each decoder sequence in the training set (Figure 3: The hierarchical recurrent encoder-decoder (HRED). The user types cleveland gallery → lake erie art. During training, the model encodes cleveland gallery, updates the session-level recurrent state and maximize the probability of seeing the following query lake erie art. The process is repeated for all queries in the session. During testing, a contextual suggestion is generated by encoding the previous queries, by updating the session-level recurrent states accordingly and by sampling a new query from the last obtained session-level recurrent state. In the example, the generated contextual suggestion is cleveland indian art.);
selecting a subset of the encodings ([sect 1] The query log is partitioned into query sessions, i.e. sequences of queries issued by a unique user and submitted within a short time interval.);
and for each encoding of the encoder sequences: training the encoder using the encoding of the encoder sequence ([Sect 2] The session-level RNN models the sequence of the previous queries, contextualizing the prediction of the next query. Similar contexts are mapped close to each other in the vector space.);
and training the decoder using the encoding of the decoder sequence corresponding to the encoder sequence ([Sect 2] The session-level RNN models the sequence of the previous queries, contextualizing the prediction of the next query. Similar contexts are mapped close to each other in the vector space.);
and generating a prediction based on an input data set using the trained model ([Sect 3.2] Our hierarchical recurrent encoder-decoder (HRED) is pictured in Figure 3. Given a query in the session, the model encodes the information seen up to that position and tries to predict the following query.).

Sordoni does not explicitly disclose, but Radford discloses: appending an informative padding to each of the selected subset of encodings ([sect 3.3] All transformations include adding randomly initialized start and end tokens (<s>, <e>).);
prepending a start of sequence token to each of the encodings of the encoder sequences ([sect 3.3] All transformations include adding randomly initialized start and end tokens (<s>, <e>).);
appending an end of sequence token to each of the encodings of the decoder sequences ([sect 3.3] All transformations include adding randomly initialized start and end tokens (<s>, <e>).);
Sordoni and Radford are considered analogous art because both are in the related art of natural language understanding. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Sordoni with the teachings of Radford to incorporate appending padding and adding start and end of sequence tokens to each of the encodings. One would have been motivated to combine these disclosures because doing so would achieve strong natural language understanding with a single task-agnostic model through generative pre-training and discriminative fine-tuning, as suggested by Radford (sect 6, Conclusion).
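For illustration only, the token handling recited in claim 1 can be sketched as follows. This is a minimal sketch under stated assumptions, not the method of the application or of either reference: the function name prepare_pair, the token ids, and the fixed max_len are hypothetical, and ordinary padding stands in for the claimed "informative padding", whose content is not defined in this action.

```python
from typing import List, Tuple

# Hypothetical special-token ids; neither reference fixes these values.
PAD_ID, SOS_ID, EOS_ID = 0, 1, 2


def prepare_pair(encoder_ids: List[int], decoder_ids: List[int],
                 max_len: int) -> Tuple[List[int], List[int]]:
    """Prepend a start token to the encoder sequence, append an end token
    to the decoder sequence, then right-pad both to max_len."""
    enc = [SOS_ID] + list(encoder_ids)   # start of sequence token goes first
    dec = list(decoder_ids) + [EOS_ID]   # end of sequence token goes last
    if len(enc) > max_len or len(dec) > max_len:
        raise ValueError("sequence longer than max_len")
    # Plain padding (an assumption); it fills each sequence up to max_len.
    enc += [PAD_ID] * (max_len - len(enc))
    dec += [PAD_ID] * (max_len - len(dec))
    return enc, dec
```

Under these assumptions, prepare_pair([5, 6], [7], 4) yields the encoder sequence [1, 5, 6, 0] and the decoder sequence [7, 2, 0, 0].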

Regarding claim 2, Sordoni in view of Radford discloses: The computer-implemented method of claim 1, 
Radford further discloses: wherein each encoding comprises an attention weight for each token in the encoding ([Sect 3.1] This model applies a multi-headed self-attention operation over the input context tokens followed by position-wise feedforward layers to produce an output distribution over target tokens:).
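As a hedged illustration of what "an attention weight for each token" may look like, the sketch below computes single-head scaled dot-product attention weights with a softmax. It is an assumption-laden simplification: the multi-headed operation quoted from Radford uses learned projections that are omitted here, and the function name attention_weights is hypothetical.

```python
import math
from typing import List


def attention_weights(query: List[float], keys: List[List[float]]) -> List[float]:
    """Return one weight per token (one per key vector); weights sum to 1."""
    scale = math.sqrt(len(query))
    # Scaled dot product between the query and each token's key vector.
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    # Numerically stable softmax over the scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Each input token receives exactly one weight, and tokens whose key vectors align more closely with the query receive larger weights.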

Regarding claim 5, Sordoni in view of Radford discloses: The computer-implemented method of claim 1, 
Sordoni further discloses: wherein an encoding of a sequence comprises a vector representation of the sequence ([sect 2] Vector representations of words and phrases, also known as embeddings, have been successfully used to encode syntactic or semantic characteristics thereof [3, 4, 25, 35]. We focus on how to capture query similarity and query term similarity by means of such embeddings.).

Regarding claim 6, Sordoni in view of Radford discloses: The computer-implemented method of claim 1, 
Sordoni additionally discloses: wherein generating the prediction comprises: generating an input encoding of the input data ([sect 1] For each word in the query, the RNN takes as input its embedding and updates an internal vector, called recurrent state, that can be viewed as an order-sensitive summary of all the information seen up to that word. The first recurrent state is usually set to the zero vector. After the last word has been processed, the recurrent state can be considered as a compact order-sensitive encoding of the query (Figure 2 (a)).);
generating an output sequence comprising a start of sequence token ([sect 1] The process ends when we obtain k well-formed queries containing the special end-of-query token );
completing the output sequence by: generating a next output sequence token by providing the input encoding to the trained model ([sect 2] When a word is sampled, the recurrent state is updated to take into account the generated word. The process continues until the end-of-query symbol is produced.);
appending the next output sequence token to the output sequence ([sect 2] When a word is sampled, the recurrent state is updated to take into account the generated word. The process continues until the end-of-query symbol is produced.);
iteratively generating next output sequence tokens by providing the input encoding to the trained model and appending each generated next output sequence token to the output sequence until the generated subsequent next output sequence token comprises an end of sequence token ([sect 3.4] The solution to the problem can be approximated using standard word level decoding techniques such as beam-search [11, 23]. We iteratively consider a set of k best prefixes up to length n as candidates and we extend each of them by sampling the most probable k words given the distribution in Eq. 9. We obtain k² queries of length n + 1 and keep only the k best of them. The process ends when we obtain k well-formed queries containing the special end-of-query token);
and generating the prediction based on the output sequence ([sect 1] Query suggestions can be mined by sampling likely continuations given one or more queries as context. Prediction is efficient and can be performed using standard natural language processing word-level decoding techniques [23]. The model is robust to long-tail effects as the prefix is considered as a sequence of words that share statistical weight and not as a sequence of atomic queries.).
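The iterative generation recited in claim 6 can be pictured with the following minimal sketch. It uses greedy decoding rather than the beam search quoted from Sordoni, and every name in it (greedy_decode, toy_model, the "<s>" and "</s>" strings, max_steps) is a hypothetical stand-in rather than anything taken from the application or the references.

```python
from typing import Callable, List, Sequence

SOS, EOS = "<s>", "</s>"  # hypothetical start/end of sequence tokens


def greedy_decode(input_encoding: Sequence[float],
                  next_token: Callable[[Sequence[float], List[str]], str],
                  max_steps: int = 50) -> List[str]:
    """Start from SOS and append one generated token per step until EOS."""
    output = [SOS]
    for _ in range(max_steps):  # guard against a model that never emits EOS
        token = next_token(input_encoding, output)
        output.append(token)
        if token == EOS:
            break
    return output


def toy_model(input_encoding: Sequence[float], prefix: List[str]) -> str:
    """Stand-in for a trained model: replays a fixed continuation."""
    script = ["lake", "erie", "art", EOS]
    return script[min(len(prefix) - 1, len(script) - 1)]
```

With the toy stand-in, greedy_decode([0.0], toy_model) returns ["<s>", "lake", "erie", "art", "</s>"]; a beam search would keep the k best prefixes at each step instead of a single one.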

Regarding claim 9, Sordoni in view of Radford discloses: The computer-implemented method of claim 1, 
Sordoni additionally discloses: wherein: the training set comprises a vocabulary ([sect 4.2] The most frequent 90K words in the background set form our vocabulary V.);
and the encoding for a sequence comprises one hundred percent coverage for the vocabulary ([sect 5] Similarly, our model can handle rare queries as long as their words appear in the model vocabulary.).
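One possible reading of "one hundred percent coverage for the vocabulary" is that every token in the sequences appears in the vocabulary. The sketch below checks that reading only; the action does not define how coverage is measured, so the metric and the function name vocabulary_coverage are assumptions made for illustration.

```python
from typing import Iterable, List, Set


def vocabulary_coverage(sequences: Iterable[List[str]],
                        vocabulary: Set[str]) -> float:
    """Fraction of tokens across all sequences that are in the vocabulary."""
    tokens = [tok for seq in sequences for tok in seq]
    if not tokens:
        return 1.0  # nothing to cover, so treat as fully covered
    covered = sum(1 for tok in tokens if tok in vocabulary)
    return covered / len(tokens)
```

Under this reading, a result of 1.0 corresponds to one hundred percent coverage, and any out-of-vocabulary token lowers the value.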

Claims 3, 4, 11, 12, 13, 14, 17, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Sordoni, in view of Radford, further in view of Cheng et al. (“Long short-term memory-networks for machine reading.” arXiv preprint arXiv:1601.06733 (2016).), hereinafter Cheng.

Regarding claim 3, Sordoni in view of Radford discloses: The computer-implemented method of claim 2,
Sordoni in view of Radford does not explicitly disclose, but Cheng discloses: wherein training the encoder comprises updating the attention weight for at least one token in the encoding of the encoder sequence ([sect 1] The idea is to use multiple memory slots outside the recurrence to piece-wise store representations of the input; read and write operations for each slot can be modeled as an attention mechanism with a recurrent controller. We also leverage memory and attention to empower a recurrent network with stronger memorization capability and more importantly the ability to discover relations among tokens. This is realized by inserting a memory network module in the update of a recurrent network together with attention for memory addressing. [sect 4] In this section we explain how to combine the LSTMN which applies attention for intra-relation reasoning, with the encoder-decoder network whose attention module learns the inter-alignment between two sequences. Figures 3a and 3b illustrate two types of combination. We describe the models more formally below.).
Sordoni, Radford, and Cheng are considered analogous art because they are all in the related art of natural language understanding. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Sordoni in view of Radford with the teachings of Cheng to incorporate updating the attention weight. One would have been motivated to combine these disclosures because doing so would enable shallow structure reasoning over input streams, as suggested by Cheng (sect 2).

Regarding claim 4, Sordoni in view of Radford discloses: The computer-implemented method of claim 2,
Sordoni in view of Radford does not explicitly disclose, but Cheng discloses: wherein training the decoder comprises updating the attention weight for at least one token in the encoding for the decoder sequence corresponding to the encoder sequence ([sect 1] The idea is to use multiple memory slots outside the recurrence to piece-wise store representations of the input; read and write operations for each slot can be modeled as an attention mechanism with a recurrent controller. We also leverage memory and attention to empower a recurrent network with stronger memorization capability and more importantly the ability to discover relations among tokens. This is realized by inserting a memory network module in the update of a recurrent network together with attention for memory addressing. [sect 4] In this section we explain how to combine the LSTMN which applies attention for intra-relation reasoning, with the encoder-decoder network whose attention module learns the inter-alignment between two sequences. Figures 3a and 3b illustrate two types of combination. We describe the models more formally below.).
Sordoni, Radford, and Cheng are considered analogous art because they are all in the related art of natural language understanding. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Sordoni in view of Radford with the teachings of Cheng to incorporate updating the attention weight. One would have been motivated to combine these disclosures because doing so would enable shallow structure reasoning over input streams, as suggested by Cheng (sect 2).

Regarding claim 11, Sordoni discloses: A device, comprising: a processor; and a memory in communication with the processor and storing instructions that, when read by the processor, cause the device to: initialize a model having a sequence to sequence network architecture (See Fig. 1 and Fig. 2, which discuss word/query embeddings, the neural network architecture, and the encoder/decoder architecture; these imply the use of a processor and a memory capable of storing instructions.),
wherein the sequence to sequence network architecture comprises: an encoder; and a decoder ([sect 2] The contribution of our hierarchical recurrent encoder decoder is two-fold.);
train the model based on a training set comprising a plurality of encoder sequences and a plurality of decoder sequences (Figure 3: The hierarchical recurrent encoder-decoder (HRED). The user types cleveland gallery → lake erie art. During training, the model encodes cleveland gallery, updates the session-level recurrent state and maximize the probability of seeing the following query lake erie art. The process is repeated for all queries in the session. During testing, a contextual suggestion is generated by encoding the previous queries, by updating the session-level recurrent states accordingly and by sampling a new query from the last obtained session-level recurrent state. In the example, the generated contextual suggestion is cleveland indian art.),
selecting a subset of the encodings ([sect 1] The query log is partitioned into query sessions, i.e. sequences of queries issued by a unique user and submitted within a short time interval.);
and for each encoding of the encoder sequences: training the encoder using the encoding of the encoder sequence ([Sect 2] The session-level RNN models the sequence of the previous queries, contextualizing the prediction of the next query. Similar contexts are mapped close to each other in the vector space.);
and training the decoder using the encoding of the decoder sequence corresponding to the encoder sequence ([Sect 2] The session-level RNN models the sequence of the previous queries, contextualizing the prediction of the next query. Similar contexts are mapped close to each other in the vector space.);
and generate a prediction based on an input data set using the trained model ([Sect 3.2] Our hierarchical recurrent encoder-decoder (HRED) is pictured in Figure 3. Given a query in the session, the model encodes the information seen up to that position and tries to predict the following query.).
Sordoni does not explicitly disclose, but Radford discloses: wherein training the model comprises: generating an encoding of each encoder sequence and each decoder sequence in the training set, wherein an encoding comprises an attention weight ([Sect 3.1] This model applies a multi-headed self-attention operation over the input context tokens followed by position-wise feedforward layers to produce an output distribution over target tokens:);
appending an informative padding to each of the selected subset of encodings ([sect 3.3] All transformations include adding randomly initialized start and end tokens (<s>, <e>).);
prepending a start of sequence token to each of the encodings of the encoder sequences ([sect 3.3] All transformations include adding randomly initialized start and end tokens (<s>, <e>).);
appending an end of sequence token to each of the encodings of the decoder sequences ([sect 3.3] All transformations include adding randomly initialized start and end tokens (<s>, <e>).);
Sordoni and Radford are considered analogous art because both are in the related art of natural language understanding. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Sordoni with the teachings of Radford to incorporate appending padding and adding start and end of sequence tokens to each of the encodings. One would have been motivated to combine these disclosures because doing so would achieve strong natural language understanding with a single task-agnostic model through generative pre-training and discriminative fine-tuning, as suggested by Radford (sect 6, Conclusion).
Sordoni in view of Radford does not explicitly disclose, but Cheng discloses: updating the attention weight for the encoder sequence based on the training ([sect 1] The idea is to use multiple memory slots outside the recurrence to piece-wise store representations of the input; read and write operations for each slot can be modeled as an attention mechanism with a recurrent controller. We also leverage memory and attention to empower a recurrent network with stronger memorization capability and more importantly the ability to discover relations among tokens. This is realized by inserting a memory network module in the update of a recurrent network together with attention for memory addressing. [sect 4] In this section we explain how to combine the LSTMN which applies attention for intra-relation reasoning, with the encoder-decoder network whose attention module learns the inter-alignment between two sequences. Figures 3a and 3b illustrate two types of combination. We describe the models more formally below.);
Sordoni, Radford, and Cheng are considered analogous art because they are all in the related art of natural language understanding. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Sordoni in view of Radford with the teachings of Cheng to incorporate updating the attention weight. One would have been motivated to combine these disclosures because doing so would enable shallow structure reasoning over input streams, as suggested by Cheng (sect 2).

Regarding claim 12, Sordoni in view of Radford, further in view of Cheng discloses: The device of claim 11,
Cheng further discloses: wherein training the decoder comprises updating the attention weight for at least one token in the encoding for the decoder sequence corresponding to the encoder sequence ([sect 1] The idea is to use multiple memory slots outside the recurrence to piece-wise store representations of the input; read and write operations for each slot can be modeled as an attention mechanism with a recurrent controller. We also leverage memory and attention to empower a recurrent network with stronger memorization capability and more importantly the ability to discover relations among tokens. This is realized by inserting a memory network module in the update of a recurrent network together with attention for memory addressing.  [sect 4] In this section we explain how to combine the LSTMN which applies attention for intra-relation reasoning, with the encoder-decoder network whose attention module learns the inter-alignment between two sequences. Figures 3a and 3b illustrate two types of combination. We describe the models more formally below.).  

Regarding claim 13, Sordoni in view of Radford, further in view of Cheng discloses: The device of claim 11,
Sordoni further discloses: wherein an encoding of a sequence comprises a vector representation of the sequence ([sect 2] Vector representations of words and phrases, also known as embeddings, have been successfully used to encode syntactic or semantic characteristics thereof [3, 4, 25, 35]. We focus on how to capture query similarity and query term similarity by means of such embeddings.).

Regarding claim 14, Sordoni in view of Radford, further in view of Cheng discloses: The device of claim 11,
Sordoni additionally discloses: wherein in the instructions, when read by the processor, further cause the device to generate the prediction by causing the device to:  generating an input encoding of the input data ([sect 1] For each word in the query, the RNN takes as input its embedding and updates an internal vector, called recurrent state, that can be viewed as an order-sensitive summary of all the information seen up to that word. The first recurrent state is usually set to the zero vector. After the last word has been processed, the recurrent state can be considered as a compact order-sensitive encoding of the query (Figure 2 (a)).);
generating an output sequence comprising a start of sequence token ([sect 1] The process ends when we obtain k well-formed queries containing the special end-of-query token );
completing the output sequence by: generating a next output sequence token by providing the input encoding to the trained model ([sect 2] When a word is sampled, the recurrent state is updated to take into account the generated word. The process continues until the end-of-query symbol is produced.);
appending the next output sequence token to the output sequence ([sect 2] When a word is sampled, the recurrent state is updated to take into account the generated word. The process continues until the end-of-query symbol is produced.);
iteratively generating next output sequence tokens by providing the input encoding to the trained model and appending each generated next output sequence token to the output sequence until the generated subsequent next output sequence token comprises an end of sequence token ([sect 3.4] The solution to the problem can be approximated using standard word level decoding techniques such as beam-search [11, 23]. We iteratively consider a set of k best prefixes up to length n as candidates and we extend each of them by sampling the most probable k words given the distribution in Eq. 9. We obtain k² queries of length n + 1 and keep only the k best of them. The process ends when we obtain k well-formed queries containing the special end-of-query token);
and generating the prediction based on the output sequence ([sect 1] Query suggestions can be mined by sampling likely continuations given one or more queries as context. Prediction is efficient and can be performed using standard natural language processing word-level decoding techniques [23]. The model is robust to long-tail effects as the prefix is considered as a sequence of words that share statistical weight and not as a sequence of atomic queries.).

Regarding claim 17, Sordoni in view of Radford, further in view of Cheng discloses: The device of claim 11,
Sordoni further discloses: wherein: the training set comprises a vocabulary ([sect 4.2] The most frequent 90K words in the background set form our vocabulary V.);
and the encoding for a sequence comprises one hundred percent coverage for the vocabulary ([sect 5] Similarly, our model can handle rare queries as long as their words appear in the model vocabulary.).

Regarding claim 18, Sordoni discloses: A computer-implemented method, comprising: initializing a model having a sequence to sequence network architecture (See Fig. 2(a) where the left side shows the initialization of the RNN model),
wherein the sequence to sequence network architecture comprises: an encoder; and a decoder (The contribution of our hierarchical recurrent encoder decoder is two-fold.);
training the model based on a training set comprising a plurality of encoder sequences and a plurality of decoder sequences (Figure 3: The hierarchical recurrent encoder-decoder (HRED). The user types cleveland gallery → lake erie art. During training, the model encodes cleveland gallery, updates the session-level recurrent state and maximize the probability of seeing the following query lake erie art. The process is repeated for all queries in the session. During testing, a contextual suggestion is generated by encoding the previous queries, by updating the session-level recurrent states accordingly and by sampling a new query from the last obtained session-level recurrent state. In the example, the generated contextual suggestion is cleveland indian art.),
selecting a subset of the encodings ([sect 1] The query log is partitioned into query sessions, i.e. sequences of queries issued by a unique user and submitted within a short time interval.);
and for each encoding of the encoder sequences: training the encoder using the encoding of the encoder sequence ([Sect 2] The session-level RNN models the sequence of the previous queries, contextualizing the prediction of the next query. Similar contexts are mapped close to each other in the vector space.);
and training the decoder using the encoding of the decoder sequence corresponding to the encoder sequence ([Sect 2] The session-level RNN models the sequence of the previous queries, contextualizing the prediction of the next query. Similar contexts are mapped close to each other in the vector space.);
obtaining input data; generating an input encoding of the input data ([sect 1] For each word in the query, the RNN takes as input its embedding and updates an internal vector, called recurrent state, that can be viewed as an order-sensitive summary of all the information seen up to that word. The first recurrent state is usually set to the zero vector. After the last word has been processed, the recurrent state can be considered as a compact order-sensitive encoding of the query (Figure 2 (a)).);
generating an output sequence comprising a start of sequence token ([sect 1] The process ends when we obtain k well-formed queries containing the special end-of-query token);
completing the output sequence by: generating a next output sequence token by providing the input encoding to the trained model ([sect 2] When a word is sampled, the recurrent state is updated to take into account the generated word. The process continues until the end-of-query symbol is produced.);
appending the next output sequence token to the output sequence ([sect 2] When a word is sampled, the recurrent state is updated to take into account the generated word. The process continues until the end-of-query symbol is produced.);
iteratively generating next output sequence tokens by providing the input encoding to the trained model and appending each generated next output sequence token to the output sequence until the generated subsequent next output sequence token comprises an end of sequence token ([sect 3.4] The solution to the problem can be approximated using standard word level decoding techniques such as beam-search [11, 23]. We iteratively consider a set of k best prefixes up to length n as candidates and we extend each of them by sampling the most probable k words given the distribution in Eq. 9. We obtain k² queries of length n + 1 and keep only the k best of them. The process ends when we obtain k well-formed queries containing the special end-of-query token);
and generating a prediction based on the output sequence ([sect 1] Query suggestions can be mined by sampling likely continuations given one or more queries as context. Prediction is efficient and can be performed using standard natural language processing word-level decoding techniques [23]. The model is robust to long-tail effects as the prefix is considered as a sequence of words that share statistical weight and not as a sequence of atomic queries.).
Sordoni does not explicitly disclose, but Radford discloses: wherein training the model comprises: generating an encoding of each encoder sequence and each decoder sequence in the training set, wherein an encoding comprises an attention weight ([sect 3.1] This model applies a multi-headed self-attention operation over the input context tokens followed by position-wise feedforward layers to produce an output distribution over target tokens);
appending an informative padding to each of the selected subset of encodings ([sect 3.3] All transformations include adding randomly initialized start and end tokens (<s>, <e>).);
prepending a start of sequence token to each of the encodings of the encoder sequences ([sect 3.3] All transformations include adding randomly initialized start and end tokens (<s>, <e>).);
appending an end of sequence token to each of the encodings of the decoder sequences ([sect 3.3] All transformations include adding randomly initialized start and end tokens (<s>, <e>).);
Sordoni and Radford are considered analogous art because both are in the related art of natural language understanding. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Sordoni with the teachings of Radford to incorporate appending padding and start and end of sequence tokens to each of the encodings. One would have been motivated to combine these disclosures because doing so would achieve strong natural language understanding with a single task-agnostic model through generative pre-training and discriminative fine-tuning, as suggested by Radford (sect 6, Conclusion).
Sordoni in view of Radford does not explicitly disclose, but Cheng discloses: updating the attention weight for the encoder sequence based on the training ([sect 1] The idea is to use multiple memory slots outside the recurrence to piece-wise store representations of the input; read and write operations for each slot can be modeled as an attention mechanism with a recurrent controller. We also leverage memory and attention to empower a recurrent network with stronger memorization capability and more importantly the ability to discover relations among tokens. This is realized by inserting a memory network module in the update of a recurrent network together with attention for memory addressing. [sect 4] In this section we explain how to combine the LSTMN which applies attention for intra-relation reasoning, with the encoder-decoder network whose attention module learns the inter-alignment between two sequences. Figures 3a and 3b illustrate two types of combination. We describe the models more formally below.);
Sordoni, Radford, and Cheng are considered analogous art because they are all in the related art of natural language understanding. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Sordoni, in view of Radford, with the teachings of Cheng to incorporate updating the attention weight. One would have been motivated to combine these disclosures because doing so would enable shallow structure reasoning over input streams, as suggested by Cheng (sect 2).

Claims 7 and 8 are rejected under 35 U.S.C. 103 as being unpatentable over Sordoni, in view of Radford, further in view of Serban, Iulian, et al. (“A hierarchical latent variable encoder-decoder model for generating dialogues.” Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 31. No. 1. 2017.), hereinafter Serban.
Regarding claim 7, Sordoni in view of Radford discloses: The computer-implemented method of claim 1,
Sordoni in view of Radford does not explicitly disclose, but Serban discloses: wherein the encoder sequences comprise a set of dialog prompts ([pg. 2, left col 4th para] HRED models each output sequence with a two-level hierarchy: a sequence of sub-sequences, and sub-sequences of tokens. In particular, a dialogue is modelled as a sequence of utterances (subsequences), with each utterance being a sequence of words. HRED consists of three RNN modules: an encoder RNN, a context RNN and a decoder RNN. Each utterance is deterministically encoded into a real-valued vector by the encoder RNN. See Table 2: Twitter examples for the neural network models. The → token indicates a change of turn.).
Sordoni, Radford, and Serban are considered analogous art because they are all in the related art of natural language understanding. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Sordoni, in view of Radford, with the teachings of Serban to incorporate a set of dialog prompts. One would have been motivated to combine these disclosures because Serban's results demonstrate that the model substantially improves upon earlier models and further highlight how the latent variables facilitate the generation of long utterances with higher information content and maintain dialogue context, as suggested by Serban (Introduction).
Regarding claim 8, Sordoni in view of Radford, further in view of Serban discloses: The computer-implemented method of claim 7,
Serban further discloses: wherein the decoder sequences comprise a set of dialog responses ([pg. 2, left col 4th para] HRED models each output sequence with a two-level hierarchy: a sequence of sub-sequences, and sub-sequences of tokens. In particular, a dialogue is modelled as a sequence of utterances (subsequences), with each utterance being a sequence of words. HRED consists of three RNN modules: an encoder RNN, a context RNN and a decoder RNN. Each utterance is deterministically encoded into a real-valued vector by the encoder RNN. See Table 2: Twitter examples for the neural network models. The → token indicates a change of turn.).

Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Sordoni, in view of Radford, further in view of Chen et al. (“Sequential matching model for end-to-end multi-turn response selection.” ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 7350-7354. IEEE. May 2019.), hereinafter Chen.
Regarding claim 10, Sordoni in view of Radford discloses: The computer-implemented method of claim 1,
Sordoni in view of Radford does not explicitly disclose, but Chen discloses: wherein the model is configured to generate predictions regarding multi-turn dialogs ([sect 3] The multi-turn response selection task is to select the next utterance from a candidate pool, given a multi-turn context. We solve the problem by converting it to a binary classification task, which is similar to previous work [5, 6]. Given a multi-turn context and a candidate response, our model needs to determine whether the candidate response is the proper next utterance. In this section, we will introduce our model, which is originally developed for natural language inference, i.e., Enhanced Sequential Inference Model (ESIM) [11]. The model consists of three main components, i.e., input encoding, local matching, and matching composition, as shown in Figure 1.).
Sordoni, Radford, and Chen are considered analogous art because they are all in the related art of natural language understanding. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Sordoni, in view of Radford, with the teachings of Chen to incorporate a model configured to generate predictions regarding multi-turn dialogs. One would have been motivated to combine these disclosures because ESIM can effectively utilize long multi-turn context information, which may be lost in hierarchy-based methods due to the limited utterance length, and because truncating the context in the reverse direction leads to a performance improvement, which shows that the last few utterances in the context are more important than the first few, as suggested by Chen (Sect 4.4).

Claims 15, 16, 19 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Sordoni, in view of Radford, further in view of Cheng, furthermore in view of Serban.

Regarding claim 15, Sordoni in view of Radford, further in view of Cheng discloses: The device of claim 11,
Sordoni in view of Radford, further in view of Cheng does not explicitly disclose, but Serban discloses: wherein the encoder sequences comprise a set of dialog prompts ([pg. 2, left col 4th para] HRED models each output sequence with a two-level hierarchy: a sequence of sub-sequences, and sub-sequences of tokens. In particular, a dialogue is modelled as a sequence of utterances (subsequences), with each utterance being a sequence of words. HRED consists of three RNN modules: an encoder RNN, a context RNN and a decoder RNN. Each utterance is deterministically encoded into a real-valued vector by the encoder RNN. See Table 2: Twitter examples for the neural network models. The → token indicates a change of turn.).
Sordoni, Radford, Cheng, and Serban are considered analogous art because they are all in the related art of natural language understanding. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Sordoni, in view of Radford, further in view of Cheng, with the teachings of Serban to incorporate a set of dialog prompts. One would have been motivated to combine these disclosures because Serban's results demonstrate that the model substantially improves upon earlier models and further highlight how the latent variables facilitate the generation of long utterances with higher information content and maintain dialogue context, as suggested by Serban (Introduction).

Regarding claim 16, Sordoni in view of Radford, further in view of Cheng, furthermore in view of Serban, discloses: The device of claim 15,
Serban further discloses: wherein the decoder sequences comprise a set of dialog responses ([pg. 2, left col 4th para] HRED models each output sequence with a two-level hierarchy: a sequence of sub-sequences, and sub-sequences of tokens. In particular, a dialogue is modelled as a sequence of utterances (subsequences), with each utterance being a sequence of words. HRED consists of three RNN modules: an encoder RNN, a context RNN and a decoder RNN. Each utterance is deterministically encoded into a real-valued vector by the encoder RNN. See Table 2: Twitter examples for the neural network models. The → token indicates a change of turn.).

Regarding claim 19, Sordoni in view of Radford, further in view of Cheng discloses: The computer-implemented method of claim 18,
Sordoni in view of Radford, further in view of Cheng does not explicitly disclose, but Serban discloses: wherein the encoder sequences comprise a set of dialog prompts ([pg. 2, left col 4th para] HRED models each output sequence with a two-level hierarchy: a sequence of sub-sequences, and sub-sequences of tokens. In particular, a dialogue is modelled as a sequence of utterances (subsequences), with each utterance being a sequence of words. HRED consists of three RNN modules: an encoder RNN, a context RNN and a decoder RNN. Each utterance is deterministically encoded into a real-valued vector by the encoder RNN. See Table 2: Twitter examples for the neural network models. The → token indicates a change of turn.).
Sordoni, Radford, Cheng, and Serban are considered analogous art because they are all in the related art of natural language understanding. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Sordoni, in view of Radford, further in view of Cheng, with the teachings of Serban to incorporate a set of dialog prompts. One would have been motivated to combine these disclosures because Serban's results demonstrate that the model substantially improves upon earlier models and further highlight how the latent variables facilitate the generation of long utterances with higher information content and maintain dialogue context, as suggested by Serban (Introduction).

Regarding claim 20, Sordoni in view of Radford, further in view of Cheng, furthermore in view of Serban discloses: The computer-implemented method of claim 19,
Serban further discloses: wherein the decoder sequences comprise a set of dialog responses ([pg. 2, left col 4th para] HRED models each output sequence with a two-level hierarchy: a sequence of sub-sequences, and sub-sequences of tokens. In particular, a dialogue is modelled as a sequence of utterances (subsequences), with each utterance being a sequence of words. HRED consists of three RNN modules: an encoder RNN, a context RNN and a decoder RNN. Each utterance is deterministically encoded into a real-valued vector by the encoder RNN. See Table 2: Twitter examples for the neural network models. The → token indicates a change of turn.).

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Goyal et al. (US Patent Application Publication No. US 20180203852 A1), hereinafter Goyal. Goyal discloses a method and system to perform natural language generation through a character-based recurrent neural network with finite-state prior knowledge ([0014] “In accordance with one aspect of the exemplary embodiment, a method for generation of a target sequence from a semantic representation. The method includes adapting a target background model, built from a vocabulary of words, to form an adapted background model, which accepts subsequences of an input semantic representation, the input semantic representation including a sequence of characters. The input semantic representation is represented as a sequence of character embeddings. The character embeddings are encoded, with an encoder, to generate a set of character representations. With a decoder, a target sequence of characters is generated, based on the set of character representations. This includes, at a plurality of time steps, generating a next character in the target sequence as a function of a previously generated character of the target sequence and the adapted background model.”). Please also see Figs. 2-4 and 6, and paras. [0015]-[0017] and [0028]-[0036].

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Phillip H Lam, whose telephone number is (571) 272-1721. The examiner can normally be reached 10 AM-6 PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached at (571) 272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/PHILIP H LAM/Examiner, Art Unit 2656                                                                                                                                                                                                        
	/EDGAR X GUERRA-ERAZO/            Primary Examiner, Art Unit 2656