DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Election/Restrictions
Newly submitted claims 2 to 8 and 10 to 11 are directed to an invention that is independent or distinct from the invention originally claimed for the following reasons:
Restriction to one of the following inventions is required under 35 U.S.C. 121:
I. Claims 2 to 8 and 10 to 11, drawn to a computer implemented method comprising, as performed by a computing system comprising one or more computer processors configured to execute specific instructions, receiving from a computing device audio data representing an utterance, receiving contextual data, wherein the contextual data represents a plurality of content items displayed by the computing device when the utterance occurred, and wherein a first content item of the plurality of content items is associated with domain data representing a first domain of a plurality of domains, generating automatic speech recognition (‘ASR’) data using the audio data and a language model, generating natural language understanding (‘NLU’) input data for an NLU subsystem using the ASR data and the contextual data, wherein the NLU input data comprises first neural network input data representing at least a portion of the utterance, second neural network input data that indicates content associated with the first domain was displayed when the utterance occurred, third neural network input data comprising a plurality of elements, wherein a first element of , classified in G10L 15/16.
II. Claims 12, 14, 21 to 24, 26 to 28, and 30 to 31 drawn to a system comprising a computer-readable memory storing executable instructions, and one or more processors in communication with the computer-readable memory and configured by the executable instructions to at least receive from a computing device audio representing an utterance, generate automatic speech recognition (‘ASR’) data using an ASR subsystem and a language model, receiving contextual data from an application subsystem after at least a portion of the ASR data is generated by the ASR subsystem, wherein the contextual data represents a plurality of content items displayed by the computing device when the utterance occurred, and wherein a first content item of the plurality of content items is associated with domain data representing a first domain of a plurality of domains, generate natural language understanding (‘NLU’) input data for an NLU subsystem using the ASR data and the contextual data, wherein the NLU input data comprises first input data representing at least a portion of the utterance, second input data that indicates content associated with the first domain was displayed when the utterance occurred, and third input data comprising a plurality of elements, wherein a first element of the plurality of elements represents the first content item, and wherein , classified in G10L 15/183.
The inventions are independent or distinct, each from the other because:
Inventions I and II are related as subcombinations disclosed as usable together in a single combination.  The subcombinations are distinct if they do not overlap in scope and are not obvious variants, and if it is shown that at least one subcombination is separately usable.  In the instant case, Invention I has separate utility with the neural network using first neural network input data, second neural network input data, and third neural network input data, but Invention II has separate utility for receiving contextual data after a portion of the ASR data is generated with a language model to generate automatic speech recognition data.  That is, Invention I has separate utility as directed to use with a neural network, and Invention II has separate utility as directed to receiving contextual data after generating ASR data.  See MPEP §806.05(d).
Invention II is being designated as the originally-elected invention because it is closest to the prior claim language.  Invention II includes claim 31 which could be construed as a linking claim to link together Inventions I and II, but these inventions remain patentably distinct due to the limitations of generating automatic speech recognition data before receiving contextual data for Invention II.  That is, an ordering of the steps is different between Inventions I and II.  MPEP §809 states: “Where an application includes claims to distinct inventions as well as linking claims, restriction can 
Restriction for examination purposes as indicated is proper because all the inventions listed in this action are independent or distinct for the reasons given above and there would be a serious search and/or examination burden if restriction were not required because one or more of the following reasons apply:
There would be a serious burden on quality examination if these two inventions were examined together due to the complexity of the issues and the separate areas of search.  Invention I raises an issue of improper written description under 35 U.S.C. §112(a) for independent claim 2 that is not required for Invention II with independent claim 12.  Additionally, Applicants are arguing a somewhat subtle difference as being significant for Invention II as directed to generating automatic speech recognition data with a language model prior to receiving contextual data, and this limitation is not presented by Invention I.  Given the already lengthy nature of the Office Action, there would be a serious burden on quality examination if these two inventions were examined together due to the complexity of the issues.
Since Applicants have received an action on the merits for the originally presented invention, this invention has been constructively elected by original presentation for prosecution on the merits.  Accordingly, claims 2 to 8 and 10 to 11 are withdrawn from consideration as being directed to a non-elected invention.  See 37 CFR 1.142(b) and MPEP § 821.03.


Claim Rejections - 35 USC § 112
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.

The following is a quotation of the first paragraph of pre-AIA  35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.

Claim 31 is rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA ), first paragraph, as failing to comply with the written description requirement.  The claim contains subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA  35 U.S.C. 112, the inventors, at the time the application was filed, had possession of the claimed invention.
Claim 31 is improper as setting forth new matter and as omitting ‘essential subject matter’ under MPEP §2172.02.  Specifically, claim 31 sets forth limitations directed to “wherein the first input data comprises first neural network input data, wherein the second input data comprises second neural network input data, and wherein the third input data comprises third neural network input data”, where “first neural network input data”, “second neural network input data”, and “third neural network input data” are new matter because they are not described in the originally-filed Specification.  Generally, Applicants’ Specification, ¶[0015], ¶[0016], ¶[0087], ¶[0145], 
Moreover, MPEP §2172.01 states that a claim that omits matter disclosed to be essential may be rejected under 35 U.S.C. §112(a).  Here, ‘essential subject matter’ that is omitted from this claim is a neural network.  The claim does not set forth a neural network, but only “neural network input data”.  The claim is improper because construction of ‘neural network input data’ cannot be properly performed in the absence of a neural network to receive this input data.  A “neural network” is ‘essential matter’ that is omitted from the claim language for “first neural network input data”, “second neural network input data”, and “third neural network input data”.  

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 12, 21, and 28 are rejected under 35 U.S.C. 103 as being unpatentable over Mathias et al. (U.S. Patent Publication 2015/0302002) in view of Han et al. (U.S. Patent Publication 2011/0029301).
Concerning independent claim 12, Mathias et al. discloses a system for multi-domain natural language processing, comprising:
“computer-readable memory storing executable instructions; and one or more processors in communication with the computer readable memory and configured by the executable instructions to at least:” – a method, processing, routine, or algorithm can be embodied in a software module executed by a processor, and an exemplary storage medium can be coupled to a processor so that the processor can read information from a non-transitory computer-readable medium (¶[0066]);
“receive, from a computing device, audio representing an utterance” – a user utterance is processed with respect to multiple subject matter domains (Abstract); a user may issue spoken commands or make spoken utterances to a client device (¶[0016]); input from a client device 102 may be a user utterance transmitted to spoken processing system 100 (¶[0033]: Figure 1);

“receive contextual data from an application subsystem after at least a portion of the ASR data is generated by the ASR subsystem [, wherein the contextual data represents a plurality of content items displayed by the computing device when the utterance occurred, and wherein a first content item of the plurality of content items is associated with domain data representing a first domain of a plurality of domains]” – each user interaction with a user device or spoken language processing system may create history, or context, that can be used to determine the user’s intent; specific hints may be generated regarding what the user is likely to do next; a user device may create history, or context, that can be used to determine the user’s intent when processing a subsequent utterance (¶[0014]); context ranker 208 can determine which result is the most appropriate and choose a single domain or result with which to proceed, or context ranker 208 may generate an N-best list of likely user intents (¶[0038]: Figure 3); context 
“generate natural language understanding (‘NLU’) input data for an NLU subsystem using the ASR data and the contextual data, wherein the NLU input data comprises: first input data representing at least a portion of the utterance” – multi-domain natural language understanding (“NLU”) engine may process the transcriptions in multiple individual domains; additionally, hints may be generated based on previous user interactions and other data, and multi-domain NLU engine may use the hints to more efficiently process input or more accurately generate output (Abstract); spoken language processing system 100 may use a multi-domain NLU engine to determine what the user would like to do, also known as the user intent, based on the transcription from the ASR module (“the ASR data”); the multi-domain NLU engine may also consider hints or history (“the contextual data”) based on previous user interactions or other data when determining the user intent (¶[0018]: Figure 1);
“generate NLU output data using the NLU subsystem, the first input data, [and the second input data, and the third input data,] wherein the NLU output data represents a correspondence of the utterance to intent data associated with the first domain” – modules of a spoken language processing system including a natural language understanding (“NLU”) module may interpret the user’s words to determine what action the user would like to initiate, known as a user intent (¶[0011]); aspects relate to 
“sending the intent data to the first domain” – a multi-domain NLU engine can then select a particular domain specific result on which the spoken language processing system 100 will base its response; the selection may be based on a likelihood that each individual result is reflective of the user’s actual intent (¶[0019]: Figure 1); results obtained from other domain-specific NLU modules may be less likely to reflect the user’s actual intent; as a result, spoken language processing system 100 may decide to produce a response to the utterance based on the result returned by a directions domain-specific NLU module (¶[0020]: Figure 1); a most likely interpretation of user intent may be selected from the N-best list from the various single-domain NLU modules; a response can vary depending upon which domain provided the most likely analysis of the user’s intent; if the music domain produced an analysis of the utterance with the highest score, then an executable command to play the requested music may be the most appropriate response (¶[0060] - ¶[0061]: Figure 5: Steps 510 to 512).
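For illustration only, the selection of a most likely interpretation from an N-best list of domain-specific results, as summarized above for Mathias et al., can be sketched as follows.  This is a minimal sketch under stated assumptions and is not a characterization of any actual implementation in the reference: the names `DomainResult` and `select_intent`, the intent labels, and the scores are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DomainResult:
    domain: str   # hypothetical domain label, e.g. "music"
    intent: str   # hypothetical intent label produced by that domain
    score: float  # assumed likelihood that the result reflects the user's intent


def select_intent(n_best):
    """Return the highest-scoring domain-specific result from an N-best list."""
    if not n_best:
        raise ValueError("n_best must contain at least one result")
    return max(n_best, key=lambda result: result.score)


# Hypothetical N-best list; the scores are invented for illustration.
results = [
    DomainResult("music", "PlayMusic", 0.82),
    DomainResult("directions", "GetDirections", 0.11),
    DomainResult("shopping", "BuyItem", 0.07),
]
best = select_intent(results)
print(best.domain, best.intent)
```

In this sketch the response would be routed to whichever domain produced the highest score, which mirrors the music-domain example cited from ¶[0060] - ¶[0061].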
Mathias et al. provides a basic framework of conventional natural language processing that includes an automatic speech recognition module using a language model to generate a transcription of a user utterance and a natural language understanding module that receives the transcription and generates output that selects an intent and a domain, where the natural language understanding module may use contextual data to determine an intent and a domain.  However, Mathias et al. only states that contextual data may comprise a history of interactions, or some unspecified other data, and does not disclose “wherein the contextual data represents a plurality of content items displayed by the computing device when the utterance occurred, and wherein a first content item of the plurality of content items is associated with domain data representing a first domain of a plurality of domains”.  That is, Mathias et al.’s contextual data that is used to determine an intent and a domain does not relate to what is currently being displayed on a user device when a user spoke an utterance.  Similarly, Mathias et al. does not disclose “second input data that indicates content associated with the first domain was displayed when the utterance occurred”, “third input data comprising a plurality of elements, wherein a first element of the plurality of elements represents the first content item, and wherein a second element of the plurality of elements represents a second content item of the plurality of content items”, or generating NLU output data using “the second input data, and the third input data”.  Here, “second input data” appears to relate to whether or not a particular content item element of “third input data” was displayed or not displayed.  Compare Specification, ¶[0142]: Figure 1.   
Han et al. teaches whatever limitations may be omitted by Mathias et al.  Generally, Han et al. teaches recognizing speech according to dynamic display.  (Abstract)  Words displayed as text on a screen may become objects having weights, and a domain may include a group of words that can be recognized as associated with each other.  A domain may be a broad geographic region in a map system, and a specific region may be defined as a domain, e.g., a Rome domain, a Colosseum domain, and a Gladiator domain.  A domain associated with a current screen may be created, and word information and domain information may be acquired for recognized objects (“wherein the contextual data represents a plurality of content items displayed by the computing device when the utterance occurred, and wherein a first content item of the plurality of content items is associated with domain data representing a first domain of a plurality of domains”).  Speech recognizer 120 may adjust a word weight for at least one word associated with the current screen and a domain weight for at least one domain included in the current screen, so that a language model assigns greater weight to words and domains related to a current screen than weights assigned to words and domains not related to the current screen.  (¶[0050] - ¶[0054]: Figure 2)  Words and domains included in a current screen may be transferred to controller 110 and display information manager 210, where word information may include word IDs, word coordinates, and domain information may include domain IDs and domain area coordinates.  (¶[0065]: Figure 2)  Han et al.’s words and domains included in a current screen, then, are “second input data that indicates content associated with the first domain was displayed when the utterance occurred”.  Word weight adjusting unit 222 receives display information, e.g., Gyeongbokgung, Gyotaejeon, etc., which are displayed on the screen at time t.  
(¶[0107]: Figures 7A to 7B)  Figures 7A to 7B and 11A to 11B, then, illustrate that each of displayed location objects on a map, e.g., Gyeongbokgung, Royal Museum, Galleria Hyundai, National Folk Museum, and Kogsuji, is “a first content item of a plurality of content items”, and “a first element” is a weight assigned to “a first content item”, e.g., a weight value of 0.5 for Gyeongbokgung and “a second element” is a weight assigned to “a second content item”, e.g., a weight value of 0.4 for Royal Museum.  Han et al.’s weights for each displayed object, then, can be construed to be “third input data” corresponding to ‘elements’ for ‘content items’ of Applicants.  An objective is to improve a speech recognition rate and speed by reflecting information for a dynamic display.  (Abstract)  It would have been obvious to one having ordinary skill in the art to provide third input data of first and second elements for first and second content items as taught by Han et al. to perform natural language understanding in Mathias et al. for a purpose of improving speech recognition rate and speed.
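For illustration only, the effect of assigning greater weight to displayed words than to words not on the current screen can be sketched as follows.  The weight values of 0.5 for Gyeongbokgung and 0.4 for Royal Museum are taken from the discussion above; everything else is an assumption, including the function name `rescore`, the base recognition scores, the default weight for undisplayed words, the undisplayed candidate “Gyeongju”, and the multiplicative way of combining scores and weights.  The sketch does not purport to reproduce the language model of Han et al.

```python
def rescore(hypotheses, display_weights, default_weight=0.1):
    """Rank candidate words by base score multiplied by a display weight.

    hypotheses: dict mapping candidate word -> assumed base recognition score
    display_weights: dict mapping displayed word -> weight for the current screen
    default_weight: assumed weight for words not displayed on the current screen
    """
    rescored = {
        word: base * display_weights.get(word, default_weight)
        for word, base in hypotheses.items()
    }
    return sorted(rescored, key=rescored.get, reverse=True)


# Weights for displayed objects, as discussed for Figures 7A to 7B.
display_weights = {"Gyeongbokgung": 0.5, "Royal Museum": 0.4}

# Hypothetical base scores; "Gyeongju" stands in for a word not on the screen.
hypotheses = {"Gyeongbokgung": 0.30, "Royal Museum": 0.30, "Gyeongju": 0.35}

ranking = rescore(hypotheses, display_weights)
print(ranking)
```

Under these assumed numbers, the undisplayed candidate falls to the bottom of the ranking even though its base score is the highest, which is the behavior the weights are intended to produce.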
Mathias et al. discloses that slot filler 206 can ensure that all data required to implement a user intent is present; slot filler 206 can remove unnecessary information from NLU results or modify NLU results, or ensure that all information necessary to implement the user intent is present; a user may have said “Give Me Shelter”, and determine that a correct interpretation is to play the song “Gimme Shelter” by the Rolling Stones (¶[0047]: Figure 3).  Mathias et al., then, discloses “generating second NLU output data . . . wherein the second NLU output data represents a correspondence of the portion of the utterance to a content slot associated with the intent data”, i.e., “Gimme Shelter” is “second NLU output data” that corresponds to “a content slot” associated with an intent to play that song.
Concerning claim 28, Mathias et al. discloses that an automatic speech recognition (“ASR”) module of a spoken language processing system may use various models including language models and acoustic models; other modules of a spoken language processing system include a natural language understanding (“NLU”) module that may interpret the user’s words as received from an ASR module (¶[0011]); domain-specific NLU modules may return scores regarding a likelihood that each result corresponds to the user’s actual intent; the result from the music domain may have a high score because Beethoven’s 5th symphony is an actual music recording (¶[0024]).  Mathias et al., then, discloses that an automatic speech recognition module can include a language model that is separate and distinct from a natural language understanding module.  Accordingly, Mathias et al. discloses generating NLU output data “using an NLU model different from the language model.”

Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over Mathias et al. (U.S. Patent Publication 2015/0302002) in view of Han et al. (U.S. Patent Publication 2011/0029301) as applied to claim 12 above, and further in view of Sarikaya et al. (U.S. Patent No. 9,767,091).
Concerning claim 14, Mathias et al. does not clearly disclose “prior to receiving the audio data, generating display data using the first domain, wherein the display data represents the first content item, wherein the first content item is to be displayed by the computing device, and wherein the first content item is associated with the first domain” and “sending the display data to the computing device.”  Still, it is well known to display content items on a client device for selection in natural language processing, where a content item would implicitly be associated with a domain.  That is, these limitations only require displaying a content item associated with a domain to be selected by a spoken utterance.  Specifically, Sarikaya et al. teaches that a domain is predicted based on contextual information that includes display information, where this contextual information may include items located on a display of a client computing device.  (Column 6, Lines 40 to 49; Column 10, Lines 29 to 42)  Sarikaya et al. discloses that domain set predictor 135 receives a set of possible domains identified by natural language analysis component 130, and adds, modifies, or changes the possible domains based on other contextual information, which may include information previously received, turn-based information, and display information may be used to refine the set of possible domains.  (Column 6, Lines 40 to 51: Figure 1)  Contextual information may include information extracted from each turn in a turn.  Contextual information may include a response to a previous turn by dynamic system 100, where   An objective is to provide an analysis of incomplete natural language expressions using additional contextual information.  (Column 2, Lines 6 to 14)  It would have been obvious to one having ordinary skill in the art to display data in a domain as a content item prior to receiving audio data as taught by Sarikaya et al. 
in natural language processing using context of Mathias et al. for a purpose of analyzing incomplete natural language expressions.

Claims 22 to 24 are rejected under 35 U.S.C. 103 as being unpatentable over Mathias et al. (U.S. Patent Publication 2015/0302002) in view of Han et al. (U.S. Patent Publication 2011/0029301) as applied to claim 12 above, and further in view of Cao et al. (U.S. Patent Publication 2018/0032897).
Concerning claim 22, Han et al. teaches “wherein the third input data represents . . . the plurality of elements, wherein the first element comprises a first identifier of the first content item, and wherein the second element comprises a second identifier of the second content item.”  Here, Han et al. teaches that each word on a screen may be a name of a place or a name of an object, where each object has a weight.  (¶[0050]: Han et al., then, teaches “a first identifier of the first content item”, i.e., number 24 or coordinates (3.35, 5.75) and “a second identifier of the second content item”, i.e., number 25 or coordinates (3.46, 5.62).  The only limitation omitted by Han et al. is that the third input data “represents a vector”.  Still, a vector representation is commonly used whenever a plurality of components are grouped together, and x-y coordinates could conceivably be construed as a “vector”.  Moreover, Han et al. teaches representing a name of an object by words, and it is known in the art of machine learning to represent words as vectors.
Concerning claim 22, even if “a vector comprising the plurality of elements” is omitted by Han et al., this is taught by Cao et al.  Generally, Cao et al. teaches an embedding representation based on word clustering for a machine learning algorithm.  (Abstract)  An embedding representation is a low dimensional and real-valued vector.  (¶[0015])  A word is represented by a vector, which is called a word embedding.  (¶[0017]: Figure 1)  Here, Han et al. teaches a plurality of words corresponding to displayed names of places or names of objects on a map, and Cao et al. teaches that words may be represented by vectors for clustering in machine learning.  It would have been obvious to one having ordinary skill in the art to represent words corresponding to Han et al. as vectors as taught by Cao et al. for a purpose of clustering words in a machine learning algorithm.
Concerning claim 23, Han et al. teaches that displayed names of places or objects are identified by word IDs, and are ‘ordered’ in a table according to an x-coordinate at which they are displayed as illustrated in Figures 11A to 11B.  That is, Figure 11B illustrates that words ‘Jipokjae’ to ‘Dongjeongmun’ are associated with word IDs numbers 21 to 29, and the x-coordinate increases from 3.10 to 3.72 corresponding to the order that they are displayed on a screen.  Cao et al. teaches that a word is represented by a vector, which is called a word embedding.  (¶[0017]: Figure 1)  If words corresponding to word IDs are arranged in an order corresponding to how they are displayed as taught by Han et al., and these words are represented by a vector as taught by Cao et al., then “an order in which the plurality of elements are arranged corresponds to an order in which the plurality of content items were displayed by the computing device when the utterance occurred.”  Generally, ordering the elements of a vector according to the order the corresponding words were displayed could be considered by one skilled in the art to be an obvious ordering structure for Han et al.
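For illustration only, arranging elements in the order in which the corresponding content items are displayed can be sketched as a sort on the x-coordinate.  The word IDs and x-coordinates below (21 at 3.10, 24 at 3.35, 25 at 3.46, 29 at 3.72) follow the values discussed in this Office Action for Figure 11B; the dictionary layout and the function name `ordered_elements` are hypothetical and are not taken from either reference.

```python
def ordered_elements(items):
    """Return word IDs arranged by increasing x-coordinate (display order)."""
    return [item["word_id"] for item in sorted(items, key=lambda item: item["x"])]


# Displayed items in arbitrary arrival order; values follow the Figure 11B discussion.
displayed = [
    {"word_id": 25, "x": 3.46},
    {"word_id": 21, "x": 3.10},
    {"word_id": 29, "x": 3.72},
    {"word_id": 24, "x": 3.35},
]

vector = ordered_elements(displayed)
print(vector)
```

The resulting list places each element at a position that corresponds to where its content item appears on the screen, which is the ordering relationship at issue in claim 23.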
Concerning claim 24, Cao et al. teaches predicting a label for data using clusters of words, where a document may have a title, and clustering is based on word embeddings of words in the title.  (¶[0003])  Responsive to determining that a document has a title, clusters are ranked based on cosine similarity of word embeddings of words in the title.  (¶[0004])  A title is represented by summing word embeddings of the words in the title.  (¶[0026])  If a document has a title, clusters are ranked or ordered based on cosine similarity of words in the title.  (¶[0038])  Cao et al., then, teaches “the first  Han et al. teaches that content items can be words that are displayed on a screen, and Cao et al. teaches representing words in a title by word embeddings for machine learning.
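For illustration only, representing a title by summing word embeddings and ranking clusters by cosine similarity, as summarized above for Cao et al., can be sketched as follows.  The two-dimensional embeddings, the cluster vectors, the cluster names, and the function names are toy assumptions chosen so the arithmetic is easy to follow; they are not values or interfaces from the reference.

```python
import math


def add_vectors(vectors):
    """Sum a list of equal-length vectors component by component."""
    return [sum(components) for components in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy word embeddings (assumed values).
embeddings = {"royal": [1.0, 0.0], "museum": [0.0, 1.0]}

# A title is represented by the sum of the embeddings of its words.
title_vec = add_vectors([embeddings[word] for word in ["royal", "museum"]])

# Hypothetical cluster vectors, ranked by similarity to the title.
clusters = {"places": [1.0, 1.0], "music": [1.0, -1.0]}
ranked = sorted(clusters, key=lambda name: cosine(title_vec, clusters[name]), reverse=True)
print(ranked)
```

With these toy values the title vector points in the same direction as the “places” cluster and is orthogonal to the “music” cluster, so “places” ranks first.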

Claims 26 to 27 are rejected under 35 U.S.C. 103 as being unpatentable over Mathias et al. (U.S. Patent Publication 2015/0302002) in view of Han et al. (U.S. Patent Publication 2011/0029301) as applied to claim 12 above, and further in view of Vibbert et al. (U.S. Patent Publication 2016/0042735).
Mathias et al. discloses contextual data related to a history of interactions, but omits “wherein the contextual data represents a plurality of key-value pairs, wherein a first key-value pair of the plurality of key-value pairs comprises key data representing the first domain and value data representing the first content item” and “generating the second input data using the key data” and “generate the third input data using the value data.”  However, Vibbert et al. teaches dialog flow management to extract task related information from a natural language input.  (Abstract)  A user client 201 may receive natural language dialog inputs including speech inputs from a human user, and an automatic speech recognition (ASR) engine 202 may process the speech inputs to determine corresponding sequences of representative text words.  A natural language understanding (NLU) engine 203 may process the text words to determine corresponding semantic interpretations.  Context sharing module 205 may provide a common context sharing mechanism to each of the dialog components including ASR Vibbert et al., then, teaches “a plurality of key-value pairs” {(CONTACT, Debbie Sanders), (CONTACT, Debbie Xanders)}, where ‘CONTACT’ is a “key” and ‘Debbie Sanders’ and ‘Debbie Xanders’ are “values”, and at least ‘(CONTACT, Debbie Sanders)’ is “a first key-value pair” that comprises “the first domain” of ‘CONTACT’ and “value data” of ‘Debbie Sanders’.  Correspondingly, “the second input data” is “the key data” of ‘CONTACT’ and “the third input data” is “the value data” of ‘Debbie Sanders’ and ‘Debbie Xanders’.  That is, “the third input data comprises a plurality of elements”, where “a first element” is ‘Debbie Sanders’ and “a second element” is ‘Debbie Xanders’.  Compare Specification, ¶[0142].  Vibbert et al. teaches using these key-value pairs as contextual information to choose among possible selections, where one embodiment is visual context information Vibbert et al. 
as context data of Mathias et al. for a purpose of weighting focus and expectation of visual context information.
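For illustration only, the mapping discussed above between key-value pairs of contextual data and the claimed “second input data” and “third input data” can be sketched as follows.  The pairs (CONTACT, Debbie Sanders) and (CONTACT, Debbie Xanders) are taken from the discussion of Vibbert et al.; the function name `build_inputs` and the use of plain lists are hypothetical, and the sketch is not a description of how any reference or the claimed system actually encodes these data.

```python
def build_inputs(pairs):
    """Split key-value pairs into key data and value data.

    Returns (keys, values): keys lists each distinct key once, in first-seen
    order; values lists every value in the order the pairs were given.
    """
    keys = []
    values = []
    for key, value in pairs:
        if key not in keys:
            keys.append(key)
        values.append(value)
    return keys, values


# Key-value pairs as discussed for Vibbert et al.
context = [("CONTACT", "Debbie Sanders"), ("CONTACT", "Debbie Xanders")]

second_input, third_input = build_inputs(context)
print(second_input)  # key data
print(third_input)   # value data
```

Here the key data plays the role of the “second input data” and the value data supplies the “plurality of elements” of the “third input data”, consistent with the construction set forth in the rejection.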

Claim 30 is rejected under 35 U.S.C. 103 as being unpatentable over Mathias et al. (U.S. Patent Publication 2015/0302002) in view of Han et al. (U.S. Patent Publication 2011/0029301) as applied to claim 12 above, and further in view of Li et al. (U.S. Patent Publication 2018/0157638).
Mathias et al. does not disclose that natural language output data is generated using a first long short-term memory unit and a second long short-term memory unit, where first encoded utterance data comprises a first encoded representation of an utterance in a forward direction and a second encoded representation of the utterance in a backward direction.  However, a bi-directional long short-term memory neural network for natural language understanding producing forward and backward outputs from utterances is known in the prior art as taught by Li et al.  Generally, Li et al. teaches joint language understanding and dialogue management with a processing unit that can operate as an end-to-end recurrent neural network with contextual dialogue memory for natural language understanding (NLU).  (Abstract)  Specifically, an end-to-Mathias et al. using a bi-directional long short-term memory recurrent neural network of Li et al. to improve accuracy and processing speed of predictions from spoken utterances across a wide variety of domains without requiring hand-crafted features or a large number of annotated conversations.

Claim 31 is rejected under 35 U.S.C. 103 as being unpatentable over Mathias et al. (U.S. Patent Publication 2015/0302002) in view of Han et al. (U.S. Patent Publication 2011/0029301) as applied to claim 12 above, and further in view of Chen et al. (U.S. Patent Publication 2017/0372200).
Mathias et al. does not expressly disclose that natural language understanding uses a neural network as appears implied by “first neural network input data”, “second neural network input data”, and “third neural network input data”.  Given that there is no “neural network” actually set forth by the claim limitations, it is unclear how to construe “first neural network data”, “second neural network data”, and “third neural network data”.  That is, if there is no neural network expressly set forth by the claim language, then one could conceivably construe “first neural network input data”, “second neural network input data”, and “third neural network input data” as equivalent to “first input data”, “second input data”, and “third input data”, due to the inferential manner of claiming.  However, even if a neural network is omitted by Mathias et al., Chen et al. teaches end-to-end memory network for contextual language understanding using a neural network to encode inputs including utterances with intents and slots to exploit contextual information that includes visual context.  (Abstract)  A neural network models knowledge carryover in multi-turn conversations, where input encoded with intents and slots can be stored as embeddings in memory and decoding can exploit latent contextual information from memory including visual context.  (¶[0005])  Contextual information has proven useful for spoken language understanding (SLU), where keeping contextual knowledge increases a likelihood of correctly estimating a semantic slot message with the same intent.  Contextual information is incorporated into a Chen et al.’s neural network uses at least “first neural network input data” representing an utterance and “second neural network input data” representing visual context of “content . . . displayed when the utterance occurred”.  An objective is to improve accuracy and processing speed for spoken language understanding.  
(¶[0004])  It would have been obvious to one having ordinary skill in the art to use a neural network to receive neural network input data to determine an intent by natural language understanding as taught by Chen et al. using input data of Han et al. for a purpose of improving accuracy and processing speed for spoken language understanding by exploiting latent contextual information from visual context.  

Response to Arguments
Applicants’ arguments filed 22 November 2021 have been fully considered but they are not persuasive.
Applicants amend independent claims 2 and 12, add new dependent claim 31, and present arguments traversing the prior rejection of these independent claims as being obvious under 35 U.S.C. §103 over Mathias et al. (U.S. Patent Publication 2015/0302002) in view of Han et al. (U.S. Patent Publication 2011/0029301).  Specifically, Applicants amend independent claim 2 to set forth new limitations directed to first “neural network” input data, second “neural network” input data, and third “neural Han et al. cannot be properly combined with Mathias et al. because this would change the principle of operation, citing MPEP §2143.01 VI.  Specifically, Applicants contend that Han et al. does not generate any natural language understanding (NLU) data, so that a modification would improperly change a principle of operation from speech recognition using a language model, and that there is no evidence or reasoned analysis about this in the Office Action.  Additionally, Applicants state that independent claim 2 is amended to set forth “first neural network input data”, “second neural network input data”, and “third neural network input data”, which is not taught by Han et al.  Applicants provide a brief argument directed against independent claim 12 that generating automatic speech recognition (‘ASR’) data and then receiving contextual data is patentably distinct from the prior art.  
Applicants’ separate amendments to independent claims 2 and 12 render these two inventions patentably distinct as Inventions I and II, and Invention I, corresponding to independent claim 2, is being withdrawn according to the doctrine of election by original presentation.  The Office Action sets forth reasoning as to why the two inventions are patentably distinct and would create a serious burden on examination if both inventions were examined together.  Significantly, the already lengthy and complex 
Applicants’ new dependent claim 31 raises issues of new matter under 35 U.S.C. §112(a), as new grounds of rejection.  (This new matter rejection could similarly be applied to withdrawn independent claim 2 of Invention I.)  Firstly, this claim omits ‘unclaimed essential subject matter’ of a neural network for the limitations of “first neural network input data”, “second neural network input data”, and “third neural network input data”.  See MPEP §2172.02.  Given that there is no neural network that is expressly set forth as an element of the claims, broad construction of the limitations of ‘neural network input data’ may render it equivalent to simply ‘input data’.  Secondly, Applicants’ Specification, as originally filed, does not expressly describe “second neural network input data” and “third neural network input data”.  Here, ¶[0145] of the Specification appears to provide the most complete description of what is input to a neural network as including only text data from an automatic speech recognition (ASR) system as a vector and contextual data as context key vector data and context value vector data.  The Specification does not expressly use the terminology of “first neural network input data”, “second neural network input data”, and “third neural network input data”.  
New grounds of rejection are set forth as directed to new dependent claim 31 as being obvious under 35 U.S.C. §103 further in view of Chen et al. (U.S. Patent Publication 2017/0372200).  Generally, Chen et al. teaches neural networks for spoken language understanding using contextual information that can specifically include visual context to determine intents.  (Abstract)  Chen et al.’s input to the neural network Chen et al.)  Given that neural networks are increasingly being applied to tasks of natural language understanding, it would have been obvious to one having ordinary skill in the art to apply a neural network to process “first input data”, “second input data”, and “third input data” representing items displayed in Han et al. to improve accuracy and processing speed and improve intent classification as taught by Chen et al.
Applicants’ arguments that Mathias et al. and Han et al. are an improper combination because the combination would change the principle of operation are not persuasive.  The examiner submits that Applicants present no evidence as to why the combination would ‘improperly change the principle of operation’, and that this is simply an allegation.  Applicants are correct that there is nothing that generates natural language understanding input data in Han et al.  However, this does not render the combination improper, as an absence of this teaching in Han et al. would not change the principle of operation of Mathias et al.  Here, Mathias et al. uses context from a history as a hint to determine an intent for multi-domain natural language understanding.  Mathias et al. does not use display information as context, but using information of what is displayed on a screen to recognize speech is taught by Han et al.  Granted, Han et al. uses information about what is displayed and a language model to recognize speech instead of using a language model to recognize speech and context for natural language understanding as in Mathias et al.  However, one skilled in the art could similarly simply ‘plug in’ information about what is currently being displayed as taught by Han et al. into context information for natural language Mathias et al.  The circumstance that Han et al. uses information about what is displayed at a different point in a sequence of steps does not ‘change a principle of operation’, but simply uses this information at a different point in a process.  From a viewpoint of one having ordinary skill in the art, varying the point in a process at which a taught feature is introduced does not change a principle of operation.  This argument is to some degree moot given that independent claim 2 is withdrawn according to a principle of election by original presentation.
Applicants’ sole argument directed against independent claim 12 is similarly unpersuasive.  Here, there is no improvement over the prior art when the same sequence of generating automatic speech recognition (ASR) data and then receiving contextual data is performed by Mathias et al.  That is, Applicants’ amendment to place a step of recognizing speech by an automatic speech recognition subsystem and then receiving contextual data does not distinguish over the prior art because this is precisely the order of steps that is being performed by Mathias et al.  Specifically, Mathias et al. discloses, at ¶[0013] - ¶[0014], ¶[0017] - ¶[0018], and ¶[0037] - ¶[0038], that ASR module 112 first produces a transcription of what the user said and then uses history and hints as context that is input to a multi-domain natural language understanding engine.  
Applicants’ arguments are not persuasive.  New grounds of rejection are set forth for new matter under 35 U.S.C. §112(a) and for obviousness under 35 U.S.C. §103 further in view of Chen et al. (U.S. Patent Publication 2017/0372200) for new dependent claim 31.  Applicants’ amendments raise an issue under the doctrine of election by original presentation necessitating withdrawal of claims from consideration under that 

Conclusion
The prior art made of record and not relied upon is considered pertinent to Applicants’ disclosure.
Sarikaya et al. (U.S. Patent No. 9,412,363), Watanabe et al., Celikyilmaz et al., Prokofieva et al., Nakagawa et al., and Aleksic et al. disclose related prior art.
Applicants’ amendment necessitated the new grounds of rejection presented in this Office Action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP §706.07(a).  Applicants are reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MARTIN LERNER whose telephone number is (571) 
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel Washburn, can be reached on (571) 272-5551.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair.  Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/MARTIN LERNER/Primary Examiner
Art Unit 2657
January 13, 2022