DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement
The information disclosure statement (IDS) was submitted on 05/05/2020. The submission is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1-3, 6-7, 9-11, 14-15, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Reavely; Simon Peter (US 10453117 B1; hereinafter referred to as Reavely et al.) further in view of Zhou; Zhengyu (US 20180096678 A1; hereinafter referred to as Zhou et al.) and M. Sundermeyer, et al. ("From Feedforward to Recurrent LSTM Neural Networks for Language Modeling," in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 3, pp. 517-529, March 2015, doi: 10.1109/TASLP.2015.2400218. https://ieeexplore.ieee.org/document/7050391; hereinafter referred to as Sundermeyer et al.). 


As to independent claim 1, Reavely et al. teaches a method executed by a controller for speech recognition in a system (see Col. 6, lines 59-67 and Col. 23, lines 33-35: “(28) The device or devices performing the ASR processing may include an acoustic front end (AFE) 256 and a speech recognition engine 258 […] (107) Each of these devices (110/120) may include one or more controllers/processors (604/704), […]”) comprising:
parsing a plurality of candidate speech recognition results from a speech input (see Col. 8, lines 15-30 and Col. 8, lines 35-44: “(34) Generally, the NLU process takes textual input (such as processed from ASR 250 based on the utterance input audio 11) and attempts to make a semantic interpretation of the text. That is, the NLU process determines the meaning behind the text based on the individual words and then implements that meaning. (36) As will be discussed further below, the NLU process may be configured to parsed and tagged to annotate text as part of NLU processing. For example, for the text “call mom,” “call” may be tagged as a command (to execute a phone call) and “mom” may be tagged as a specific entity and target of the command (and the telephone number for the entity corresponding to “mom” stored in a contact list may be included in the annotated result).”);
extracting, based on natural language understanding (NLU) information, a NLU result from each of the plurality of candidate speech recognition results (see Col. 8, lines 31-34: “(35) The NLU may process several textual inputs related to the same utterance. For example, if the ASR 250 outputs N text segments (as part of an N-best list), the NLU may process all N outputs to obtain NLU results.”);
associating, via a neural network ranker, a ranking score to each of the plurality of candidate speech recognition results, the ranking score being based on the plurality of feature vectors and the NLU result of each of the plurality of candidate speech recognition results, wherein the neural network ranker promotes the second confidence score to be greater than the first confidence score based on the NLU related features (see Col. 11, lines 33-41; Col. 13, line 66-Col. 14, line 4; Col. 21, lines 5-45: “(51) For example, in a typical NLU system, the system may include the a multi-domain architecture consisting of multiple domains for the built-in intents executable by the system, such as music, video, books, and information. An example architecture for processing the built-in domains is illustrated along the right-hand side of FIG. 3 and may include the built-in domain recognizers 335, the built-in cross domain processing 355, the heavy slot filler and entity resolver 370 and the re-scorer and final ranker 390. (68) While the cross-domain ranker 350 takes as input the built-in N-best lists 340, it may also consider other information, such as other data 391. The cross-domain ranker may use a number of different models or techniques such as a maximum entropy classifier, deep neural network, or the like. (99) Those confidence scores may be used to determine how to rank the individual NLU results represented in the N-best lists. The confidence scores may be affected by unfilled slots. For example, if one domain is capable of filling a slot (i.e., resolving the word in the slot to an entity or other recognizable form) for an input query the results from that domain may have a higher confidence than those from a different domain that is not capable of filling a slot. (100) The final ranker 390 may be configured to apply re-scoring, biasing, or other techniques to obtain the most preferred ultimate result. To do so, the final ranker 390 may consider not only the NLU results of the N-best lists, but may also consider other data 391. This other data 391 may include a variety of information. For example, the other data 391 may also include application rating or popularity. For example, if one application has a particularly high rating, the system may increase the score of results associated with that particular application. […]”);
selecting a speech recognition result from the plurality of candidate speech recognition results that is associated with the ranking score having the highest value (see Col. 2, lines 58-67: “(15) Further, during runtime, a speech processing system may process a single utterance using multiple domains at the same time, or otherwise substantially in parallel. As the system may not know ahead of time what domain the utterance belongs in until the speech processing is complete, the system may process text of an utterance substantially simultaneously using models and components for different domains (e.g., books, video, music, etc.). The results of that parallel processing may be ranked, with the highest ranking results being executed/returned to the user.”); and
operating the system using the selected speech recognition result from the plurality of candidate speech recognition results corresponding to the highest ranking score as an input (see Col. 21, line 46 – Col. 22, line 4: “(101) The highest scoring result may be passed to a downstream command processor 290 for execution. If the highest scoring result belongs to a supplemental application, the downstream command processor 290 may be located separately from the system, for example command processor 290-X shown in FIG. 1. The final ranker 390 may be configured to output a top list of answers for further disambiguation/selection to determine which potential answer should be further processed/executed. Thus, the downstream command processor 290 may also be capable of outputting data to the user related to the top scoring result prior to execution. For example, if the input query text includes “get me an Uber to Boston” but the user account associated with the device 110 that sent the original query does not have the Uber application enabled, the system may output to the device 110 “do you want to enable Uber and order you a car?” Or the system may prompt the user to disambiguate and select a particular application. For example, if the input query text includes “get me a car to Boston” the system may output to the device 110 “you have enabled both Uber and Lyft. Which would you like to use to order a car to Boston?” Such interactions may also take place as part of a single session between the server 120 and the device 110, allowing the system to hold final execution of NLU results until the user activates an application, or selects which application should be used to process the query.”).
However, Reavely et al. does not explicitly teach:
receiving a first plurality of feature vectors from each of the plurality of candidate speech recognition results from a first speech recognition engine, the first plurality of feature vectors includes a first confidence score;
receiving a second plurality of feature vectors from each of the plurality of candidate speech recognition results from a second speech recognition engine that is different from the first speech recognition engine, the second plurality of feature vectors includes a second confidence score that is lower than the first confidence score;
compressing, to a shared projection layer, the first plurality of feature vectors and the second plurality of feature vectors via the shared projection layer based on the NLU result and NLU related features; and 
compressing the shared projection layer to a second projection layer further based on the NLU result and NLU related features.

Zhou et al. does teach:
receiving a first plurality of feature vectors from each of the plurality of candidate speech recognition results from a first speech recognition engine, the first plurality of feature vectors includes a first confidence score (see ¶ [0005, 0013]: “[0005] In one embodiment, a method for performing speech recognition using hybrid speech recognition results has been developed. The method includes generating, with an audio input device, audio input data corresponding to speech input from a user, generating, with a controller, a first plurality of candidate speech recognition results corresponding to the audio input data using a first general-purpose speech recognition engine, generating, with the controller, a second plurality of candidate speech recognition results corresponding to the audio input data using a first domain-specific speech recognition engine[…] [0013] As used herein, the term “speech recognition result” refers to a machine-readable output that the speech recognition engine generates for a given input. The result can be, for example, text encoded in a machine-readable format or another set of encoded data that serve as inputs to control the operation of an automated system. Due to the statistical nature of speech recognition engines, in some configurations the speech engine generates multiple potential speech recognition results for a single input. The speech engine also generates a “confidence score” for each speech recognition result, where the confidence score is a statistical estimate of the likelihood that each speech recognition result is accurate based on the trained statistical model of the speech recognition engine.”);
receiving a second plurality of feature vectors from each of the plurality of candidate speech recognition results from a second speech recognition engine that is different from the first speech recognition engine, the second plurality of feature vectors includes a second confidence score that is lower than the first confidence score (see ¶ [0005, 0013] citations from previous limitation and ¶ [0037, 0054]: “[0037] […] For example, in FIG. 4 the domain-specific speech recognition engine 162 is specifically trained to recognize street names and other geographic terms with a higher accuracy than a general-purpose speech recognition engine. [0054] As described above, the controller 148 identifies the highest ranked speech recognition result based in part upon the confidence scores that are associated with each speech recognition result. The confidence scores are statistical values of an estimate of accuracy (confidence) for each speech recognition result that the speech recognition engines 162 generate in association with the speech recognition results.”);
Reavely et al. and Zhou et al. are both considered to be analogous to the claimed invention because they are in the same field of endeavor in speech processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Reavely et al. to incorporate the teachings of Zhou et al. of receiving a first plurality of feature vectors from each of the plurality of candidate speech recognition results from a first speech recognition engine, the first plurality of feature vectors includes a first confidence score; and receiving a second plurality of feature vectors from each of the plurality of candidate speech recognition results from a second speech recognition engine that is different from the first speech recognition engine, the second plurality of feature vectors includes a second confidence score that is lower than the first confidence score which provides the benefit of improving the accuracy of the final speech recognition result ([0018] of Zhou et al.).

However, Reavely et al. in combination with Zhou et al. do not explicitly teach:
compressing, to a shared projection layer, the first plurality of feature vectors and the second plurality of feature vectors via the shared projection layer based on the NLU result and NLU related features; and 
compressing the shared projection layer to a second projection layer further based on the NLU result and NLU related features.

Sundermeyer et al. does teach:
compressing, to a shared projection layer, the first plurality of feature vectors and the second plurality of feature vectors via the shared projection layer based on the NLU result and NLU related features (see Fig. 1 and ¶ 2 of Section III. Review of Neural Network LMs (page 519): “The input data w’i-2 and w’i-1 are the one-hot encoded predecessor words w i-2 and w i-1, where the weight matrix A is tied for all history words. The vectors A1w’i-2 and A1w’i-1 are then concatenated, indicated by the operator, to form the projection layer activation yi…” Here, the shared projection layer is interpreted as the projection layer (Fig. 1) and the vectors A1w’i-2 and A1w’i-1 are interpreted as the first and second plurality of feature vectors. Although Sundermeyer et al. does not explicitly discuss that these vectors are based on the NLU results and related features, the primary reference (Reavely et al.) discloses processing of ASR results to obtain NLU results, and the secondary reference (Zhou et al.) discloses the use of multiple (first and second) pluralities of feature vectors. Hence, Reavely et al. in combination with Zhou et al. and further in view of Sundermeyer et al. is interpreted to disclose the compression of the first and second plurality of feature vectors via a shared projection layer based on NLU results/related features.);
compressing the shared projection layer to a second projection layer further based on the NLU result and NLU related features (see Fig. 1 and ¶ 2 of Section III. Review of Neural Network LMs (page 519) citation as in previous limitation and further: “Multiplying yi with A2, and applying the sigmoid activation function, […], which is computed element-wise for the vector A2yi, results in the hidden layer activation zi.” Here, the second projection layer is interpreted as the hidden layer (Fig. 1; after the projection layer).);
Reavely et al. in combination with Zhou et al. and Sundermeyer et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in speech processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Reavely et al. in combination with Zhou et al. to incorporate the teachings of Sundermeyer et al.  of compressing, to a shared projection layer, the first plurality of feature vectors and the second plurality of feature vectors via the shared projection layer based on the NLU result and NLU related features; and compressing the shared projection layer to a second projection layer further based on the NLU result and NLU related features which provides the benefit of reducing the computational complexity (¶ 3 of Section III. Review of Neural Network LMs (page 519) from Sundermeyer et al.).
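For illustration only, the feedforward computation quoted from Sundermeyer et al. above (a tied projection matrix applied to each one-hot history word, concatenation into the projection layer activation yi, then multiplication by A2 and an element-wise sigmoid to obtain the hidden layer activation zi) can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the vocabulary size of 4, and all weight values are hypothetical and do not come from any cited reference or from the claimed invention.

```python
import math


def one_hot(index, size):
    """Return a one-hot vector of the given size (hypothetical helper)."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(prev2, prev1, A1, A2, vocab_size):
    """Projection layer followed by hidden layer, as in the quoted passage."""
    # The same (tied) matrix A1 projects both history words.
    p2 = matvec(A1, one_hot(prev2, vocab_size))
    p1 = matvec(A1, one_hot(prev1, vocab_size))
    y = p2 + p1  # list concatenation forms the projection layer activation
    z = [sigmoid(v) for v in matvec(A2, y)]  # hidden layer activation
    return y, z


# Toy, made-up dimensions and weights (assumptions for this sketch only).
VOCAB = 4
A1 = [[0.1, 0.2, 0.3, 0.4],
      [0.5, 0.6, 0.7, 0.8]]        # 2 x 4 tied projection matrix
A2 = [[1.0, 0.0, -1.0, 0.0],
      [0.0, 1.0, 0.0, -1.0],
      [0.5, 0.5, 0.5, 0.5]]        # 3 x 4 hidden-layer weights

y, z = forward(0, 2, A1, A2, VOCAB)
```

In this toy setting each history word is reduced from 4 dimensions to 2 by the tied matrix, which is the sense in which the projection layer is read as a compression step in the rejection above.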

As to independent claim 9, Reavely et al. in combination with Zhou et al. and Sundermeyer et al. teach all the limitations as in claim 1:
A method executed by a controller for speech recognition in a system (see Col. 6, lines 59-67 and Col. 23, lines 33-35 citations of Reavely et al. as in claim 1 above) comprising:
parsing a plurality of candidate speech recognition results from a speech input (see Col. 8, lines 15-30 and Col. 8, lines 35-44 citations of Reavely et al. as in claim 1 above);
extracting a first plurality of feature vectors from each of the plurality of candidate speech recognition results via a first speech recognition engine (see ¶ [0005, 0013] citations of Zhou et al. as in claim 1 above);
extracting a second plurality of feature vectors from each of the plurality of candidate speech recognition results via a second speech recognition engine that is different from the first speech recognition engine (see ¶ [0005, 0013, 0037] citations of Zhou et al. as in claim 1 above);
extracting, based on natural language understanding (NLU) information, a NLU result from each of the plurality of candidate speech recognition results (see Col. 8, lines 31-34 citations of Reavely et al. as in claim 1 above);
compressing, to a shared projection layer, the first plurality of feature vectors and the second plurality of feature vectors via a shared projection layer based on the NLU result and NLU related features (see Fig. 1 and ¶ 2 of Section III. Review of Neural Network LMs (page 519) citations of Sundermeyer et al. as in claim 1 above);
compressing the shared projection layer to a second projection layer further based on the NLU result and NLU related features (see Fig. 1 and ¶ 2 of Section III. Review of Neural Network LMs (page 519) citations of Sundermeyer et al. as in claim 1 above);
associating, via a neural network ranker, a ranking score to each of the plurality of candidate speech recognition results, the ranking score being based on the plurality of feature vectors and the NLU result of each of the plurality of candidate speech recognition results (see Col. 11, lines 33-41; Col. 13, line 66-Col. 14, line 4; Col. 21, lines 5-45 citations of Reavely et al. as in claim 1 above);
selecting a speech recognition result from the plurality of candidate speech recognition results that is associated with the ranking score having the highest value (see Col. 2, lines 58-67 citation of Reavely et al. as in claim 1 above); and
operating the system using the selected speech recognition result from the plurality of candidate speech recognition results corresponding to the highest ranking score as an input (see Col. 21, line 46 – Col. 22, line 4 citation of Reavely et al. as in claim 1 above).

Regarding claims 2 and 10, Reavely et al. in combination with Zhou et al. and Sundermeyer et al. teach all the limitations as in claims 1 and 9 above.
Sundermeyer et al. further teaches:
 wherein the neural network ranker is a deep feedforward neural network ranker (see Fig. 1 and ¶ 2 of Section VI. Rescoring with Neural Network LMs (page 521): “A better representation of the search space is obtained by rescoring lattices as a replacement of n-best lists. Lattices are usually created with a count LM using a context size of at most four words. If a feedforward neural network LM is used, it is possible to simply replace the count LM estimates with those of the neural network model, and to use standard rescoring algorithms directly on the lattice.” Here, the feedforward neural network in Sundermeyer et al. is interpreted as analogous to a feedforward “deep neural network” given that the architecture consists of at least one hidden layer between the input and the output layers, see Fig. 1 (hidden layer and projection layer).).
Reavely et al. in combination with Zhou et al. and Sundermeyer et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in speech processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Reavely et al. in combination with Zhou et al. to incorporate the teachings of Sundermeyer et al.  wherein the neural network ranker is a deep feedforward neural network ranker which provides the benefit of reducing the computational complexity (¶ 3 of Section III. Review of Neural Network LMs (page 519) from Sundermeyer et al.).

Regarding claims 3 and 11, Reavely et al. in combination with Zhou et al. and Sundermeyer et al. teach all the limitations as in claims 1 and 9 above.
Sundermeyer et al. further teaches:
wherein the compression is via a shared projection matrix (see Fig. 1 and ¶ 2 of Section III. Review of Neural Network LMs (page 519) citation as in claim 1: “The input data w’i-2 and w’i-1 are the one-hot encoded predecessor words w i-2 and w i-1, where the weight matrix A is tied for all history words. The vectors A1w’i-2 and A1w’i-1 are then concatenated, indicated by the operator, to form the projection layer activation yi…” Here, the projection layer is interpreted as the projection layer (Fig. 1) and the vectors A1w’i-2 and A1w’i-1 are interpreted as the first and second plurality of feature vectors.).
Reavely et al. in combination with Zhou et al. and Sundermeyer et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in speech processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Reavely et al. in combination with Zhou et al. to incorporate the teachings of Sundermeyer et al. wherein the compression is via a shared projection matrix which provides the benefit of reducing the computational complexity (¶ 3 of Section III. Review of Neural Network LMs (page 519) from Sundermeyer et al.).

Regarding claims 6 and 14, Reavely et al. in combination with Zhou et al. and Sundermeyer et al. teach all the limitations as in claims 1 and 9 above.
Reavely et al. further teaches:
wherein the NLU information is a slot-based trigger features or a semantic feature representing slot and intent-sensitive sentence embedding (see Col. 2, lines 46-67 and Col. 9, line 64 - Col. 10 line 8: “(14) Present NLU query answering systems typically employ a multi-domain architecture where each domain represent a certain subject area for a system. Example domains include weather, music, shopping, etc. Each domain is typically configured with its own intents, slot structure, or the like as well as its own logic or other components needed to complete the NLU processing for a particular query. Thus, in order to configure a system to handle a new function, intents, slots and other items used for speech processing need to be specially designed, configured, and tested for each new function. This leads to significant resource expenditures to train and enable the system to handle additional domains. (43) The intents identified by the IC module 264 are linked to domain-specific grammar frameworks (included in 276) with “slots” or “fields” to be filled. Each slot/field corresponds to a portion of the query text that the system believes corresponds to an entity. For example, if “play music” is an identified intent, a grammar (276) framework or frameworks may correspond to sentence structures such as “Play {Artist Name},” “Play {Album Name},” “Play {Song name},” “Play {Song name} by {Artist Name},” etc.”).

Regarding claims 7 and 15, Reavely et al. in combination with Zhou et al. and Sundermeyer et al. teach all the limitations as in claims 1 and 9 above.
Zhou et al. further teaches:
wherein the first speech recognition engine is a domain-specific speech recognition engine, and the second speech recognition engine is a general-purpose speech recognition engine or cloud-based speech recognition engine (see ¶ [0005]: “[0005] In one embodiment, a method for performing speech recognition using hybrid speech recognition results has been developed. The method includes generating, with an audio input device, audio input data corresponding to speech input from a user, generating, with a controller, a first plurality of candidate speech recognition results corresponding to the audio input data using a first general-purpose speech recognition engine, generating, with the controller, a second plurality of candidate speech recognition results corresponding to the audio input data using a first domain-specific speech recognition engine[…]”).
Reavely et al. in combination with Zhou et al. and Sundermeyer et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in speech processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Reavely et al. to incorporate the teachings of Zhou et al. wherein the first speech recognition engine is a domain-specific speech recognition engine, and the second speech recognition engine is a general-purpose speech recognition engine or cloud-based speech recognition engine, which provides the benefit of improving the accuracy of the final speech recognition result ([0018] of Zhou et al.).

Regarding claim 20, Reavely et al. in combination with Zhou et al. teach all the limitations as in claim 17. 
However, Reavely et al. in combination with Zhou et al. do not explicitly teach:
wherein the processor is further programmed to compress, to a shared projection layer, the first plurality of feature vectors and the second plurality of feature vectors via the shared projection layer based on the NLU result and NLU related features, and compress the shared projection layer to a second projection layer further based on the NLU result and NLU related features.

Sundermeyer et al. does teach:
wherein the processor is further programmed to compress, to a shared projection layer, the first plurality of feature vectors and the second plurality of feature vectors via the shared projection layer based on the NLU result and NLU related features, and compress the shared projection layer to a second projection layer further based on the NLU result and NLU related features (see Fig. 1 and ¶ 2 of Section III. Review of Neural Network LMs (page 519) citation of Sundermeyer et al. as in claim 1 above).
Reavely et al. in combination with Zhou et al. and Sundermeyer et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in speech processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Reavely et al. in combination with Zhou et al. to incorporate the teachings of Sundermeyer et al.  wherein the processor is further programmed to compress, to a shared projection layer, the first plurality of feature vectors and the second plurality of feature vectors via the shared projection layer based on the NLU result and NLU related features, and compress the shared projection layer to a second projection layer further based on the NLU result and NLU related features which provides the benefit of reducing the computational complexity (¶ 3 of Section III. Review of Neural Network LMs (page 519) from Sundermeyer et al.).

Claims 8 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Reavely; Simon Peter (US 10453117 B1; hereinafter referred to as Reavely et al.) further in view of Zhou; Zhengyu (US 20180096678 A1; hereinafter referred to as Zhou et al.) and M. Sundermeyer, et al. ("From Feedforward to Recurrent LSTM Neural Networks for Language Modeling," in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 3, pp. 517-529, March 2015, doi: 10.1109/TASLP.2015.2400218. https://ieeexplore.ieee.org/document/7050391; hereinafter referred to as Sundermeyer et al.) as in claims 1 and 9 above and further in view of ZHOU, Xiao-tian et al. (CN 111291272 A; hereinafter referred to as ZHOU, Xiao-tian et al.). 

Regarding claims 8 and 16, Reavely et al. in combination with Zhou et al. and Sundermeyer et al. teach all the limitations as in claims 1 and 9 above.
However, Reavely et al. in combination with Zhou et al. and Sundermeyer et al. do not explicitly teach:
wherein the first plurality of feature vectors and the second plurality of feature vectors include a Bidirectional Long Short-Term Memory (BLSTM) feature.
ZHOU, Xiao-tian et al. does teach:
wherein the first plurality of feature vectors and the second plurality of feature vectors include a Bidirectional Long Short-Term Memory (BLSTM) feature (see last two paragraphs of page 6: “S302, the to-be-detected data and sample data are respectively input to the two bidirectional long-term memory (Long Short-Term LCD/keyboard (hereinafter: LSTM) model for vectorization processing, respectively obtaining the first characteristic vector and the second characteristic vector. In the embodiment of the invention, the to-be-detected data with the sample data in the sample library is file data of the same type, the type may include, but are not limited to: text, voice, image or the like. In order to calculate the similarity between the detected data and the sample data, respectively performing vectorization processing to it, by calculating the similarity between vector to calculate similarity between files. In the embodiment of the invention, the mode can adopt the LSTM model, the to-be-detected data input in a bidirectional LSTM model, obtaining a first feature vector, and the sample data input to another bidirectional LSTM model to obtain the second feature vector. as twin neural network, shared parameters between the two bidirectional LSTM model.” Here, the to-be-detected data and sample data are interpreted as the first and second plurality of feature vectors.).
Reavely et al. in combination with Zhou et al., Sundermeyer et al., and ZHOU, Xiao-tian et al. are considered to be analogous to the claimed invention because they are in the same field of endeavor in speech processing. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified Reavely et al. to incorporate the teachings of ZHOU, Xiao-tian et al. wherein the first plurality of feature vectors and the second plurality of feature vectors include a Bidirectional Long Short-Term Memory (BLSTM) feature, which provides the benefit of extracting the most valuable features, allowing an improved fitting ability and ensuring convergence (second paragraph of page 7 of ZHOU, Xiao-tian et al.).

Claim 17 is rejected under 35 U.S.C. 103 as being unpatentable over Reavely; Simon Peter (US 10453117 B1; hereinafter referred to as Reavely et al.) further in view of Zhou; Zhengyu (US 20180096678 A1; hereinafter referred to as Zhou et al.).

As to independent claim 17, Reavely et al. in combination with Zhou et al. teach all the limitations as in claim 1.
Reavely et al. further teaches: a speech recognition system, comprising:
a microphone configured to receive a speech input from one or more users (see Col. 4, lines 50-67: “(21) […] An audio capture component, such as a microphone of device 110, captures audio 11 corresponding to a spoken utterance.”); 
a processor in communication with the microphone (see Col. 6, lines 11-39: “(26) […] A spoken utterance in the audio data is input to a processor configured to perform ASR which then interprets the utterance based on the similarity between the utterance and pre-established language models 254 stored in an ASR model knowledge base (ASR Models Storage 252).”), the processor programmed to [most limitations of claim 1]:
parse a plurality of candidate speech recognition results from a speech input (see Col. 8, lines 15-30 and Col. 8, lines 35-44 citation of Reavely et al. as in claim 1 above);
receive a first plurality of feature vectors from each of the plurality of candidate speech recognition results from a first speech recognition engine, the first plurality of feature vectors includes a first confidence score (see ¶ [0005, 0013] citations of Zhou et al. as in claim 1 above);
receive a second plurality of feature vectors from each of the plurality of candidate speech recognition results from a second speech recognition engine that is different from the first speech recognition engine, the second plurality of feature vectors includes a second confidence score that is lower than the first confidence score (see ¶ [0005, 0013, 0037] citations of Zhou et al. as in claim 1 above);
extract, based on natural language understanding (NLU) information, a NLU result from each of the plurality of candidate speech recognition results (see Col. 8, lines 31-34 citations of Reavely et al. as in claim 1 above);
associate, via a neural network ranker, a ranking score to each of the plurality of candidate speech recognition results, the ranking score being based on the plurality of feature vectors and the NLU result of each of the plurality of candidate speech recognition results, wherein the neural network ranker promotes the second confidence score to be greater than the first confidence score based on the NLU related features (see Col. 11, lines 33-41; Col. 13, line 66-Col. 14, line 4; Col. 21, lines 5-45 citations of Reavely et al. as in claim 1 above); and
select a speech recognition result from the plurality of candidate speech recognition results that is associated with the ranking score having the highest value (see Col. 2, lines 58-67 citation of Reavely et al. as in claim 1 above).

Regarding claim 18, Reavely et al. in combination with Zhou et al. and Sundermeyer et al. teach all the limitations as in claim 1.
Reavely et al. further teaches:
wherein the processor is further programmed to operate the system using the selected speech recognition result from the plurality of candidate speech recognition results corresponding to the highest ranking score as an input (see Col. 21, line 46 – Col. 22, line 4 citation of Reavely et al. as in claim 1 above).

Regarding claim 19, Reavely et al. in combination with Zhou et al. and Sundermeyer et al. teach all the limitations as in claim 1.
Reavely et al. further teaches:
wherein the processor is further programmed to train a neural network associated with the speech recognition system utilizing at least the NLU result (see Col. 8, lines 31-34 and Fig. 2 (NER component, 262; part of the NLU system), Col. 9, lines 17-63, and Col. 18, lines 21-36: “(35) The NLU may process several textual inputs related to the same utterance. For example, if the ASR 250 outputs N text segments (as part of an N-best list), the NLU may process all N outputs to obtain NLU results. (42) In order to generate a particular interpreted response, the NER 262 applies the grammar models and lexical information associated with the respective domain to actually recognize a mention one or more entities in the text of the query. In this manner the NER 262 identifies “slots” (i.e., particular words in query text) that may be needed for later command processing. (88) Further, training examples may include specific applications as ground truth labels, allowing the model for the supplemental intent category recognizer 310 to handle such inputs as well. The input training examples may also include labeled slots, thus allowing the model for the supplemental intent category recognizer 310 to learn the high level slot tagging as well.”).

Allowable Subject Matter
Claims 4-5 and 12-13 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
The following is a statement of reasons for the indication of allowable subject matter:  
Reavely et al. in combination with Zhou et al. and Sundermeyer et al. teach all of the limitations as in claims 3 and 11, above.
Reavely et al. in combination with Zhou et al. and Sundermeyer et al. further teach a method “further comprising, in response to the first and second plurality of feature vectors being less than a threshold size, the second plurality of feature vectors are fed directly to the neural network ranker, wherein the threshold size of feature vectors is less than 2 features per hypothesis.” (see ¶ [0015, 0026] from Zhou et al.: “[0015] As used herein, the term “trigger pair” refers to two words, each of which can either be a word (e.g., “play”) or a predetermined class (e.g., <Song Name>) representing a word sequence (e.g., “Poker Face”) that falls within the predetermined class, such as the proper name of a song, person, location name, etc. [0026] For example, given the feature vectors that are generated for two candidate speech recognition results h1 and h2 as inputs, the controller 148 executes the pairwise ranker 164 to generate an a first “positive” output, meaning h1 wins, if the feature vector input for h1 has a lower estimate word error rate than h2, which indicates that h1 is “better” than h2. Otherwise, the pairwise ranker 164 generates a second “negative” output to indicate that the estimate word error rate of h2 is lower than h1. After processing every pair of candidate speech recognition results, the system 100 identifies the candidate speech recognition result with the greatest number of wins from the pairwise ranker 164 as the highest ranked candidate speech recognition result. For example, for a hypothesis list “h1, h2, h3”, if h2 wins in the hypothesis pair (h1, h2), h1 wins in (h1, h3) and h2 wins in (h2, h3), h1, h2, h3 win 1 time, 2 times, and 0 times, respectively. Since h2 wins the largest number of times, the system 100 identifies h2 as the highest ranked candidate speech recognition result. 
Alternative embodiments of the pairwise ranker 164 use other classification techniques instead of the Random Forest approach to rank the candidate speech recognition results. In some embodiments, the pairwise ranker 164 is also trained using other classification features, such as the confidence score related feature and the “bag-of-words with decay” related features, in addition to the trigger pair related features. […].” Here, it is interpreted that the threshold is two for the pairwise ranker, where the trigger pairs refer to 2 words (two features in the feature vectors/hypotheses (h1, h2)).).
However, as noted with respect to claims 4 and 12, Reavely et al. in combination with Zhou et al. and Sundermeyer et al. fail to teach the method “further comprising, in response to the first and second plurality of feature vectors being less than a threshold size, bypassing, by the controller, the shared and second projection layers such that the second plurality of feature vectors are fed directly to the neural network ranker, wherein the threshold size of feature vectors is less than 2 features per hypothesis.”
Claims 5 and 13 would be allowable because they are dependent on claims 4 and 12, respectively.
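For illustration only, the win-counting scheme quoted above from ¶ [0026] of Zhou et al. can be sketched as follows. This is a minimal sketch and not the implementation of the reference: the function names and the estimated word error rate values are hypothetical, and a direct comparison of estimated word error rates stands in for the trained Random Forest pairwise ranker 164.

```python
from itertools import combinations
from typing import Callable, Dict, Sequence


def rank_by_pairwise_wins(
    hypotheses: Sequence[str],
    first_wins: Callable[[str, str], bool],
) -> Dict[str, int]:
    """Count how many pairwise comparisons each hypothesis wins."""
    wins = {h: 0 for h in hypotheses}
    # Visit every unordered pair once, e.g. (h1, h2), (h1, h3), (h2, h3).
    for a, b in combinations(hypotheses, 2):
        # "Positive" output: the first hypothesis wins; otherwise the second.
        winner = a if first_wins(a, b) else b
        wins[winner] += 1
    return wins


# Hypothetical estimated word error rates, chosen so the outcome matches
# the "h1, h2, h3" example in the quoted paragraph.
estimated_wer = {"h1": 0.2, "h2": 0.1, "h3": 0.3}


def lower_wer_wins(a: str, b: str) -> bool:
    # Assumption: stand-in for the trained pairwise classifier.
    return estimated_wer[a] < estimated_wer[b]


wins = rank_by_pairwise_wins(["h1", "h2", "h3"], lower_wer_wins)
best = max(wins, key=wins.get)
print(wins, best)
```

With these assumed values, h1, h2, and h3 win 1, 2, and 0 times, respectively, so h2 is identified as the highest ranked candidate, consistent with the quoted example.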

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Keisha Y Castillo-Torres whose telephone number is (571)272-3975. The examiner can normally be reached Monday - Friday, 8:30 am - 4:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre-Louis Desir, can be reached at (571)272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

Keisha Y. Castillo-Torres
Examiner
Art Unit 2659



/Keisha Y. Castillo-Torres/Examiner, Art Unit 2659
/Paras D Shah/Primary Examiner, Art Unit 2659
04/29/2022