Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 10/02/2020 has been entered.

Status of Claims
This action is in reply to the amendments and remarks filed on 10/02/2020.
Claims 1-20 are pending.
Claims 1, 12 and 17 have been amended.  

Response to Arguments
Applicant’s arguments, with respect to the rejection(s) of claim(s) 1, 12, and 17 under 35 U.S.C. 103, have been considered but they are not persuasive. 
First, the applicant argues that no art of record teaches “grouping, by the hardware processor, related candidates identified in the candidate data into respective groups of a confusion network”, since “there seems to be no connection entirely separate process from the audio frontend, regardless of whether they are performed on the same hardware.” The applicant further argues that Xue and Henry do not cure the deficiencies of Ladhak. The examiner respectfully disagrees.
Ladhak, Figs. 2 and 11 show the “Acoustic Front End (AFE)” as a part of the “Automatic Speech Recognition (ASR)” and indicate that it can be executed on the same device (same processor). Further, Col. 3, lines 15-27 and Figs. 2 and 11 teach the ASR using lattices and outputting “the resulting word lattice, in addition to (or instead of) simply outputting the top answer”. Col. 6, lines 22-48 and Figs. 6 and 9A then teach that these previously used/determined “resulting lattices or confusion networks” are utilized by the wakeword module to search for detected “wakeword[s]”; therefore, Ladhak implies that “confusion networks” are used in the teachings in alternative to lattices (as mapped above) and that “the resulting…confusion networks” are determined by the ASR.
Regarding applicant’s argument concerning the secondary reference Xue (Remarks filed 10/02/2020, section 2, paragraph 5), Xue, section 2 and section 3 paragraphs 5-6, section 3 “Algorithm Improvement”, and Figs. 2a-2b and 7 discuss how a confusion network is improved through grouping overlapping links and preserving the order, as well as creating an “improved” confusion network by executing different styles of confusion networks. Therefore, Xue does not merely state the “fact that confusion networks exist in the art”, but teaches how they are improved through, among additional methods, link groupings. See the 35 U.S.C. 103 section for full mapping of claim limitations.
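For illustration only, the grouping of time-overlapped links into ordered clusters discussed above can be sketched as follows. This is a simplified sketch and not Xue's algorithm: it clusters links solely by time overlap and ignores phonetic similarity and word probabilities. The `Link` type, the `group_overlapping` function, and the sample words "world" and "word" with their times and probabilities are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """Hypothetical lattice link: a word hypothesis with a time span and a probability."""
    word: str
    start: float
    end: float
    prob: float


def group_overlapping(links):
    """Group links whose time spans overlap, preserving time order.

    Assumption: overlap in time is the only clustering criterion here.
    """
    groups = []
    group_end = None
    for link in sorted(links, key=lambda l: (l.start, l.end)):
        if groups and link.start < group_end:
            # Link overlaps the current cluster: add it and extend the cluster span.
            groups[-1].append(link)
            group_end = max(group_end, link.end)
        else:
            # No overlap: start a new cluster after the previous one.
            groups.append([link])
            group_end = link.end
    return groups


links = [
    Link("hello", 0.0, 0.5, 0.6),
    Link("yellow", 0.0, 0.5, 0.4),
    Link("world", 0.5, 1.0, 0.9),
    Link("word", 0.55, 1.0, 0.1),
]
groups = group_overlapping(links)
```

In this sketch the competing hypotheses "hello" and "yellow" fall into the first cluster and the later pair into the second, so the precedence order of the clusters follows the time order of the links.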

Second, the applicant argues that no art of record teaches “calculating…for each of the candidates, a temporal next state of a Recurrent Neural Network (RNN)…the temporal next state of each related candidate in a group of the respective groups being calculated before calculating the temporal next state of the candidates in a next group of the respective groups”, since Ladhak does not suggest “that temporal next states for each candidate in a given group…would be calculated before calculating a next temporal state for a candidate in a next group”, and that “Ladhack does not actually teach…that each of the first words is processed before the second word is processed.” The examiner respectfully disagrees. Ladhak, Col. 19, line 19-Col. 20, line 51 and Fig. 9A teach “the process for decoding the first word may be repeated for each of the multiple first words and the process for decoding the second word is repeated for one or more second words following the multiple first words and so on”. See the 35 U.S.C. 103 section for full mapping of claim limitations.

Third, the applicant argues that no art of record teaches the amended claims 1, 12, and 17 limitations, which now recite “merging, by the hardware processor, the temporal next state of the related candidates of each group of the respective groups to obtain a plurality of merged temporal next states, each weighted by a probability of a corresponding candidate in the confusion network”, in view of the argued points above. The examiner respectfully disagrees. Ladhak, Col. 4, line 65-Col. 5, line 20 teach a server (hardware processor) that, as taught in Col. 4, lines 4-44, Col. 17, line 57-Col. 18, line 2 and Figs. 6 and 9A, uses an RNN including “a combination function (such as a pooling function) (merging)” when a node (temporal next state) is “reached through more than one arc (related candidates of each group)” to “determine a speech recognition result” at nodes (plurality of merged temporal next states).
Further, Col. 4, lines 4-44, Col. 6, lines 22-48, Col. 12, line 66-Col. 13, line 11, Col. 20, line 52-Col. 21, line 4, and Col. 23, lines 7-29 teach each lattice, or confusion network, node’s path word scores/probabilities “represent weights (each weighted by a respective occurrence probability)” and using a “weighted pooling function”; since, as taught above, Col. 3, lines 15-27, Col. 6, lines 22-48, and Figs. 2, 6, 9A, and 11 teach that “confusion networks” are used in the teachings in alternative to lattices and that “the resulting…confusion networks” are determined by the ASR. See the 35 U.S.C. 103 section for full mapping of claim limitations necessitated by applicant amendments.


Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claims 1-3, 5-14, and 16-19 are rejected under 35 U.S.C. 103 as being unpatentable over Ladhak et al (US Patent 10176802), hereinafter Ladhak, in view of Xue et al (“Improved Confusion Network Algorithm and Shortest Path Search from Word Lattice”, 2005), hereinafter Xue.
Examiner note: based on applicant’s spec, the examiner is interpreting “candidates” to include or represent text words (paragraph 0018), music notes (paragraph 0039), etc., and the “candidate data” to contain more than one candidates (paragraphs 0034-0035) or representations of candidates (paragraph 0018). It is understood that “more than one candidates” is synonymous with a “plurality of candidates” and, therefore, in some cases the “candidate data” can be equivalent to the “plurality of candidates”.
Regarding claims 1, 12, and 17, Ladhak teaches a method, apparatus, comprising: a processor (Col. 4, line 65-Col.5, line 20, Col. 8, lines 34-48, Col. 24, lines 26-43 and Figs. 1 and 2 teach a server including processors (processor)), and computer program product comprising a non-transitory computer readable storage medium having program instructions embodied therewith, the program instructions executable by a computer to cause the computer to perform a method (Col. 24, lines 24-53 and Col. 27, lines 24-40 teach “non-transitory computer readable storage medium” storing “instructions for causing a computer (executable by a computer)” to perform the embodiments of the disclosure), the method comprising:
obtaining, by a hardware processor, candidate data representing a plurality of candidates (Col. 4, line 65-Col. 5, line 20, and Col. 6, line 22-Col. 7, line 28 and Fig. 1 teach a server (hardware processor) receiving (obtaining) “audio data 111 (candidate data) corresponding to (representing) the spoken utterance (plurality of candidates)” of the user’s audio input and that the utterance can be a “sentence (plurality of candidates)”);
grouping, by the hardware processor, related candidates identified in the candidate data into respective groups of a confusion network (Col. 4, line 65-Col. 5, line 20 teach a server (hardware processor) that, as taught in Col. 7, lines 31-52, Col. 13, line 47-Col. 14, line 16 and Fig. 6, divides “audio data into frames” and associates (grouping) “two potential word choices (related candidates identified)”: “hello” or “yellow” with a “first group of acoustic frames (candidate data into respective groups of a confusion network)”. Further, Col. 3, lines 15-27 and Figs. 2 and 11 teach the ASR using lattices and outputting “the resulting word lattice, in addition to (or instead of) simply outputting the top answer”. Col. 6, lines 22-48 and Figs. 6 and 9A then teach that these previously used/determined “resulting lattices or confusion networks” are utilized by the wakeword module to search for detected “wakeword[s]”; therefore, Ladhak implies that “confusion networks” are used in the teachings in alternative to lattices (as mapped above) and that “the resulting…confusion networks” are determined by the ASR.);
calculating, by the hardware processor, for each of the candidates, a temporal next state of a Recurrent Neural Network (RNN) by inputting a corresponding one of the candidates to the RNN at a current state (Col. 4, line 65-Col. 5, line 20 teach a server (hardware processor) that, as taught in Col. 11, line 47-Col. 12, line 40 and Figs. 3 and 4, uses an RNN to predict “the potential next word (temporal next state of a RNN)” based on previous and “most recent (inputting candidates at a current state)” word inputs. Col. 21, line 54-Col. 22, line 9 further teach this concept using the “current word (candidates at a current state)”.), the temporal next state of each related candidate in a group of the respective groups being calculated before calculating the temporal next state of the candidates in a next group of the respective groups (Col. 19, line 19-Col. 20, line 51 and Fig. 9A teach processing “each of multiple first words (related candidates in a group of the respective groups)” representing “multiple paths” and then processing further lattice portion words in order (before calculating…candidates in a next group));
merging, by the hardware processor, the temporal next state of the related candidates of each group of the respective groups to obtain a plurality of merged temporal next states (Col. 4, line 65-Col. 5, line 20 teach a server (hardware processor) that, as taught in Col. 4, lines 4-44, Col. 17, line 57-Col. 18, line 2 and Figs. 6 and 9A, uses an RNN including “a combination function (such as a pooling function) (merging)” when a node (temporal next state) is “reached through more than one arc (related candidates of each group)” to “determine a speech recognition result” at nodes (plurality of merged temporal next states).), each weighted by a probability of a corresponding candidate in the confusion network (Col. 4, lines 4-44, Col. 6, lines 22-48, Col. 12, line 66-Col. 13, line 11, Col. 20, line 52-Col. 21, line 4, and Col. 23, lines 7-29 teach each lattice, or confusion network, node’s path word scores/probabilities “represent weights (each weighted by a respective occurrence probability)” and using a “weighted pooling function”; since, as taught above, Col. 3, lines 15-27, Col. 6, lines 22-48, and Figs. 2, 6, 9A, and 11 teach that “confusion networks” are used in the teachings in alternative to lattices and that “the resulting…confusion networks” are determined by the ASR.); and
representing multiple candidates with associated confidences, by the hardware processor, using the plurality of merged temporal next states (Col. 4, line 65-Col. 5, line 20 teach a server (hardware processor) that, as taught in Col. 4, lines 4-44, Col. 7, lines 10-28, Col. 14, lines 37-59 and Col. 19, lines 19-51, determines recognition scores/probabilities (representative confidences) for each arc (associated with multiple candidates) when reaching a “speech recognition result” at nodes (plurality of merged temporal next states)).
Ladhak at least implies grouping, by the hardware processor, related candidates identified in the candidate data into respective groups of a confusion network (see mapping above), however Xue teaches grouping, by the hardware processor, related candidates identified in the candidate data into respective groups of a confusion network (section 5 teaches a processor (hardware processor) used for, as taught in section 2 and section 3 paragraphs 5-6, section 3 “Algorithm Improvement”, and Figs. 2a-2b and 7, executing a confusion network on input speech data (candidate data) by “group[ing] time overlapped links into clusters (respective groups) based on their phonetic similarity (related candidates) and word probabilities (related candidates) while preserving the precedence order of the links encoded in the original lattice.”).
Thus it would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to implement Xue’s teachings of clustering phonetically similar speech data using a confusion network into Ladhak’s teaching of automatic speech recognition through word lattices and RNNs in order to improve accuracy of speech recognition utilizing a confusion network (Xue, section 2, section 3 paragraphs 5-6, section 5, and Figs. 2a-2b).
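For illustration only, the claimed sequence of steps mapped above (processing groups in order, calculating a temporal next state per candidate from the shared current state, and merging those states weighted by candidate probability) can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Ladhak, Xue, or the application: the toy cell `rnn_step`, its weights, the function names, the embeddings, and the choice to carry the merged state forward as the next current state are all hypothetical.

```python
import math


def rnn_step(state, x, w_h=0.5, w_x=1.0):
    """Toy Elman-style cell (assumed): elementwise tanh(w_h * h + w_x * x)."""
    return [math.tanh(w_h * h + w_x * xi) for h, xi in zip(state, x)]


def merge_weighted(next_states, probs):
    """Probability-weighted mean of the candidates' next states."""
    total = sum(probs)
    dim = len(next_states[0])
    return [
        sum(p * s[i] for s, p in zip(next_states, probs)) / total
        for i in range(dim)
    ]


def run_confusion_network(groups, embed, init_state):
    """Process groups strictly in order; return one merged state per group."""
    state = init_state
    merged_states = []
    for group in groups:
        # Every candidate in this group is advanced from the same current
        # state before any candidate of the next group is touched.
        next_states = [rnn_step(state, embed[word]) for word, _ in group]
        probs = [prob for _, prob in group]
        # Assumption: the merged state becomes the current state for the next group.
        state = merge_weighted(next_states, probs)
        merged_states.append(state)
    return merged_states


# Hypothetical two-dimensional embeddings and candidate probabilities.
embed = {"hello": [1.0, 0.0], "yellow": [0.0, 1.0], "world": [1.0, 1.0]}
groups = [[("hello", 0.6), ("yellow", 0.4)], [("world", 1.0)]]
merged = run_confusion_network(groups, embed, [0.0, 0.0])
```

Here the first merged state is the 0.6/0.4 weighted mean of the states reached through "hello" and "yellow", and the single candidate of the second group is then advanced from that merged state.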

Regarding claims 2, 13, and 18, the combination of Ladhak and Xue teach all the claim limitations of claims 1, 12, and 17 and further teach updating the current state of the RNN according to the temporal next state of the RNN (Ladhak, Col. 4, lines 26-44 teach updating the scores of the RNN based on the present word of the RNN. Col. 17, line 33-Col. 18, line 25, Col. 21, line 54-Col. 22, line 9 and Figs. 3 and 4 further teach feeding back internal output (temporal next state) of an RNN of a current word (updating the current state of the RNN) “to an internal input of the RNN for determining a numerical representation of a next word” or “set of inputs”).

Regarding claims 3, 14, and 19, the combination of Ladhak and Xue teach all the claim limitations of claims 1, 12, and 17 and further teach the obtaining step includes obtaining a plurality of occurrence probabilities for each of the candidates (Ladhak, Col. 2, line 45-Col. 3, line 2, Col. 19, lines 19-51 and Col. 20, line 52-Col. 21, line 4 teach using the received input data (candidate data) to construct a word lattice with determined (obtained) “corresponding probabilities (occurrence probabilities)” of the “possible word sequences” and “different paths” of each word (for each of the candidates)), each of the occurrence probabilities indicating a probability of occurrence for each of the candidates represented by the candidate data (Ladhak, Col. 2, line 45-Col. 3, line 2, Col. 19, lines 19-51 and Col. 20, line 52-Col. 21, line 4 teach using input data (candidate data) to construct a word lattice with “corresponding probabilities (occurrence probabilities)” of the “possible word sequences” and “different paths” of each word (for each candidate represented by candidate data)),
and wherein the merging step includes calculating a mean value of temporal next states of the candidates (Ladhak, Col. 4, lines 4-44 teach the combination/pooling function cited above in claim 1 (merging step) for determining a resultant node from two or more lattice path words (next states of the candidates) using (calculating) an “average of the sum (mean value)” of the two or more paths), each of the temporal next states weighted by a respective one of the occurrence probabilities for each of the candidates (Ladhak, Col. 4, lines 4-44, Col. 20, line 52-Col. 21, line 4, and Col. 23, lines 7-29 teach lattice path word scores/probabilities .

Regarding claims 5 and 16, the combination of Ladhak and Xue teach all the claim limitations of claims 1 and 12 and further teach the candidate data has a directed graph structure, wherein each edge in the directed graph structure corresponds to one of the candidates from among the plurality of candidates (Ladhak, Col. 2, line 45-Col. 3, line 2, Col. 23, lines 30-47 and Figs. 6 and 9A teach the input data with a corresponding “word lattice” being a “directed acyclic graph (directed graph structure)”. It is further taught that the lattice structure paths (edges) correspond to possible words/sounds (candidates) in the input data (plurality of candidates)).

Regarding claim 6, the combination of Ladhak and Xue teach all the claim limitations of claim 1 and further teach the candidate data has a confusion network structure, and wherein each link in the confusion network structure corresponds to one of the candidates from among the plurality of candidates (Ladhak, Col.6, line 22-Col. 7, line 28 and Figs. 6 and 9A teach a “LVCSR” system that uses audio data (candidate data) to perform a wakeword search in the lattices/“confusion networks (has a confusion network)”. It is further taught that the lattice structure paths (links) correspond to possible words/sounds (candidates) in the input data (plurality of candidates)).
Ladhak at least implies the candidate data has a confusion network structure, and wherein each link in the confusion network structure corresponds to one of the candidates from among the plurality of candidates (see mapping above), however Xue teaches the candidate data has a confusion network structure, and wherein each link in the confusion network structure corresponds to one of the candidates from among the plurality of candidates (section 5 teaches a processor (hardware processor) used for, as taught in section 2 and section 3 paragraphs 5-6 and Figs. 2a-2b, executing a confusion network on input speech data (candidate data has a confusion network structure) by “group[ing] time overlapped links into clusters (respective groups) based on their phonetic similarity (links corresponding to candidates) and word probabilities (links corresponding to candidates) while preserving the precedence order of the links encoded in the original lattice.”).
Thus it would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to implement Xue’s teachings of clustering links of phonetically similar speech data using a confusion network into Ladhak’s teaching of automatic speech recognition through word lattices and RNNs in order to improve accuracy of speech recognition utilizing a confusion network (Xue, section 2, section 3 paragraphs 5-6, section 5, and Figs. 2a-2b).

Regarding claim 7, the combination of Ladhak and Xue teach all the claim limitations of claim 6 and further teach each of the candidates corresponds to a word or a phrase in a text (Ladhak, Col. 2, line 45-Col. 3, line 2 and Col. 4, line 65-Col.5, line 20 and Figs. 6 and 9A teach the audio input data, converted to text, of a user’s .

Regarding claim 8, the combination of Ladhak and Xue teach all the claim limitations of claim 7 and further teach the obtaining candidate data further comprises generating, by speech recognition, speech-to-text data (Ladhak, Col. 2, line 13-Col. 3, line 2, and Col. 6, line 49-Col. 7, line 28 teach using automatic speech recognition to transform “audio data associated with speech into text representative of that speech (generating speech-to-text data)” for determining word lattice paths of possible words (obtain candidate data)).

Regarding claim 9, the combination of Ladhak and Xue teach all the claim limitations of claim 8 and further teach obtaining training data including a candidate data set corresponding to a correct output, wherein the candidate data set includes the candidate data representing the plurality of candidates; and training the RNN based on the training data (Ladhak, Col. 12, lines 24-56, Col. 14, lines 17-36 and Col. 18, line 59-Col.60, line 17 teach training an RNN on “a set of training data (candidate data set/training the RNN based on the training data)”, where training examples in the set consist of “given inputs…associated with known outputs (correspond to correct output)”. It is further taught that the inputs “represent a previous word” and the outputs represent a “potential next word” (candidate data) and can be .

Regarding claim 10, the combination of Ladhak and Xue teach all the claim limitations of claim 7 and further teach calculating output data by processing an output layer of the RNN based on at least the next state of the RNN, wherein the next state of the RNN corresponds to a recurrent layer of the RNN (Ladhak, Col. 12, lines 9-56, Col. 16, lines 25-56 and Fig. 4 teach predicting “the potential next word” being represented by an “output layer (next state of the RNN)” of an RNN by determining (processing) “the output one layer at a time (an output layer of the RNN) until the output layer of the entire network is calculated (calculating output data based on the next state of the RNN corresponding to a recurrent layer of the RNN)”).

Regarding claim 11, the combination of Ladhak and Xue teach all the claim limitations of claim 10 and further teach the output data further comprises at least one selected from the group consisting of an answer of a slot filling problem (Col. 10, line 13-Col. 11, line 24 teach analyzing the user interpreted audio input (output data) associating slots with grammatical tags (selected answers) and fill slots of the framework (slot filling problem)), an answer of key word spotting (Col.6, line 22-Col. 7, line 28 and Col. 25, line 60-Col. 26, line 36 teach interpreting user audio input with wakeword spotting or “keyword spotting”), and translated text (Col. 23, lines 30-47 teach “translating an ASR output (output data) into another language (translated text)”.).


Claims 4, 15, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Ladhak et al (US Patent 10176802), hereinafter Ladhak, in view of Xue et al (“Improved Confusion Network Algorithm and Shortest Path Search from Word Lattice”, 2005), hereinafter Xue, and further in view of Henry et al (US Pub 20170103305), hereinafter Henry.
Regarding claims 4, 15, and 20, the combination of Ladhak and Xue teach all the claim limitations of claims 1, 12, and 17. However, the combination does not explicitly teach the RNN includes a Long Short-Term Memory (LSTM), and each of the current states and the next states includes a hidden state and a cell state.
Henry teaches the RNN includes a Long Short-Term Memory (LSTM), and each of the current states and the next states includes a hidden state and a cell state (paragraphs 0337-0038 and 0383-0384 teach an NNU configured to perform computation being an LSTM in an RNN for “[s]peech recognition” and other applications. Paragraph 0388 teaches that “the cell state of the current time step” can be used in an equation, and paragraph 0389 teaches “the cell state of the time step (next state)”. Paragraph 0115 teaches that the NNU (RNN) performs hidden layer (hidden state) computations for the current layer (current state) and next layer (next state), implying the current and next layers include hidden layers to be computed.).
Thus it would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to modify automatic speech recognition through word lattices and RNNs, as taught by Ladhak as modified by clustering links of phonetically similar speech data using a confusion network as taught by Xue, to include an NNU performing RNN with LSTM cell computations as taught by Henry in order to 

Prior Art
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Laurent et al (“Computer-assisted transcription of speech based on confusion network reordering”, 2011) teach utilizing confusion networks that group “temporally close words into confusion sets” for speech transcription.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to CLINT MULLINAX whose telephone number is 571-272-3241. The examiner can normally be reached on Mon - Fri 8:00-4:30 EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Alexey Shmatov can be reached on 571-270-3428.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for 




/C.M./Examiner, Art Unit 2123


/ALEXEY SHMATOV/Supervisory Patent Examiner, Art Unit 2123