DETAILED ACTION
Introduction
This Office action is in response to Applicant's submission filed on 3/11/2021. Claims 1-14 are pending in the application. As such, Claims 1-14 have been examined.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement
The information disclosure statement (IDS) filed on 4/27/2022 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1-14 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Watanabe et al. (US 9443527 B1) (hereinafter “Watanabe”).

Regarding Claim 1, Watanabe teaches a method implemented by one or more processors, the method comprising: detecting, at a client device, audio data that captures a spoken utterance of a user, wherein the client device is in an environment with one or more additional client devices and is in local communication with the one or more additional client devices via a local network, the one or more additional client devices including at least a first additional client device (Watanabe Column 3 Lines 41-51 - As illustrated in FIG. 2, the device 100 includes a variety of components which may communicate through an address/data bus 224. Each component may also be directly connected to other components in addition to (or instead of) being connected to other components across the bus 224. The ASR device 100 may include an audio capture device 212 for capturing spoken utterances for processing. The audio capture device 212 may include a microphone or other suitable component for capturing sound. The audio capture device 212 may be integrated into the ASR device 100 or may be separate from the ASR device 100.);
processing, at the client device, the audio data using an automatic speech recognition ("ASR") model stored locally at the client device to generate a candidate text representation of the spoken utterance (Watanabe Column 3 Lines 41-51 - As illustrated in FIG. 2, the device 100 includes a variety of components which may communicate through an address/data bus 224. Each component may also be directly connected to other components in addition to (or instead of) being connected to other components across the bus 224. The ASR device 100 may include an audio capture device 212 for capturing spoken utterances for processing. The audio capture device 212 may include a microphone or other suitable component for capturing sound. The audio capture device 212 may be integrated into the ASR device 100 or may be separate from the ASR device 100.);
receiving, at the client device, from the first additional client device and via the local network, a first additional candidate text representation of the spoken utterance, the first additional candidate text representation of the spoken utterance generated locally at the first additional client device is based on (a) the audio data and/or (b) locally detected audio data capturing the spoken utterance detected at the first additional client device, wherein the first additional candidate text representation of the spoken utterance is generated by processing the audio data and/or the locally generated audio data using a first additional ASR model stored locally at the first additional client device (Watanabe Column 5 Lines 19-35 - The device may also include an ASR module 214 for processing spoken audio data into text. The ASR module may be part of a speech processing module 240 or may be a separate component. The ASR module 214 transcribes audio data into text data representing the words of the speech contained in the audio data. The text data may then be used by other components for various purposes, such as executing system commands, inputting data, etc. Audio data including spoken utterances may be processed in real time or may be saved and processed at a later time. A spoken utterance in the audio data is input to the ASR module 214 which then interprets the utterance based on the similarity between the utterance and models known to the ASR module 214. For example, the ASR module 214 may compare the input audio data with models for sounds (e.g., speech units or phonemes) and sequences of sounds to identify words that match the sequence of sounds spoken in the utterance of the audio data.);
and determining a text representation of the spoken utterance based on the candidate text representation of the spoken utterance and the first additional candidate text representation of the spoken utterance generated by the first additional client device (Watanabe Column 5 Lines 19-35 - The device may also include an ASR module 214 for processing spoken audio data into text. The ASR module may be part of a speech processing module 240 or may be a separate component. The ASR module 214 transcribes audio data into text data representing the words of the speech contained in the audio data. The text data may then be used by other components for various purposes, such as executing system commands, inputting data, etc. Audio data including spoken utterances may be processed in real time or may be saved and processed at a later time. A spoken utterance in the audio data is input to the ASR module 214 which then interprets the utterance based on the similarity between the utterance and models known to the ASR module 214. For example, the ASR module 214 may compare the input audio data with models for sounds (e.g., speech units or phonemes) and sequences of sounds to identify words that match the sequence of sounds spoken in the utterance of the audio data.).

Regarding Claim 2, Watanabe teaches all of the limitations of claim 1. Watanabe also teaches that the one or more additional client devices includes at least the first additional client device and a second additional client device (Watanabe Column 3 Lines 20-28 - Multiple ASR devices may be employed in a single speech recognition system. In such a multi-device system, the ASR devices may include different components for performing different aspects of the speech recognition process. The multiple devices may include overlapping components. The ASR device as illustrated in FIG. 2 is exemplary, and may be a stand-alone device or may be included, in whole or in part, as a component of a larger device or system.);
wherein receiving, at the client device, from the first additional client device and via the local network, the first additional candidate text representation further comprises: receiving, at the client device, from the second additional client device and via the local network, a second additional candidate text representation of the spoken utterance generated locally at the second additional client device is based on (a) the audio data and/or (b) additional locally detected audio data capturing the spoken utterance detected at the second additional client device, wherein the second additional candidate text representation of the spoken utterance is generated by processing the audio data and/or the additional locally generated audio data using a second additional ASR model stored locally at the second additional client device (Watanabe Column 5 Lines 19-35 - The device may also include an ASR module 214 for processing spoken audio data into text. The ASR module may be part of a speech processing module 240 or may be a separate component. The ASR module 214 transcribes audio data into text data representing the words of the speech contained in the audio data. The text data may then be used by other components for various purposes, such as executing system commands, inputting data, etc. Audio data including spoken utterances may be processed in real time or may be saved and processed at a later time. A spoken utterance in the audio data is input to the ASR module 214 which then interprets the utterance based on the similarity between the utterance and models known to the ASR module 214. For example, the ASR module 214 may compare the input audio data with models for sounds (e.g., speech units or phonemes) and sequences of sounds to identify words that match the sequence of sounds spoken in the utterance of the audio data.);
and wherein determining the text representation of the spoken utterance based on the candidate text representation of the spoken utterance and the first additional candidate text representation of the spoken utterance generated by the first additional client device further comprises: determining the text representation of the spoken utterance based on the candidate text representation of the spoken utterance, the first additional candidate text representation of the spoken utterance generated by the first additional client device, and the second additional candidate text representation of the spoken utterance generated by the second additional client device (Watanabe Column 5 Lines 19-35 - The device may also include an ASR module 214 for processing spoken audio data into text. The ASR module may be part of a speech processing module 240 or may be a separate component. The ASR module 214 transcribes audio data into text data representing the words of the speech contained in the audio data. The text data may then be used by other components for various purposes, such as executing system commands, inputting data, etc. Audio data including spoken utterances may be processed in real time or may be saved and processed at a later time. A spoken utterance in the audio data is input to the ASR module 214 which then interprets the utterance based on the similarity between the utterance and models known to the ASR module 214. For example, the ASR module 214 may compare the input audio data with models for sounds (e.g., speech units or phonemes) and sequences of sounds to identify words that match the sequence of sounds spoken in the utterance of the audio data.).
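For illustration of the limitation recited above, the following is a minimal sketch of one way a text representation could be determined from a local candidate and two additional candidates. The majority-vote policy, the tie-breaking rule, and the name `combine_candidates` are assumptions made for this example; neither the claim language nor the cited passages of Watanabe specify them.

```python
from collections import Counter
from typing import Sequence


def combine_candidates(local: str, additional: Sequence[str]) -> str:
    """Return the transcript most devices agree on; ties go to the local candidate.

    Hypothetical policy for illustration only.
    """
    candidates = [local, *additional]
    # Normalize case and whitespace so trivially different transcripts count as equal.
    normalized = [" ".join(c.lower().split()) for c in candidates]
    counts = Counter(normalized)
    best_count = max(counts.values())
    # Scan in order so the local candidate (index 0) wins any tie.
    for original, key in zip(candidates, normalized):
        if counts[key] == best_count:
            return original
    return local  # not reached: the local candidate is always in the list
```

With a local candidate of "turn on the light" and two additional candidates that both read "turn on the lights", the sketch returns the transcript shared by the two additional devices.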

Regarding Claim 3, Watanabe teaches all of the limitations of claim 1. Watanabe also teaches that determining the text representation of the spoken utterance based on the candidate text representation of the spoken utterance and the first additional candidate text representation of the spoken utterance generated by the first additional client device comprises: randomly selecting either the candidate text representation of the spoken utterance or the first additional candidate text representation of the spoken utterance (Watanabe Column 9 Lines 60-67 and Column 10 Lines 1-9 - As part of the language modeling (or in other phases of the ASR processing) the speech recognition engine 218 may, to save computational resources, prune and discard low recognition score states or paths that have little likelihood of corresponding to the spoken utterance, either due to low recognition score pursuant to the language model, or for other reasons. Further, during the ASR processing the speech recognition engine 218 may iteratively perform additional processing passes on previously processed utterance portions. Later passes may incorporate results of earlier passes to refine and improve results. As the speech recognition engine 218 determines potential words from the input audio the lattice may become very large as many potential sounds and words are considered as potential matches for the input audio. The potential matches may be illustrated as a word result network representing possible sequences of words that may be recognized and the likelihood of each sequence.);
and determining the text representation of the spoken utterance based on the random selection (Watanabe Column 10 Lines 20-24 - FIG. 7 shows an example of a word result network that may be used by a speech recognition engine 218 for recognizing speech according to some aspects of the present disclosure. A word result network may consist of sequences of words that may be recognized and the likelihood of each sequence. The likelihood of any path in the word result network may be determined by an acoustic model and a language model. In FIG. 7, the paths shown include, for example, “head”, “hello I”, “hello I'm”, “hen”, “help I”, “help I'm”, “hem”, “Henry I”, “Henry I'm”, and “hedge”.).
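For illustration of the random-selection limitation as recited in Claim 3, the following is a minimal sketch. The function name `select_randomly` and the optional seeded generator are assumptions made for this example and are not drawn from the application or from Watanabe.

```python
import random
from typing import Optional


def select_randomly(local: str, additional: str,
                    rng: Optional[random.Random] = None) -> str:
    """Pick one of the two candidate text representations uniformly at random."""
    # A caller may pass a seeded generator to make the choice reproducible.
    rng = rng or random.Random()
    return rng.choice([local, additional])
```

The selected candidate would then serve as the text representation of the spoken utterance.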

Regarding Claim 4, Watanabe teaches all of the limitations of claim 1. Watanabe also teaches that determining the text representation of the spoken utterance based on the candidate text representation of the spoken utterance and the first additional candidate text representation of the spoken utterance generated by the first additional client device comprises: determining a confidence score of the candidate text representation indicating a probability that the candidate text representation is the text representation, where the confidence score is based on one or more device parameters of the client device (Watanabe Column 5 Lines 36-50 - The different ways a spoken utterance may be interpreted may each be assigned a probability or a recognition score representing the likelihood that a particular set of words matches those spoken in the utterance. The recognition score may be based on a number of factors including, for example, the similarity of the sound in the utterance to models for language sounds (e.g., an acoustic model), and the likelihood that a particular word which matches the sounds would be included in the sentence at the specific location (e.g., using a language model or grammar). Based on the considered factors and the assigned recognition score, the ASR module 214 may output the most likely words recognized in the audio data. The ASR module 214 may also output multiple alternative recognized words in the form of a lattice or an N-best list (described in more detail below).);
determining an additional confidence score of the additional candidate text representation indicating an additional probability that the additional candidate text representation is the text representation, where the additional confidence score is based on one or more additional device parameters of the additional client device (Watanabe Column 5 Lines 36-50 - The different ways a spoken utterance may be interpreted may each be assigned a probability or a recognition score representing the likelihood that a particular set of words matches those spoken in the utterance. The recognition score may be based on a number of factors including, for example, the similarity of the sound in the utterance to models for language sounds (e.g., an acoustic model), and the likelihood that a particular word which matches the sounds would be included in the sentence at the specific location (e.g., using a language model or grammar). Based on the considered factors and the assigned recognition score, the ASR module 214 may output the most likely words recognized in the audio data. The ASR module 214 may also output multiple alternative recognized words in the form of a lattice or an N-best list (described in more detail below).);
comparing the confidence score and the additional confidence score (Watanabe Column 5 Lines 36-50 - The different ways a spoken utterance may be interpreted may each be assigned a probability or a recognition score representing the likelihood that a particular set of words matches those spoken in the utterance. The recognition score may be based on a number of factors including, for example, the similarity of the sound in the utterance to models for language sounds (e.g., an acoustic model), and the likelihood that a particular word which matches the sounds would be included in the sentence at the specific location (e.g., using a language model or grammar). Based on the considered factors and the assigned recognition score, the ASR module 214 may output the most likely words recognized in the audio data. The ASR module 214 may also output multiple alternative recognized words in the form of a lattice or an N-best list (described in more detail below).);
and determining the text representation of the spoken utterance based on the comparing (Watanabe Column 5 Lines 36-50 - The different ways a spoken utterance may be interpreted may each be assigned a probability or a recognition score representing the likelihood that a particular set of words matches those spoken in the utterance. The recognition score may be based on a number of factors including, for example, the similarity of the sound in the utterance to models for language sounds (e.g., an acoustic model), and the likelihood that a particular word which matches the sounds would be included in the sentence at the specific location (e.g., using a language model or grammar). Based on the considered factors and the assigned recognition score, the ASR module 214 may output the most likely words recognized in the audio data. The ASR module 214 may also output multiple alternative recognized words in the form of a lattice or an N-best list (described in more detail below).). 
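For illustration of the comparison recited in Claim 4, the following is a minimal sketch in which each candidate carries a recognition score and a device-dependent weight. The `device_weight` field, the product used to form the confidence score, and the tie-breaking rule are hypothetical choices made for this example; the cited passage of Watanabe describes recognition scores generally and does not prescribe this computation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredCandidate:
    text: str
    recognition_score: float  # probability in [0, 1] reported by the device's ASR
    device_weight: float      # hypothetical device parameter in [0, 1]

    def confidence(self) -> float:
        # Assumed combination: scale the recognition score by the device weight.
        return self.recognition_score * self.device_weight


def pick_by_confidence(local: ScoredCandidate, additional: ScoredCandidate) -> str:
    """Return the text of the candidate with the higher confidence score."""
    # Ties keep the local candidate.
    if local.confidence() >= additional.confidence():
        return local.text
    return additional.text
```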

Regarding Claim 5, Watanabe teaches all of the limitations of claim 1. Watanabe also teaches that determining the text representation of the spoken utterance based on the candidate text representation of the spoken utterance and the first additional candidate text representation of the spoken utterance generated by the first additional client device comprises: determining an audio quality value indicating the quality of the audio data that captures the spoken utterance detected at the client device (Watanabe Column 5 Lines 51-60 - While a recognition score may represent a probability that a portion of audio data corresponds to a particular phoneme or word, the recognition score may also incorporate other information which indicates the ASR processing quality of the scored audio data relative to the ASR processing of other audio data. A recognition score may be represented as a number on a scale from 1 to 100, as a probability from 0 to 1, a log probability or other indicator. A recognition score may indicate a relative confidence that a section of audio data corresponds to a particular phoneme, word, etc.);
determining an additional audio quality value indicating the quality of the additional audio data capturing the spoken utterance detected at the first additional client device (Watanabe Column 5 Lines 51-60 - While a recognition score may represent a probability that a portion of audio data corresponds to a particular phoneme or word, the recognition score may also incorporate other information which indicates the ASR processing quality of the scored audio data relative to the ASR processing of other audio data. A recognition score may be represented as a number on a scale from 1 to 100, as a probability from 0 to 1, a log probability or other indicator. A recognition score may indicate a relative confidence that a section of audio data corresponds to a particular phoneme, word, etc.);
comparing the audio quality value and the additional audio quality value (Watanabe Column 5 Lines 51-60 - While a recognition score may represent a probability that a portion of audio data corresponds to a particular phoneme or word, the recognition score may also incorporate other information which indicates the ASR processing quality of the scored audio data relative to the ASR processing of other audio data. A recognition score may be represented as a number on a scale from 1 to 100, as a probability from 0 to 1, a log probability or other indicator. A recognition score may indicate a relative confidence that a section of audio data corresponds to a particular phoneme, word, etc.);
and determining the text representation of the spoken utterance based on the comparing (Watanabe Column 5 Lines 51-60 - While a recognition score may represent a probability that a portion of audio data corresponds to a particular phoneme or word, the recognition score may also incorporate other information which indicates the ASR processing quality of the scored audio data relative to the ASR processing of other audio data. A recognition score may be represented as a number on a scale from 1 to 100, as a probability from 0 to 1, a log probability or other indicator. A recognition score may indicate a relative confidence that a section of audio data corresponds to a particular phoneme, word, etc.).
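The cited passage notes that a recognition score may be expressed on a scale from 1 to 100, as a probability from 0 to 1, or as a log probability. For illustration, the following is a minimal sketch of bringing such scores onto a common probability scale before two values are compared. The function names and the linear mapping of the 1-to-100 scale are assumptions made for this example; Watanabe does not state how the representations relate to one another.

```python
import math


def from_log_probability(log_p: float) -> float:
    """Convert a natural-log probability to a probability in (0, 1]."""
    return math.exp(log_p)


def from_scale_1_to_100(score: float) -> float:
    """Map a 1-to-100 score linearly onto [0, 1] (assumed mapping)."""
    return (score - 1.0) / 99.0


def higher_value(first: float, second: float) -> str:
    """Compare two values already on the common scale; ties favor the first."""
    return "first" if first >= second else "second"
```

Once both values are on the same scale, a direct numeric comparison of the kind recited in the claim is meaningful.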

Regarding Claim 6, Watanabe teaches all of the limitations of claim 1. Watanabe also teaches that determining the text representation of the spoken utterance based on the candidate text representation of the spoken utterance and the first additional candidate text representation of the spoken utterance generated by the first additional client device comprises: determining an ASR quality value indicating the quality of the ASR model stored locally at the client device (Watanabe Column 5 Lines 51-60 - While a recognition score may represent a probability that a portion of audio data corresponds to a particular phoneme or word, the recognition score may also incorporate other information which indicates the ASR processing quality of the scored audio data relative to the ASR processing of other audio data. A recognition score may be represented as a number on a scale from 1 to 100, as a probability from 0 to 1, a log probability or other indicator. A recognition score may indicate a relative confidence that a section of audio data corresponds to a particular phoneme, word, etc.);
determining an additional ASR quality value indicating the quality of the additional ASR model stored locally at the additional client device (Watanabe Column 5 Lines 51-60 - While a recognition score may represent a probability that a portion of audio data corresponds to a particular phoneme or word, the recognition score may also incorporate other information which indicates the ASR processing quality of the scored audio data relative to the ASR processing of other audio data. A recognition score may be represented as a number on a scale from 1 to 100, as a probability from 0 to 1, a log probability or other indicator. A recognition score may indicate a relative confidence that a section of audio data corresponds to a particular phoneme, word, etc.);
comparing the ASR quality value and the additional ASR quality value (Watanabe Column 5 Lines 51-60 - While a recognition score may represent a probability that a portion of audio data corresponds to a particular phoneme or word, the recognition score may also incorporate other information which indicates the ASR processing quality of the scored audio data relative to the ASR processing of other audio data. A recognition score may be represented as a number on a scale from 1 to 100, as a probability from 0 to 1, a log probability or other indicator. A recognition score may indicate a relative confidence that a section of audio data corresponds to a particular phoneme, word, etc.);
and determining the text representation of the spoken utterance based on the comparing (Watanabe Column 5 Lines 51-60 - While a recognition score may represent a probability that a portion of audio data corresponds to a particular phoneme or word, the recognition score may also incorporate other information which indicates the ASR processing quality of the scored audio data relative to the ASR processing of other audio data. A recognition score may be represented as a number on a scale from 1 to 100, as a probability from 0 to 1, a log probability or other indicator. A recognition score may indicate a relative confidence that a section of audio data corresponds to a particular phoneme, word, etc.).

Regarding Claim 7, Watanabe teaches all of the limitations of claim 1. Watanabe also teaches that the first additional candidate text representation of the spoken utterance includes a plurality of hypotheses, and wherein determining the text representation of the spoken utterance based on the candidate text representation of the spoken utterance and the first additional candidate text representation of the spoken utterance generated by the first additional client device comprises: reranking the plurality of hypotheses using the client device (Watanabe Column 8 Lines 20-36 - Transitions between states may also have an associated probability, representing a likelihood that a current state may be reached from a previous state. Sounds received may be represented as paths between states of the HMM and multiple paths may represent multiple possible text matches for the same sound. Each phoneme may be represented by multiple potential states corresponding to different known pronunciations of the phonemes and their parts (such as the beginning, middle, and end of a spoken language sound). An initial determination of a probability of a potential phoneme may be associated with one state. As new feature vectors are processed by the speech recognition engine 218, the state may change or stay the same, based on the processing of the new feature vectors. A Viterbi algorithm may be used to find the most likely sequence of states based on the processed feature vectors.);
and determining the text representation of the spoken utterance based on the candidate text representation of the spoken utterance and the reranked plurality of hypotheses (Watanabe Column 8 Lines 20-36 - Transitions between states may also have an associated probability, representing a likelihood that a current state may be reached from a previous state. Sounds received may be represented as paths between states of the HMM and multiple paths may represent multiple possible text matches for the same sound. Each phoneme may be represented by multiple potential states corresponding to different known pronunciations of the phonemes and their parts (such as the beginning, middle, and end of a spoken language sound). An initial determination of a probability of a potential phoneme may be associated with one state. As new feature vectors are processed by the speech recognition engine 218, the state may change or stay the same, based on the processing of the new feature vectors. A Viterbi algorithm may be used to find the most likely sequence of states based on the processed feature vectors.).
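The cited passage refers to a Viterbi algorithm for finding the most likely sequence of HMM states. For illustration, the following is a minimal sketch of the textbook Viterbi recursion over a small discrete HMM. The function signature, the dictionary-based model layout, and the requirement of a non-empty observation sequence are assumptions made for this example; it is not asserted to be the implementation used by the speech recognition engine 218.

```python
from typing import Dict, List, Sequence, Tuple


def viterbi(
    observations: Sequence[str],
    states: Sequence[str],
    start_p: Dict[str, float],
    trans_p: Dict[str, Dict[str, float]],
    emit_p: Dict[str, Dict[str, float]],
) -> Tuple[List[str], float]:
    """Return the most likely state sequence and its probability.

    Assumes at least one observation.
    """
    # best[t][s] = (probability of the best path ending in s at time t, predecessor)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        prev = best[-1]
        column = {}
        for s in states:
            # Choose the predecessor that maximizes the path probability into s.
            column[s] = max(
                ((prev[p][0] * trans_p[p][s] * emit_p[s][obs], p) for p in states),
                key=lambda pair: pair[0],
            )
        best.append(column)
    # Trace back from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    prob = best[-1][last][0]
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, prob
```

In a speech setting the states would correspond to phoneme states and the observations to quantized feature vectors; the sketch uses plain string labels for both.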

Regarding Claim 8, Watanabe teaches all of the limitations of claim 1. Watanabe also teaches that prior to receiving, at the client device, from the first additional client device and via the local network, the first additional candidate text representation of the spoken utterance, and further comprising: determining whether to generate the first additional candidate representation of the spoken utterance locally at the first additional client device based on (a) the audio data and/or (b) the locally detected audio data capturing the spoken utterance detected at the first additional client device, wherein determining whether to generate the first additional candidate representation of the spoken utterance locally at the first additional client device based on (a) the audio data and/or (b) the locally detected audio data capturing the spoken utterance detected at the first additional client device comprises: determining an audio quality value indicating the quality of the audio data that captures the spoken utterance detected at the client device (Watanabe Column 5 Lines 51-60 - While a recognition score may represent a probability that a portion of audio data corresponds to a particular phoneme or word, the recognition score may also incorporate other information which indicates the ASR processing quality of the scored audio data relative to the ASR processing of other audio data. A recognition score may be represented as a number on a scale from 1 to 100, as a probability from 0 to 1, a log probability or other indicator. A recognition score may indicate a relative confidence that a section of audio data corresponds to a particular phoneme, word, etc.);
determining an additional audio quality value indicating the quality of the locally detected audio data capturing the spoken utterance detected at the first additional client device (Watanabe Column 5 Lines 51-60 - While a recognition score may represent a probability that a portion of audio data corresponds to a particular phoneme or word, the recognition score may also incorporate other information which indicates the ASR processing quality of the scored audio data relative to the ASR processing of other audio data. A recognition score may be represented as a number on a scale from 1 to 100, as a probability from 0 to 1, a log probability or other indicator. A recognition score may indicate a relative confidence that a section of audio data corresponds to a particular phoneme, word, etc.);
comparing the audio quality value and the additional audio quality value (Watanabe Column 5 Lines 51-60 - While a recognition score may represent a probability that a portion of audio data corresponds to a particular phoneme or word, the recognition score may also incorporate other information which indicates the ASR processing quality of the scored audio data relative to the ASR processing of other audio data. A recognition score may be represented as a number on a scale from 1 to 100, as a probability from 0 to 1, a log probability or other indicator. A recognition score may indicate a relative confidence that a section of audio data corresponds to a particular phoneme, word, etc.);
and determining whether to generate the first additional candidate representation of the spoken utterance locally at the first additional client device based on (a) the audio data and/or (b) the locally detected audio data capturing the spoken utterance detected at the first additional client device based on the comparing (Watanabe Column 5 Lines 51-60 - While a recognition score may represent a probability that a portion of audio data corresponds to a particular phoneme or word, the recognition score may also incorporate other information which indicates the ASR processing quality of the scored audio data relative to the ASR processing of other audio data. A recognition score may be represented as a number on a scale from 1 to 100, as a probability from 0 to 1, a log probability or other indicator. A recognition score may indicate a relative confidence that a section of audio data corresponds to a particular phoneme, word, etc.).

Regarding Claim 9, Watanabe teaches all of the limitations of claim 8. Watanabe also teaches that determining the audio quality value indicating the quality of the audio data capturing the spoken utterance detected at the client device comprises: identifying one or more microphones of the client device (Watanabe Column 3 Lines 47-49 - The audio capture device 212 may include a microphone or other suitable component for capturing sound.);
and determining the audio quality value based on the one or more microphones of the client device (Watanabe Column 6 Lines 5-8 - Various settings of the audio capture device 212 and/or input/output device interfaces 202 may be configured to adjust the audio data based on traditional tradeoffs of quality versus data size or other considerations. An input device interface that is mentioned here is the microphone.);
and wherein determining the additional audio quality value indicating the quality of the locally detected audio data capturing the spoken utterance detected at the first additional client device comprises: identifying one or more first additional microphones of the first additional client device (Watanabe Column 3 Lines 47-49 - The audio capture device 212 may include a microphone or other suitable component for capturing sound.);
and determining the additional audio quality value based on the one or more first additional microphones of the first additional client device (Watanabe Column 6 Lines 5-8 - Various settings of the audio capture device 212 and/or input/output device interfaces 202 may be configured to adjust the audio data based on traditional tradeoffs of quality versus data size or other considerations. An input device interface that is mentioned here is the microphone.).

Regarding Claim 10, Watanabe teaches all of the limitations of claim 8. Watanabe also teaches that determining the audio quality value indicating the quality of the audio data capturing the spoken utterance detected at the client device comprises: generating a signal to noise ratio value based on processing the audio data capturing the spoken utterance (Watanabe Column 6 Lines 25-37 - Received audio data may be sent to the AFE 216 for processing. The AFE 216 may reduce noise in the audio data, identify parts of the audio data containing speech for processing, and segment or portion and process the identified speech components. The AFE 216 may divide the digitized audio data into frames or audio segments, with each frame representing a time interval, for example 10 milliseconds (ms). During that frame the AFE 216 determines a set of values, called a feature vector, representing the features/qualities of the utterance portion within the frame. Feature vectors may contain a varying number of values, for example forty. The feature vector may represent different qualities of the audio data within the frame.);
and determining the audio quality value based on the signal to noise ratio value (Watanabe Column 6 Lines 37-41 - FIG. 4 shows a digitized audio data waveform 402, with multiple points 406 of the first word 404 as the first word 404 is being processed. The audio qualities of those points may be stored into feature vectors.);
and wherein determining the additional audio quality value indicating the quality of the locally detected audio data capturing the spoken utterance detected at the first additional client device comprises: generating an additional signal to noise ratio value based on processing the audio data capturing the spoken utterance (Watanabe Column 6 Lines 25-37 - Received audio data may be sent to the AFE 216 for processing. The AFE 216 may reduce noise in the audio data, identify parts of the audio data containing speech for processing, and segment or portion and process the identified speech components. The AFE 216 may divide the digitized audio data into frames or audio segments, with each frame representing a time interval, for example 10 milliseconds (ms). During that frame the AFE 216 determines a set of values, called a feature vector, representing the features/qualities of the utterance portion within the frame. Feature vectors may contain a varying number of values, for example forty. The feature vector may represent different qualities of the audio data within the frame.);
and determining the additional audio quality value based on the additional signal to noise ratio value (Watanabe Column 6 Lines 37-41 - FIG. 4 shows a digitized audio data waveform 402, with multiple points 406 of the first word 404 as the first word 404 is being processed. The audio qualities of those points may be stored into feature vectors.).
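For illustration only, the signal to noise ratio limitation of Claim 10 can be sketched as follows. This sketch is not taken from Watanabe or from the application; it assumes 10 ms frames (the interval given in the cited passage), a hypothetical noise estimate taken from the lowest-energy 20% of frames, and a hypothetical linear mapping of 0 to 40 dB onto an audio quality value between 0 and 1. The function names are assumptions.

```python
import math


def frame_energies(samples, sample_rate, frame_ms=10):
    """Split audio into fixed-length frames and return the mean energy of each."""
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def snr_db(samples, sample_rate, noise_fraction=0.2):
    """Estimate SNR in dB, treating the quietest frames as noise (an assumption)."""
    energies = sorted(frame_energies(samples, sample_rate))
    if not energies:
        raise ValueError("audio is shorter than one frame")
    n_noise = max(1, int(len(energies) * noise_fraction))
    noise = sum(energies[:n_noise]) / n_noise
    signal = sum(energies[n_noise:]) / max(1, len(energies) - n_noise)
    eps = 1e-12  # guards against division by zero and log of zero
    return 10.0 * math.log10((signal + eps) / (noise + eps))


def audio_quality_value(samples, sample_rate, max_db=40.0):
    """Map the SNR estimate onto [0, 1]; the 40 dB ceiling is hypothetical."""
    return min(1.0, max(0.0, snr_db(samples, sample_rate) / max_db))
```

Under these assumptions, each device would compute its own value from its own captured audio, and the two values would then be compared as recited in Claim 8.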

Regarding Claim 11, Watanabe teaches all of the limitations of claim 1. Watanabe also teaches that prior to receiving, at the client device, from the first additional client device and via the local network, a first additional candidate text representation of the spoken utterance, determining whether to transmit a request for the first additional candidate text representation of the spoken utterance to the first additional client device (Watanabe Column 6 Lines 65-67 and Column 7 Lines 1-5 - The speech recognition engine 218 may process the output from the AFE 216 with reference to information stored in the speech storage 220. Alternatively, post front-end processed data (such as feature vectors) may be received by the ASR module 214 from another source besides the internal AFE 216. For example, another entity may process audio data into feature vectors and transmit that information to the ASR device 100 through the input device(s) 206.);
in response to determining to transmit the request for the first additional candidate text representation of the spoken utterance to the first additional client device, transmitting the request for the first additional candidate text representation of the spoken utterance to the first additional client device (Watanabe Column 6 Lines 65-67 and Column 7 Lines 1-5 - The speech recognition engine 218 may process the output from the AFE 216 with reference to information stored in the speech storage 220. Alternatively, post front-end processed data (such as feature vectors) may be received by the ASR module 214 from another source besides the internal AFE 216. For example, another entity may process audio data into feature vectors and transmit that information to the ASR device 100 through the input device(s) 206.).

Regarding Claim 12, Watanabe teaches all of the limitations of claim 11. Watanabe also teaches that determining whether to transmit the request for the first additional candidate text representation of the spoken utterance to the first additional client device comprises: determining a hotword confidence score based on processing at least a portion of the audio data that captures the spoken utterance of the user using a hotword model, wherein the hotword confidence score indicates a probability of whether at least the portion of the audio data includes a hotword (Watanabe Column 5 Lines 36-48 - The different ways a spoken utterance may be interpreted may each be assigned a probability or a recognition score representing the likelihood that a particular set of words matches those spoken in the utterance. The recognition score may be based on a number of factors including, for example, the similarity of the sound in the utterance to models for language sounds (e.g., an acoustic model), and the likelihood that a particular word which matches the sounds would be included in the sentence at the specific location (e.g., using a language model or grammar). Based on the considered factors and the assigned recognition score, the ASR module 214 may output the most likely words recognized in the audio data.);
determining whether the hotword confidence score satisfies one or more conditions, wherein determining whether the hotword confidence score satisfies the one or more conditions comprises determining whether the hotword confidence score satisfies a threshold value (Watanabe Column 5 Lines 36-48 - The different ways a spoken utterance may be interpreted may each be assigned a probability or a recognition score representing the likelihood that a particular set of words matches those spoken in the utterance. The recognition score may be based on a number of factors including, for example, the similarity of the sound in the utterance to models for language sounds (e.g., an acoustic model), and the likelihood that a particular word which matches the sounds would be included in the sentence at the specific location (e.g., using a language model or grammar). Based on the considered factors and the assigned recognition score, the ASR module 214 may output the most likely words recognized in the audio data.);
in response to determining the hotword confidence score satisfies a threshold value, determining whether the hotword confidence score indicates a weak probability that at least the portion of the audio data includes the hotword (Watanabe Column 5 Lines 36-48 - The different ways a spoken utterance may be interpreted may each be assigned a probability or a recognition score representing the likelihood that a particular set of words matches those spoken in the utterance. The recognition score may be based on a number of factors including, for example, the similarity of the sound in the utterance to models for language sounds (e.g., an acoustic model), and the likelihood that a particular word which matches the sounds would be included in the sentence at the specific location (e.g., using a language model or grammar). Based on the considered factors and the assigned recognition score, the ASR module 214 may output the most likely words recognized in the audio data.);
and in response to determining the hotword confidence score indicates the weak probability that the at least the portion of the audio data includes the hotword, determining to transmit the request for the first additional candidate text representation of the spoken utterance to the first additional client device (Watanabe Column 5 Lines 36-48 - The different ways a spoken utterance may be interpreted may each be assigned a probability or a recognition score representing the likelihood that a particular set of words matches those spoken in the utterance. The recognition score may be based on a number of factors including, for example, the similarity of the sound in the utterance to models for language sounds (e.g., an acoustic model), and the likelihood that a particular word which matches the sounds would be included in the sentence at the specific location (e.g., using a language model or grammar). Based on the considered factors and the assigned recognition score, the ASR module 214 may output the most likely words recognized in the audio data.).
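For illustration only, the decision sequence recited in Claim 12 can be sketched as a two-threshold check. This sketch is not taken from Watanabe or from the application; the function name and both threshold values are hypothetical, and the score is assumed to be a probability between 0 and 1.

```python
def should_request_additional_candidate(hotword_score,
                                        accept_threshold=0.5,
                                        strong_threshold=0.8):
    """Return True when the hotword was detected, but only weakly.

    accept_threshold and strong_threshold are hypothetical values chosen
    for illustration; they do not come from the claims or the cited art.
    """
    if not 0.0 <= hotword_score <= 1.0:
        raise ValueError("hotword_score must be a probability in [0, 1]")
    if hotword_score < accept_threshold:
        # The score does not satisfy the threshold: no hotword, no request.
        return False
    # The score satisfies the threshold; request help only if it is weak.
    return hotword_score < strong_threshold
```

Under these assumed values, a score of 0.6 would trigger a request to the first additional client device, while scores of 0.3 (no hotword) and 0.9 (strong detection) would not.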

Regarding Claim 13, Watanabe teaches a non-transitory computer-readable medium configured to store instructions that, when executed by one or more processors, cause the one or more processors to perform operations that include (Watanabe Column 20 Lines 21-26 - a computer-readable storage medium, representing remote, local, fixed, and/or removable storage devices as well as storage media for temporarily and/or more permanently containing, storing, transmitting, and retrieving computer-readable information.):
detecting, at a client device, audio data that captures a spoken utterance of a user, wherein the client device is in an environment with one or more additional client devices and is in local communication with the one or more additional client devices via a local network, the one or more additional client devices including at least a first additional client device (Watanabe Column 3 Lines 41-51 - As illustrated in FIG. 2, the device 100 includes a variety of components which may communicate through an address/data bus 224. Each component may also be directly connected to other components in addition to (or instead of) being connected to other components across the bus 224. The ASR device 100 may include an audio capture device 212 for capturing spoken utterances for processing. The audio capture device 212 may include a microphone or other suitable component for capturing sound. The audio capture device 212 may be integrated into the ASR device 100 or may be separate from the ASR device 100.);
processing, at the client device, the audio data using an automatic speech recognition ("ASR") model stored locally at the client device to generate a candidate text representation of the spoken utterance (Watanabe Column 3 Lines 41-51 - As illustrated in FIG. 2, the device 100 includes a variety of components which may communicate through an address/data bus 224. Each component may also be directly connected to other components in addition to (or instead of) being connected to other components across the bus 224. The ASR device 100 may include an audio capture device 212 for capturing spoken utterances for processing. The audio capture device 212 may include a microphone or other suitable component for capturing sound. The audio capture device 212 may be integrated into the ASR device 100 or may be separate from the ASR device 100.);
receiving, at the client device, from the first additional client device and via the local network, a first additional candidate text representation of the spoken utterance, the first additional candidate text representation of the spoken utterance generated locally at the first additional client device is based on (a) the audio data and/or (b) locally detected audio data capturing the spoken utterance detected at the first additional client device, wherein the first additional candidate text representation of the spoken utterance is generated by processing the audio data and/or the locally generated audio data using a first additional ASR model stored locally at the first additional client device (Watanabe Column 5 Lines 19-35 - The device may also include an ASR module 214 for processing spoken audio data into text. The ASR module may be part of a speech processing module 240 or may be a separate component. The ASR module 214 transcribes audio data into text data representing the words of the speech contained in the audio data. The text data may then be used by other components for various purposes, such as executing system commands, inputting data, etc. Audio data including spoken utterances may be processed in real time or may be saved and processed at a later time. A spoken utterance in the audio data is input to the ASR module 214 which then interprets the utterance based on the similarity between the utterance and models known to the ASR module 214. For example, the ASR module 214 may compare the input audio data with models for sounds (e.g., speech units or phonemes) and sequences of sounds to identify words that match the sequence of sounds spoken in the utterance of the audio data.);
and determining a text representation of the spoken utterance based on the candidate text representation of the spoken utterance and the first additional candidate text representation of the spoken utterance generated by the first additional client device (Watanabe Column 5 Lines 19-35 - The device may also include an ASR module 214 for processing spoken audio data into text. The ASR module may be part of a speech processing module 240 or may be a separate component. The ASR module 214 transcribes audio data into text data representing the words of the speech contained in the audio data. The text data may then be used by other components for various purposes, such as executing system commands, inputting data, etc. Audio data including spoken utterances may be processed in real time or may be saved and processed at a later time. A spoken utterance in the audio data is input to the ASR module 214 which then interprets the utterance based on the similarity between the utterance and models known to the ASR module 214. For example, the ASR module 214 may compare the input audio data with models for sounds (e.g., speech units or phonemes) and sequences of sounds to identify words that match the sequence of sounds spoken in the utterance of the audio data.).

Regarding Claim 14, Watanabe teaches a system, comprising: one or more processors (Watanabe Column 3 Lines 52-55 - The device 100 includes one or more controllers/processors 204 for processing data and computer-readable instructions, and a memory 206 for storing data and processor-executable instructions.);
and memory configured to store instructions that, when executed by one or more processors, cause the one or more processors to perform operations that include: detecting, at a client device, audio data that captures a spoken utterance of a user, wherein the client device is in an environment with one or more additional client devices and is in local communication with the one or more additional client devices via a local network, the one or more additional client devices including at least a first additional client device (Watanabe Column 3 Lines 41-51 - As illustrated in FIG. 2, the device 100 includes a variety of components which may communicate through an address/data bus 224. Each component may also be directly connected to other components in addition to (or instead of) being connected to other components across the bus 224. The ASR device 100 may include an audio capture device 212 for capturing spoken utterances for processing. The audio capture device 212 may include a microphone or other suitable component for capturing sound. The audio capture device 212 may be integrated into the ASR device 100 or may be separate from the ASR device 100.);
processing, at the client device, the audio data using an automatic speech recognition ("ASR") model stored locally at the client device to generate a candidate text representation of the spoken utterance (Watanabe Column 3 Lines 41-51 - As illustrated in FIG. 2, the device 100 includes a variety of components which may communicate through an address/data bus 224. Each component may also be directly connected to other components in addition to (or instead of) being connected to other components across the bus 224. The ASR device 100 may include an audio capture device 212 for capturing spoken utterances for processing. The audio capture device 212 may include a microphone or other suitable component for capturing sound. The audio capture device 212 may be integrated into the ASR device 100 or may be separate from the ASR device 100.);
receiving, at the client device, from the first additional client device and via the local network, a first additional candidate text representation of the spoken utterance, the first additional candidate text representation of the spoken utterance generated locally at the first additional client device is based on (a) the audio data and/or (b) locally detected audio data capturing the spoken utterance detected at the first additional client device, wherein the first additional candidate text representation of the spoken utterance is generated by processing the audio data and/or the locally generated audio data using a first additional ASR model stored locally at the first additional client device (Watanabe Column 5 Lines 19-35 - The device may also include an ASR module 214 for processing spoken audio data into text. The ASR module may be part of a speech processing module 240 or may be a separate component. The ASR module 214 transcribes audio data into text data representing the words of the speech contained in the audio data. The text data may then be used by other components for various purposes, such as executing system commands, inputting data, etc. Audio data including spoken utterances may be processed in real time or may be saved and processed at a later time. A spoken utterance in the audio data is input to the ASR module 214 which then interprets the utterance based on the similarity between the utterance and models known to the ASR module 214. For example, the ASR module 214 may compare the input audio data with models for sounds (e.g., speech units or phonemes) and sequences of sounds to identify words that match the sequence of sounds spoken in the utterance of the audio data.);
and determining a text representation of the spoken utterance based on the candidate text representation of the spoken utterance and the first additional candidate text representation of the spoken utterance generated by the first additional client device (Watanabe Column 5 Lines 19-35 - The device may also include an ASR module 214 for processing spoken audio data into text. The ASR module may be part of a speech processing module 240 or may be a separate component. The ASR module 214 transcribes audio data into text data representing the words of the speech contained in the audio data. The text data may then be used by other components for various purposes, such as executing system commands, inputting data, etc. Audio data including spoken utterances may be processed in real time or may be saved and processed at a later time. A spoken utterance in the audio data is input to the ASR module 214 which then interprets the utterance based on the similarity between the utterance and models known to the ASR module 214. For example, the ASR module 214 may compare the input audio data with models for sounds (e.g., speech units or phonemes) and sequences of sounds to identify words that match the sequence of sounds spoken in the utterance of the audio data.).
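For illustration only, the final step recited in Claims 1, 13, and 14 (determining a text representation from the local candidate and the additional candidate) can be sketched as a selection by recognition score. Neither the claims nor the cited passages specify this selection rule; picking the highest-scoring candidate, the `Candidate` type, and its field names are assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Candidate:
    """A candidate text representation with a hypothetical score in [0, 1]."""
    text: str
    score: float
    source: str  # e.g. "client" or "first_additional_client"


def determine_text_representation(local: Candidate,
                                  additional: Iterable[Candidate]) -> str:
    """Return the text of the highest-scoring candidate.

    The local candidate is listed first, so it wins ties because max()
    returns the first maximal element.
    """
    candidates = [local, *additional]
    return max(candidates, key=lambda c: c.score).text
```

A usage example under these assumptions: given a local candidate "turn of the lights" scored 0.62 and an additional candidate "turn off the lights" scored 0.91, the function would return "turn off the lights".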

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: D’Amato et al. (US 20210035561 A1), Inose et al. (US 20140303969 A1), and Alfred et al. (US 8244543 B2).
D’Amato et al. (US 20210035561 A1) teaches that a “playback device transmits the input sound data to a second playback device over a local area network, the second playback device employing a second local NLU with a second predetermined library of keywords” (D’Amato – Abstract).
Inose et al. (US 20140303969 A1) teaches “a speech recognition control device has a plurality of microphones placed at different positions, a speech transmission control unit, and a speech recognition execution control unit” (Inose – Abstract).
Alfred et al. (US 8244543 B2) teaches “a system, method and computer-readable medium for using speech recognition to control devices connected to a network” (Alfred – Abstract).
Please see the additional references listed in form PTO-892 for more details.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to UTHEJ KUNAMNENI whose telephone number is (571) 272-5428. The examiner can normally be reached M-F 8:00-5:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached at (571) 272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/UTHEJ KUNAMNENI/ Examiner, Art Unit 2656
/EDGAR X GUERRA-ERAZO/ Primary Examiner, Art Unit 2656