Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on December 7, 2020 is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.
Response to Arguments
The amendment filed on June 27, 2022 has been entered. Claims 1-6, 8-10, 13, 15-17, 19 and 21-27 remain pending in the application.
The applicant argues that the reference applied in the previous Office action does not anticipate “using a speech recognition model stored locally at the client device” and “updating one or more weights of the speech recognition model based on the generated gradient”. As discussed in the interview on June 23, 2022, the examiner agrees with this assertion.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically taught as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
	
Claims 1-6, 8-10, 13, 15-17, 19, 21-24, and 26-27 are rejected under 35 U.S.C. 103 as being unpatentable over Thomson (U.S. Patent No. 10388272) in view of Case (U.S. Publication No. 20200380369).
Regarding claim 1, Thomson discloses a method performed by one or more processors of a client device, (Col 6 – Rows 50-54 – each of the first device 104 and the second device 106 may include memory and at least one processor, which are configured to perform operations as described in this disclosure) the method comprising:
receiving, via one or more microphones of the client device, audio data that captures a spoken utterance of a user of the client device (Col 7 – Rows 19-22 – The first audio may include a first voice of the first user 110…the first audio from a microphone of the first device 104);
processing the audio data to generate a predicted textual segment that is a prediction of the spoken utterance (Col 8 – Rows 17-19 – the transcription system 108 may be configured to obtain audio from a device, generate or direct generation of a transcription of the audio),
wherein processing the audio data to generate the predicted textual segment comprises (Col 8 – Rows 17-19 – the transcription system 108 may be configured to obtain audio from a device, generate or direct generation of a transcription of the audio):
and determining the predicted textual segment based on the predicted output (Col 8 – Rows 25-28 – the transcription system 108 may be configured to generate or direct generation of the transcription of audio using one or more automatic speech recognition (ASR) systems);
causing at least part of the predicted textual segment to be visually rendered at a display of the client device (Col 5 – Rows 29-30 – single transcription that is provided to a device for display to a user);
receiving, responsive to the at least part of the predicted textual segment being visually rendered, further user interface input that is a correction of the predicted textual segment to an alternate textual segment (Col 20 – Rows 14-16 – The first device 104 may be configured to receive input from the first user 110 such that the first user 110 may mark words that were transcribed incorrectly);
and responsive to the further user interface input being the correction of the predicted textual segment to the alternate textual segment (Col 20 – Rows 20-21 – user feedback may be used to improve accuracy):
and updating one or more weights of the speech recognition model based on the generated gradient (Figure 83 – Interpolation Weight Estimator 8904, [1128] - In these and other embodiments, the on-the-fly interpolation may use a gradient descent algorithm to adjust the interpolation weights).
However, Thomson does not disclose a method of processing, using a speech recognition model stored locally at the client device, the audio data to generate a predicted output;
and generating a gradient based on comparing at least part of the predicted output to ground truth output that corresponds to the alternate textual segment.
Case teaches a method of processing, using a speech recognition model stored locally at the client device, the audio data to generate a predicted output (Figure 14 – Input Devices 1408; [0098] - hardware resources described above may be configured as a service to allow users to train or performing inferencing of information, such as…speech recognition; [0209] - computer system 1200 may include, without limitation, processor 1202 that may include, without limitation, one or more execution units 1208 to perform machine learning model training and/or inferencing);
and generating a gradient based on comparing at least part of the predicted output to ground truth output that corresponds to the alternate textual segment ([0075] - a gradient is computed based on an error that is computed using ground truth data).
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the application to modify the teaching of Thomson to include the teachings of Case in order to implement a method of processing, using a speech recognition model stored locally at the client device, the audio data to generate a predicted output; and generating a gradient based on comparing at least part of the predicted output to ground truth output that corresponds to the alternate textual segment. Doing so allows users to train or perform inferencing of information directly on hardware (Case [0098]).
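For illustration only, the two limitations addressed by the combination (generating a gradient by comparing predicted output to ground truth output, and updating weights based on that gradient) can be sketched as follows. This is a minimal sketch under stated assumptions, not the method of Thomson, Case, or the application: the "speech recognition model" is reduced to a toy linear softmax classifier over a small vocabulary, and every name (`predict`, `generate_gradient`, `update_weights`, `learning_rate`) is hypothetical.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(weights, features):
    # Predicted output: one probability per vocabulary token.
    # `weights` holds one row of coefficients per token (toy stand-in
    # for a locally stored model; an assumption for illustration).
    logits = [sum(w * x for w, x in zip(row, features)) for row in weights]
    return softmax(logits)

def generate_gradient(weights, features, target_index):
    # Compare the predicted output with a one-hot ground truth that
    # corresponds to the corrected token; for cross-entropy loss the
    # gradient of row i is (p_i - y_i) * features.
    probs = predict(weights, features)
    gradient = []
    for i, p in enumerate(probs):
        error = p - (1.0 if i == target_index else 0.0)
        gradient.append([error * x for x in features])
    return gradient

def update_weights(weights, gradient, learning_rate=0.5):
    # One gradient-descent step on the local weights.
    return [
        [w - learning_rate * g for w, g in zip(w_row, g_row)]
        for w_row, g_row in zip(weights, gradient)
    ]
```

In this sketch, one update step raises the probability the model assigns to the corrected token for the same input features, which is the intended effect of training on a user correction.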
Regarding claim 2, Thomson in view of Case teaches all limitations of claim 1, above.
Thomson teaches the method, further comprising:
determining that the correction is directed to performance of the speech recognition model (Figure 25 – Performance Tracker 2508, Col 26 – Rows 6-9 – predicted ASR system accuracy for the speaker which may be based on or include previous ASR system accuracy for the speaker, and the CA’s estimated performance),
wherein generating the gradient and updating the one or more weights is further responsive to determining that the correction is directed to performance of the speech recognition model (Figure 83 – Interpolation Weight Estimator 8904, Language Model Trainer 8920, [1128] - In these and other embodiments, the on-the-fly interpolation may use a gradient descent algorithm to adjust the interpolation weights).
Regarding claim 3, Thomson in view of Case teaches all limitations of claim 2, above.
Thomson teaches the method, wherein the display is a touch-display (Table I – 9. A CA client generating a transcription via an input device (e.g., … touch screen)),
wherein the further user interface input comprises one or more touch inputs, directed at the display of the device, to modify a term of the predicted textual segment to create a modified term and/or to replace a term of the predicted textual segment with a replacement term (Table I – 9. A CA client generating a transcription via an input device (e.g., … touch screen), Col 20 – Rows 14-16 – The first device 104 may be configured to receive input from the first user 110 such that the first user 110 may mark words that were transcribed incorrectly),
and wherein the alternate textual segment includes the modified term or the replacement term (Col 20 – Rows 25-28 – In some embodiments, either one or both of the first device 104 and the second device 106 may be configured to display a selected message before, during, or after transcriptions are received from the transcription system 108).
Regarding claim 4, Thomson in view of Case teaches all limitations of claim 3, above.
Thomson teaches the method, wherein determining that the correction is directed to performance of the speech recognition model comprises:
determining a measure of similarity between the term of the predicted textual segment and the modified term or the replacement term of the alternate textual segment (Figure 49 – Accuracy Measurement Service 5316, example environment for measuring accuracy of a transcription service);
and determining that the correction is directed to performance of the speech recognition model based on the measure of similarity satisfying a threshold (Figure 24 – Threshold 2410, Figure 25 – Performance Tracker 2508, Col 26 – Rows 6-9 – predicted ASR system accuracy for the speaker which may be based on or include previous ASR system accuracy for the speaker, and the CA’s estimated performance, [224] - In some embodiments, the accuracy requirement may be associated with a selection threshold value).
Regarding claim 5, Thomson in view of Case teaches all limitations of claim 4, above.
Thomson teaches the method, wherein determining the measure of similarity between the term of the predicted textual segment and the modified term or the replacement term of the alternate textual segment comprises:
determining an acoustic similarity between the term and the modified term or the replacement term (Col 25 – Rows 27-50 – each profile may include one or more of: levels of multiple skills such as speed, accuracy, an ability to revoice communication sessions in noise or in other adverse acoustic environments),
determining an edit distance similarity between the term and the modified term or the replacement term ([362] – Table 5 – a edit distance…between two transcriptions may be a feature),
and/or determining whether the modified term or the replacement term is a candidate term indicated by the predicted output ([311] - The ways in which a term may be added may include: adding an entry to the lexicon based on input from a CA, adding a term to a list of problem terms or difficult-to-recognize terms for training by a module used by the ASR system 1220, and obtaining a term from the text editor based on the term being applied as an edit or correction of a transcription);
[Page 3 of 11, Attorney Docket No. ZS202-20692, Preliminary Amendment]
and determining the measure of similarity based on the acoustic similarity, the edit distance similarity, and/or whether the modified term or the replacement term is a candidate term indicated by the predicted output (Col 25 – Rows 27-50 – each profile may include one or more of: levels of multiple skills such as speed, accuracy, an ability to revoice communication sessions in noise or in other adverse acoustic environments, [311] - The ways in which a term may be added may include: adding an entry to the lexicon based on input from a CA, adding a term to a list of problem terms or difficult-to-recognize terms for training by a module used by the ASR system 1220, and obtaining a term from the text editor based on the term being applied as an edit or correction of a transcription. [362] – Table 5 – a edit distance…between two transcriptions may be a feature).
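For illustration only, the similarity determination recited in claims 4 and 5 can be sketched as below. This is a hedged sketch under assumptions, not the method of the application or of Thomson: it covers only the edit distance similarity and the candidate-term check (acoustic similarity is omitted), and the function names, the `candidate_bonus` weighting, and the `threshold` value are all hypothetical.

```python
def edit_distance(a, b):
    # Levenshtein distance computed row by row.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]

def edit_similarity(a, b):
    # Normalize the distance into a similarity in the range [0, 1].
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest

def similarity_measure(term, replacement, candidates, candidate_bonus=0.25):
    # Combine edit similarity with whether the replacement was already a
    # candidate term in the predicted output (the bonus is an assumption).
    score = edit_similarity(term, replacement)
    if replacement in candidates:
        score += candidate_bonus
    return min(score, 1.0)

def is_model_correction(term, replacement, candidates, threshold=0.5):
    # Treat the edit as directed to model performance when the measure
    # of similarity satisfies the (hypothetical) threshold.
    return similarity_measure(term, replacement, candidates) >= threshold
```

Under these assumptions, replacing "their" with "there" would be treated as a correction of a recognition error, while replacing "their" with an unrelated word would not.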
Regarding claim 6, Thomson in view of Case teaches all limitations of claim 1, above.
Thomson teaches the method, further comprising:
determining the alternate textual segment is an alternate predicted textual segment based on the predicted output, and is in addition to the predicted textual segment ([353] - Another additional criterion may be to consider tokens from an alternate hypothesis generated by an ASR system. For example, an ASR system may generate multiple ranked hypotheses for a segment of audio);
causing at least part of the alternate predicted textual segment to be visually rendered, at the display of the client device, along with rendering of the predicted textual segment ([783] - The user device may display the transcription or correction on the display and/or it may store it in a storage location such as a display buffer or audio record);
and wherein the further user interface input comprises a selection of the at least part of the alternate predicted textual segment in lieu of selection of the at least part of the predicted textual segment ([512] - one or more of the ASR systems 2120 may generate additional information such as: 1. Alternate transcriptions… [513] - The additional information may be provided to the selector 2106 for use in determining control decisions).
Regarding claim 8, Thomson in view of Case teaches all limitations of claim 1, above.
Thomson teaches the method, wherein the further user interface input comprises a further spoken utterance, of the user, that is captured in further audio data received via one or more of the microphones, and further comprising (Col 7 – Rows 19-22 – The first audio may include a first voice of the first user 110…the first audio from a microphone of the first device 104, [112] -  In some embodiments, the audio used by the ASR systems may be revoiced audio. Revoiced audio may include audio that has been received by the transcription system 108 and gone through a revoicing process. The revoicing process may include the transcription system 108 obtaining audio from either one or both of the first device 104 and the second device 106):
processing, using the speech recognition model stored locally at the client device, the further audio data to generate a further predicted output (Col 8 – Rows 25-28 – the transcription system 108 may be configured to generate or direct generation of the transcription of audio using one or more automatic speech recognition (ASR) systems);
determining, based on the further predicted output, that the alternate textual segment is a candidate prediction for the further spoken utterance ([311] - The ways in which a term may be added may include: adding an entry to the lexicon based on input from a CA, adding a term to a list of problem terms or difficult-to-recognize terms for training by a module used by the ASR system 1220, and obtaining a term from the text editor based on the term being applied as an edit or correction of a transcription);
and determining that further user interface input is a correction of the predicted textual segment to the alternate textual segment based at least in part on: the alternate textual segment being a candidate prediction for the further spoken utterance ([114] - In some embodiments, revoiced audio may be provided to a speaker-independent ASR system. In these and other embodiments, the speaker-independent ASR system may not be specifically trained using speech patterns of the CA revoicing the audio. Alternatively or additionally, revoiced audio may be provided to a speaker-dependent ASR system. In these and other embodiments, the speaker-dependent ASR system may be specifically trained using speech patterns of the CA revoicing the audio);
and determining that the further spoken utterance is a repeat of the spoken utterance ([114] - In some embodiments, revoiced audio may be provided to a speaker-independent ASR system. In these and other embodiments, the speaker-independent ASR system may not be specifically trained using speech patterns of the CA revoicing the audio. Alternatively or additionally, revoiced audio may be provided to a speaker-dependent ASR system. In these and other embodiments, the speaker-dependent ASR system may be specifically trained using speech patterns of the CA revoicing the audio).
Regarding claim 9, Thomson in view of Case teaches all limitations of claim 8, above.
Thomson teaches the method, further comprising:
determining, based on the predicted output, that the alternate textual segment is also a candidate prediction for the spoken utterance ([311] - The ways in which a term may be added may include: adding an entry to the lexicon based on input from a CA, adding a term to a list of problem terms or difficult-to-recognize terms for training by a module used by the ASR system 1220, and obtaining a term from the text editor based on the term being applied as an edit or correction of a transcription);
wherein determining that the further spoken utterance is the repeat of the spoken utterance is based at least in part on determining that the alternate textual segment is a candidate prediction for both the spoken utterance and the additional spoken utterance ([114] - In some embodiments, revoiced audio may be provided to a speaker-independent ASR system. In these and other embodiments, the speaker-independent ASR system may not be specifically trained using speech patterns of the CA revoicing the audio. Alternatively or additionally, revoiced audio may be provided to a speaker-dependent ASR system. In these and other embodiments, the speaker-dependent ASR system may be specifically trained using speech patterns of the CA revoicing the audio).
Regarding claim 10, Thomson in view of Case teaches all limitations of claim 8, above.
Thomson teaches the method, wherein determining that the further spoken utterance is the repeat of the spoken utterance is based on:
acoustic similarity between the audio data and the further audio data (Col 25 – Rows 27-50 – each profile may include one or more of: levels of multiple skills such as speed, accuracy, an ability to revoice communication sessions in noise or in other adverse acoustic environments),
output similarity between the predicted output and the further predicted output ([521] - The scorer 2216 may be configured to evaluate similarity between two token strings, such as two transcriptions),
and/or a duration of time between the spoken utterance and the further spoken utterance ([596] - duration of measurement window for predicting or evaluating the accuracy).
Regarding claim 13, Thomson in view of Case teaches all limitations of claim 1, above.
Thomson teaches the method, further comprising:
transmitting, over a network to a remote system, the generated gradient without transmitting any of: the predicted textual segment, the audio data, and the alternate textual segment ([1034] - example environment 7500 for training models from fused transcriptions… processing center 7501, [1035] - The model trainer 7522 may update the models 7504 on-the-fly, using, for example gradient descent or other iterative methods, [1128] - In these and other embodiments, on-the-fly interpolation weight determination may avoid recording audio or text),
wherein the remote system utilizes the generated gradient, and additional gradients from additional client devices, to update global weights of a global speech recognition model ([1035] - The model trainer 7522 may update the models 7504 on-the-fly, using, for example gradient descent or other iterative methods, [1128] - In these and other embodiments, on-the-fly interpolation weight determination may avoid recording audio or text).
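For illustration only, the remote-system behavior recited in claim 13 (using the generated gradient and additional gradients from additional client devices to update global weights) can be sketched as below. This is a minimal sketch under assumptions, not a description of Thomson or of the application: gradients are flat lists of floats, aggregation is a plain unweighted mean, and the names `average_gradients`, `update_global_weights`, and `learning_rate` are hypothetical.

```python
def average_gradients(client_gradients):
    # Element-wise mean of the gradients received from client devices.
    # Only gradients are received; no audio or text is involved.
    count = len(client_gradients)
    if count == 0:
        raise ValueError("at least one client gradient is required")
    size = len(client_gradients[0])
    return [
        sum(gradient[i] for gradient in client_gradients) / count
        for i in range(size)
    ]

def update_global_weights(global_weights, client_gradients, learning_rate=0.1):
    # One gradient-descent step on the global weights using the
    # aggregated client gradients.
    averaged = average_gradients(client_gradients)
    return [w - learning_rate * g for w, g in zip(global_weights, averaged)]
```

The updated global weights returned by such a step are what a client device would later receive and store locally under claims 15 and 16.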
Regarding claim 15, Thomson in view of Case teaches all limitations of claim 13, above.
Thomson teaches the method, further comprising:
receiving, at the client device and from the remote system, the global speech recognition model, wherein receiving the global speech recognition model is subsequent to the remote system updating the global weights of the global speech recognition model based on the generated gradient and the additional gradients ([1126] -  In some embodiments, a language model trainer 8920 may create or adapt the domain language model 8901 using the communication session data from the current communication session or the current and past communication sessions. [1128] - In these and other embodiments, the on-the-fly interpolation may use a gradient descent algorithm to adjust the interpolation weights);
and responsive to receiving the global speech recognition model, replacing, in local storage of the client device, the speech recognition model with the global speech recognition model ([542] - A transcription service provider establishes a global metric of minimizing cost while providing overall accuracy).
Regarding claim 16, Thomson in view of Case teaches all limitations of claim 13, above.
Thomson teaches the method, further comprising:
receiving, at the client device and from the remote system, the updated global weights, wherein receiving the updated global weights is subsequent to the remote system updating the global weights of the global speech recognition model based on the generated gradient and the additional gradients ([1126] -  In some embodiments, a language model trainer 8920 may create or adapt the domain language model 8901 using the communication session data from the current communication session or the current and past communication sessions. [1128] - In these and other embodiments, the on-the-fly interpolation may use a gradient descent algorithm to adjust the interpolation weights);
and responsive to receiving the updated global weights, replacing in local storage of the client device weights of the speech recognition model with the updated global weights ([1126] -  In some embodiments, a language model trainer 8920 may create or adapt the domain language model 8901 using the communication session data from the current communication session or the current and past communication sessions. [1145] - In some embodiments, the processor 9110 may be configured to interpret and/or execute program instructions and/or process data stored in the memory 9112, the data storage 9114).
Regarding claim 17, Thomson in view of Case teaches all limitations of claim 1, above.
Thomson teaches the method, further comprising:
determining, based on sensor data from one or more sensors of the client device, that a current state of the client device satisfies one or more conditions ([1141] - The transcription unit may correct errors in the ASR result to create a corrected transcription, which may be sent back to the user device for display or to correct previously displayed transcriptions),
wherein generating the gradient, and/or updating the one or more weights are performed responsive to determining that the current state of the client device satisfies the one or more conditions ([1128] – In some embodiments, interpolation weight determination by the interpolation weight estimator 8904 may use on-the-fly interpolation where interpolation weights are assigned a set of initial values and adjusted based on data from each communication session).
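For illustration only, the gating recited in claim 17 (performing gradient generation and/or weight updates responsive to the device state satisfying one or more conditions) can be sketched as below. The specific conditions shown (idle, charging, battery level) are assumptions chosen for the example and are not taken from Thomson, Case, or the claim language; `DeviceState`, `conditions_satisfied`, and `maybe_train` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class DeviceState:
    # Hypothetical snapshot derived from device sensor data.
    is_charging: bool
    battery_level: float  # 0.0 (empty) to 1.0 (full)
    is_idle: bool

def conditions_satisfied(state, min_battery=0.8):
    # Assumed conditions: the device is idle and is either charging
    # or has at least `min_battery` charge remaining.
    return state.is_idle and (
        state.is_charging or state.battery_level >= min_battery
    )

def maybe_train(state, pending_corrections, train_step):
    # Run the training step for each queued correction only when the
    # conditions hold; otherwise leave the queue untouched.
    if not conditions_satisfied(state):
        return 0
    for correction in pending_corrections:
        train_step(correction)
    processed = len(pending_corrections)
    pending_corrections.clear()
    return processed
```

Deferring the work in this way keeps corrections queued until the device can train without affecting the user, which is one plausible reason for conditioning the update on device state.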
Regarding claim 19, Thomson discloses a method performed by one or more processors of a client device, the method comprising (Col 6 – Rows 50-54 – In some embodiments, each of the first device 104 and the second device 106 may include memory and at least one processor, which are configured to perform operations as described in this disclosure):
receiving, via one or more microphones of the client device, audio data that captures a spoken utterance of a user of the client device (Col 7 – Rows 19-22 – The first audio may include a first voice of the first user 110…the first audio from a microphone of the first device 104);
processing the audio data to generate a predicted textual segment that is a prediction of the spoken utterance (Col 8 – Rows 17-19 – the transcription system 108 may be configured to obtain audio from a device, generate or direct generation of a transcription of the audio),
wherein processing the audio data to generate the predicted textual segment comprises (Col 8 – Rows 17-19 – the transcription system 108 may be configured to obtain audio from a device, generate or direct generation of a transcription of the audio):
and determining the predicted textual segment based on the predicted output (Col 8 – Rows 25-28 – the transcription system 108 may be configured to generate or direct generation of the transcription of audio using one or more automatic speech recognition (ASR) systems);
causing at least part of the predicted textual segment to be visually rendered at a display of the client device (Col 5 – Rows 29-30 – single transcription that is provided to a device for display to a user);
receiving, responsive to the at least part of the predicted textual segment being visually rendered, further user interface input that is a correction of the predicted textual segment to an alternate textual segment (Col 20 – Rows 14-16 – The first device 104 may be configured to receive input from the first user 110 such that the first user 110 may mark words that were transcribed incorrectly);
and responsive to the further user interface input being the correction of the predicted textual segment to the alternate textual segment (Col 20 – Rows 20-21 – user feedback may be used to improve accuracy):
and transmitting, over a network to a remote system, the generated gradient without transmitting any of ([1034] - example environment 7500 for training models from fused transcriptions… processing center 7501 [1035] - The model trainer 7522 may update the models 7504 on-the-fly, using, for example gradient descent or other iterative methods):
the predicted textual segment, the audio data, and the alternate textual segment ([1128] - In these and other embodiments, on-the-fly interpolation weight determination may avoid recording audio or text),
wherein the remote system utilizes the generated gradient, and additional gradients from additional client devices, to update global weights of a global speech recognition model ([1035] - The model trainer 7522 may update the models 7504 on-the-fly, using, for example gradient descent or other iterative methods. [1128] - In these and other embodiments, the on-the-fly interpolation may use a gradient descent algorithm to adjust the interpolation weights).
However, Thomson does not disclose a method of processing, using a speech recognition model stored locally at the client device, the audio data to generate a predicted output;
and generating a gradient based on comparing at least part of the predicted output to ground truth output that corresponds to the alternate textual segment.
Case teaches a method of processing, using a speech recognition model stored locally at the client device, the audio data to generate a predicted output (Figure 14 – Input Devices 1408; [0098] - hardware resources described above may be configured as a service to allow users to train or performing inferencing of information, such as…speech recognition; [0209] - computer system 1200 may include, without limitation, processor 1202 that may include, without limitation, one or more execution units 1208 to perform machine learning model training and/or inferencing);
and generating a gradient based on comparing at least part of the predicted output to ground truth output that corresponds to the alternate textual segment ([0075] - a gradient is computed based on an error that is computed using ground truth data).
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the application to modify the teaching of Thomson to include the teachings of Case in order to implement a method of processing, using a speech recognition model stored locally at the client device, the audio data to generate a predicted output; and generating a gradient based on comparing at least part of the predicted output to ground truth output that corresponds to the alternate textual segment. Doing so allows users to train or perform inferencing of information directly on hardware (Case [0098]).
Regarding claim 21, Thomson in view of Case teaches all limitations of claim 19, above.
Thomson teaches the method, further comprising:
receiving, at the client device and from the remote system, the global speech recognition model, wherein receiving the global speech recognition model is subsequent to the remote system updating the global weights of the global speech recognition model based on the generated gradient and the additional gradients ([1126] -  In some embodiments, a language model trainer 8920 may create or adapt the domain language model 8901 using the communication session data from the current communication session or the current and past communication sessions. [1128] - In these and other embodiments, the on-the-fly interpolation may use a gradient descent algorithm to adjust the interpolation weights);
and responsive to receiving the global speech recognition model, replacing, in local storage of the client device, the speech recognition model with the global speech recognition model ([1126] -  In some embodiments, a language model trainer 8920 may create or adapt the domain language model 8901 using the communication session data from the current communication session or the current and past communication sessions. [1145] - In some embodiments, the processor 9110 may be configured to interpret and/or execute program instructions and/or process data stored in the memory 9112, the data storage 9114).
Regarding claim 22, Thomson in view of Case teaches all limitations of claim 19, above.
Thomson teaches the method, further comprising:
receiving, at the client device and from the remote system, the updated global weights, wherein receiving the updated global weights is subsequent to the remote system updating the global weights of the global end-to-end speech recognition model based on the gradient and the additional gradients ([1126] -  In some embodiments, a language model trainer 8920 may create or adapt the domain language model 8901 using the communication session data from the current communication session or the current and past communication sessions. [1128] - In these and other embodiments, the on-the-fly interpolation may use a gradient descent algorithm to adjust the interpolation weights);
and responsive to receiving the updated global weights, replacing in local storage of the client device weights of the speech recognition model with the updated global weights ([1126] -  In some embodiments, a language model trainer 8920 may create or adapt the domain language model 8901 using the communication session data from the current communication session or the current and past communication sessions. [1145] - In some embodiments, the processor 9110 may be configured to interpret and/or execute program instructions and/or process data stored in the memory 9112, the data storage 9114).
Regarding claim 23, Thomson in view of Case teaches all limitations of claim 19, above.
Thomson teaches the method, further comprising:
determining, based on sensor data from one or more sensors of the client device, that a current state of the client device satisfies one or more conditions ([1141] - The transcription unit may correct errors in the ASR result to create a corrected transcription, which may be sent back to the user device for display or to correct previously displayed transcriptions),
wherein generating the gradient, and/or updating the one or more weights are performed responsive to determining that the current state of the client device satisfies the one or more conditions ([1128] – In some embodiments, interpolation weight determination by the interpolation weight estimator 8904 may use on-the-fly interpolation where interpolation weights are assigned a set of initial values and adjusted based on data from each communication session).
Regarding claim 24, Thomson in view of Case teaches all limitations of claim 19, above.
Thomson teaches the method, further comprising:
determining that the correction is directed to performance of the speech recognition model (Figure 25 – Performance Tracker 2508, Col 26 – Rows 6-9 – predicted ASR system accuracy for the speaker which may be based on or include previous ASR system accuracy for the speaker, and the CA’s estimated performance),
wherein generating the gradient and updating the one or more weights is further responsive to determining that the correction is directed to performance of the speech recognition model (Figure 83 – Interpolation Weight Estimator 8904, Language Model Trainer 8920, [1128] - In these and other embodiments, the on-the-fly interpolation may use a gradient descent algorithm to adjust the interpolation weights).
Regarding claim 26, Thomson in view of Case teaches all limitations of claim 19, above.
Thomson teaches the method, further comprising:
determining the alternate predicted textual segment is an alternate predicted textual segment based on the predicted output, and is in addition to the predicted textual segment ([353] - Another additional criterion may be to consider tokens from an alternate hypothesis generated by an ASR system. For example, an ASR system may generate multiple ranked hypotheses for a segment of audio);
causing at least part of the alternate predicted textual segment to be visually rendered, at the display of the client device, along with rendering of the predicted textual segment ([783] - The user device may display the transcription or correction on the display and/or it may store it in a storage location such as a display buffer or audio record);
and wherein the further user interface input comprises a selection of the at least part of the alternate predicted textual segment in lieu of selection of the at least part of the predicted textual segment ([512] - one or more of the ASR systems 2120 may generate additional information such as: 1. Alternate transcriptions [513] - The additional information may be provided to the selector 2106 for use in determining control decisions).
Regarding claim 27, Thomson discloses a client device comprising (Col 6 – Rows 50-54 – In some embodiments, each of the first device 104):
at least one microphone (Col 7 – Rows 19-22 – microphone of the first device 104);
at least one display (Table I – 9. A CA client generating a transcription via an input device (e.g., … touch screen));
at least one speaker (Col 6 – Row 43 – a speakerphone… a smart speaker);
and one or more processors executing locally stored instructions to cause the processors to perform operations comprising (Col 6 – Rows 50-54 – In some embodiments, each of the first device 104 and the second device 106 may include memory and at least one processor, which are configured to perform operations as described in this disclosure):
receiving, via one or more microphones of the client device, audio data that captures a spoken utterance of a user of the client device (Col 7 – Rows 19-22 – The first audio may include a first voice of the first user 110…the first audio from a microphone of the first device 104);
processing the audio data to generate a predicted textual segment that is a prediction of the spoken utterance (Col 8 – Rows 17-19 – the transcription system 108 may be configured to obtain audio from a device, generate or direct generation of a transcription of the audio),
wherein processing the audio data to generate the predicted textual segment comprises (Col 8 – Rows 17-19 – the transcription system 108 may be configured to obtain audio from a device, generate or direct generation of a transcription of the audio):
and determining the predicted textual segment based on the predicted output (Col 8 – Rows 25-28 – the transcription system 108 may be configured to generate or direct generation of the transcription of audio using one or more automatic speech recognition (ASR) systems);
causing at least part of the predicted textual segment to be visually rendered at a display of the client device (Col 5 – Rows 29-30 – single transcription that is provided to a device for display to a user);
receiving, responsive to the at least part of the predicted textual segment being visually rendered, further user interface input that is a correction of the predicted textual segment to an alternate textual segment (Col 20 – Rows 14-16 – The first device 104 may be configured to receive input from the first user 110 such that the first user 110 may mark words that were transcribed incorrectly);
and responsive to the further user interface input being the correction of the predicted textual segment to the alternate textual segment (Col 20 – Rows 20-21 – user feedback may be used to improve accuracy):
and updating one or more weights of the speech recognition model based on the generated gradient (Figure 83 – Interpolation Weight Estimator 8904, Language Model Trainer 8920, [1128] - In these and other embodiments, the on-the-fly interpolation may use a gradient descent algorithm to adjust the interpolation weights).
However, Thomson does not disclose a client device, comprising: processing, using a speech recognition model stored locally at the client device, the audio data to generate a predicted output;
and generating a gradient based on comparing at least part of the predicted output to ground truth output that corresponds to the alternate textual segment.
Case does teach a client device, comprising: processing, using a speech recognition model stored locally at the client device, the audio data to generate a predicted output (Figure 14 – Input Devices 1408; [0098] - hardware resources described above may be configured as a service to allow users to train or performing inferencing of information, such as…speech recognition; [0209] - computer system 1200 may include, without limitation, processor 1202 that may include, without limitation, one or more execution units 1208 to perform machine learning model training and/or inferencing);
and generating a gradient based on comparing at least part of the predicted output to ground truth output that corresponds to the alternate textual segment ([0075] - a gradient is computed based on an error that is computed using ground truth data).
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the application to modify the teaching of Thomson to include the teachings of Case in order to implement a client device comprising: processing, using a speech recognition model stored locally at the client device, the audio data to generate a predicted output; and generating a gradient based on comparing at least part of the predicted output to ground truth output that corresponds to the alternate textual segment. Doing so allows users to train or perform inferencing of information directly on hardware (Case [0098]).
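For illustration only, the two limitations discussed above (generating a gradient by comparing predicted output to ground truth output that corresponds to a user correction, and updating locally stored weights based on that gradient) can be sketched in general terms as follows. This is a minimal sketch under stated assumptions and does not reproduce any implementation from Thomson, Case, or the present application: the single-layer softmax model, the cross-entropy loss, the learning rate, and every function and variable name are hypothetical.

```python
import math


def predict(weights, features):
    """Return a probability distribution over tokens (the predicted output)."""
    scores = [sum(w * x for w, x in zip(row, features)) for row in weights]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gradient_from_correction(weights, features, corrected_token):
    """Cross-entropy gradient for a softmax layer.

    The ground truth is a one-hot vector for the token the user selected
    as the correction; each gradient row is (predicted - truth) * features.
    """
    probs = predict(weights, features)
    return [
        [(p - (1.0 if k == corrected_token else 0.0)) * x for x in features]
        for k, p in enumerate(probs)
    ]


def apply_update(weights, grad, lr=0.1):
    """One gradient-descent step on the locally stored weights."""
    return [
        [w - lr * g for w, g in zip(w_row, g_row)]
        for w_row, g_row in zip(weights, grad)
    ]


# Assumed toy data: two candidate tokens, two acoustic features.
local_weights = [[0.0, 0.0], [0.0, 0.0]]
audio_features = [1.0, 0.5]
before = predict(local_weights, audio_features)
grad = gradient_from_correction(local_weights, audio_features, corrected_token=1)
local_weights = apply_update(local_weights, grad)
after = predict(local_weights, audio_features)
```

After the single update step, the probability assigned to the corrected token is higher than before, which is the intended effect of training on the correction.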
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Davis (U.S. Publication No. 20120277893) teaches channelized audio watermarks. Kumar (U.S. Patent No. 10335572) teaches systems and methods for computer assisted operation. Mirowski (U.S. Publication No. 20120150532) teaches a system and method for feature-rich continuous space language models. Ramer (U.S. Patent No. 9703892) teaches predictive text completion for a mobile communication facility. Sicconi (U.S. Publication No. 20200057287) teaches methods and systems for using artificial intelligence to evaluate, correct, and monitor user attentiveness. Tran (U.S. Patent No. 10325596) teaches voice control of appliances. Tran (U.S. Publication No. 20170312614) teaches a smart device.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ETHAN DANIEL KIM whose telephone number is (571) 272-1405.  The examiner can normally be reached on Monday - Friday 9:00 - 5:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Richemond Dorvil can be reached on (571) 272-7602. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/ETHAN DANIEL KIM/
Examiner, Art Unit 2658

/RICHEMOND DORVIL/
Supervisory Patent Examiner, Art Unit 2658