DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.



Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claim(s) 1-4, 7-14, 16-19 is/are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Casado, U.S. Patent No. 8,862,467 B1.
Regarding claim 1 Casado teaches a method implemented by one or more processors, the method comprising: 
receiving, via one or more microphones of a computing device of a user, audio data corresponding to a spoken utterance of the user (request to transcribe spoken input from a user of a computing device, see abstract); 
processing, using an automatic speech recognition (ASR) model, the audio data corresponding to the spoken utterance to generate, for one or more parts of the spoken utterance, a plurality of speech hypotheses based on values generated using the ASR model (determine, based on the information that characterizes the spoken input, multiple hypotheses that each represent a possible textual transcription of the spoken input, see abstract); 
selecting, from among the plurality of speech hypotheses, a given speech hypothesis, the given speech hypothesis being predicted to correspond to one or more of the parts of the spoken utterance based on the values (select, based on the context information, one or more of the multiple hypotheses for the spoken input as one or more likely intended hypotheses for the spoken input, see abstract); 
causing the given speech hypothesis to be incorporated as a portion of a transcription, the transcription being associated with a software application that is accessible by at least the computing device, and the transcription being visually rendered at a user interface of the computing device of the user (receiving, by a server system, a first transcription request and a second transcription request, each of the first and second transcription requests including (i) respective information that characterizes respective spoken input from a user of a computing device, and (ii) respective context information associated with the user or the computing device, present a list of the hypotheses to the user see col. 8 lines 45-50); 
storing the plurality of speech hypotheses in memory that is accessible by at least the computing device (The memory 564 stores information within the mobile computing device 550. The memory 564 can be implemented as one or more of a computer-readable medium or media, a volatile memory unit or units, or a non-volatile memory unit or units, see col. 17 lines 28-33); 
and transmitting the plurality of speech hypotheses, wherein transmitting the plurality of speech hypotheses causes the plurality of speech hypotheses to be loaded at an additional computing device of the user when the transcription associated with the software application is accessed at the additional computing device, the additional computing device being in addition to the computing device (the one or more hypotheses to be sent to the client device in response to the request, see col. 3 lines 52-55).  
Attorney Docket No. ZS202-20835
Regarding claim 2 Casado teaches the method of claim 1, further comprising: determining a respective confidence level associated with each of the plurality of speech hypotheses, for one or more of the parts of the spoken utterance, based on the values generated using the ASR model, wherein selecting the given speech hypothesis, from among the plurality of speech hypotheses, predicted to correspond to one or more of the parts of the spoken utterance is based on the respective confidence level associated with each of the plurality of speech hypotheses (the one or more hypotheses may be assigned scores that reflect a confidence as to how likely each hypothesis likely matches the user's intended input, see col. 6 lines 14-16).  
Regarding claim 3 Casado teaches the method of claim 2, wherein storing the plurality of speech hypotheses in the memory that is accessible by at least the computing device is in response to determining that the respective confidence level for two or more of the plurality of speech hypotheses, for one or more of the part of the spoken utterance, are within a threshold range of confidence levels (the transcription server may identify and score only terms that are associated with the context information. For example, each of the contacts "Sam Forester," "Barnabas Smith," "Cameron Callie," and "Barack Stevens" may be tested against the spoken input, and one or more of the contacts that score highest or that exceed a certain threshold score may be selected and returned to the client device 124, see col. 7, lines 43-49).  
Regarding claim 4 Casado teaches the method of claim 2, wherein storing the plurality of speech hypotheses in the memory that is accessible by at least the computing device is in response to determining that the respective confidence level for each of the plurality of speech hypotheses, for the part of the spoken utterance, fail to satisfy a threshold confidence level (In some implementations, the context information may also be used to identify other terms to include in the likely intended hypothesis that were not identified in the set of initial hypotheses, see col. 15, lines 20-23).  

Regarding claim 7 Casado teaches the method of claim 2, wherein storing the plurality of speech hypotheses in the memory that is accessible by at least the computing device comprises storing each the plurality of speech hypotheses in association with the respective confidence level in the memory that is accessible by at least the computing device (the hypothesis generator 228 may have identified a set of possible hypotheses and assigned confidence scores to each of the possible hypotheses, see col. 12, lines 22-27).  
Regarding claim 8 Casado teaches the method of claim 2, further comprising: 
receiving, via one or more additional microphones of the additional computing device, additional audio data corresponding to an additional spoken utterance of the user (receiving, by the computer system at a later time, a second request to transcribe spoken input from the user of the computing device, see col. 2, lines 1-3); 
processing, using the ASR model, the additional audio data corresponding to the additional spoken utterance to generate, for an additional part of the additional spoken utterance, a plurality of additional speech hypotheses based on additional values generated using the ASR model (the second request including (i) information that characterizes a second spoken input, and (ii) second context information associated with the user or the computing device. The method can determine, based on the information that characterizes the second spoken input, multiple hypotheses that each represent a possible textual transcription of the second spoken input, see col. 2 lines 3-9); 
and modifying the given speech hypothesis, for the part of the spoken utterance, incorporated as the portion of the transcription based on the plurality of additional speech hypotheses (Based on the second context information and to the exclusion of the first context information, the method can select one or more of the multiple hypotheses for the second spoken input as one or more likely intended hypotheses for the second spoken input, and the method can include sending the one or more likely intended hypotheses for the second spoken input to the computing device, see col. 2 lines 9-17).  
Regarding claim 9 Casado teaches the method of claim 8, wherein modifying the given speech hypothesis incorporated as the portion of the transcription based on the plurality of additional speech hypotheses comprises: 
selecting an alternate speech hypothesis, from among the plurality of speech hypotheses, based on the respective confidence level associated with each of the plurality of speech hypotheses and based on the plurality of additional speech hypotheses (For each of the first and second transcription requests, the method can include determining, based on the respective information that characterizes the respective spoken input, a plurality of possible textual transcriptions for the respective spoken input, and selecting, based on the respective context information, one or more of the plurality of possible textual transcriptions as likely intended textual transcriptions for the respective spoken input, see col. 2 lines 49-56); 
and replacing the given speech hypothesis with the alternate speech hypothesis, for one or more of the parts of the spoken utterance, in the transcription (the context information can be used to bias the scores of the one or more hypotheses that were determined at operation B (134) so that the hypotheses are re-ranked based on the context. For instance, FIG. 1B shows that "Cameron Callie" is promoted from being a relatively low-confidence hypothesis in the initial scores 136 to being the highest-ranked, and highly confident, hypothesis in table 140 as a result of weighting the hypotheses using current context from the recent calls list, see col. 7 lines 9-17).  
Regarding claim 10 Casado teaches the method of claim 9, further comprising: selecting, from among one or more of the additional speech hypotheses, an additional given speech hypothesis, the additional given speech hypothesis being predicted to correspond to one or more of the additional parts of the additional spoken utterance (For each of the first and second transcription requests, the method can include determining, based on the respective information that characterizes the respective spoken input, a plurality of possible textual transcriptions for the respective spoken input, and selecting, based on the respective context information, one or more of the plurality of possible textual transcriptions as likely intended textual transcriptions for the respective spoken input, see col. 2 lines 48-56); 
and causing the additional given speech hypothesis to be incorporated as an additional portion of the transcription, wherein the additional portion of the transcription positionally follows the portion of the transcription (The user 122 can subsequently direct the client device 102 to send additional transcription requests to the transcription system 106. Each time the client device 102 submits a request, context information may be provided with the request. Context information submitted with other requests is generally not used by the transcription system 106 to process additional requests, see col. 9 lines 4-6). 
Regarding claim 11 Casado teaches the method of claim 1, further comprising: generating a finite state decoding graph that includes a respective confidence level associated with each of the plurality of speech hypotheses based on the values generated using the ASR model, wherein selecting the given speech hypothesis, from among the plurality of speech hypotheses, is based on the finite state decoding graph (The word lattice 400 is represented here as a finite state transducer, see col. 14 lines 19-27).  
Regarding claim 12 Casado teaches the method of claim 11, wherein storing the plurality of speech hypotheses in the memory that is accessible by at least the computing device comprises storing the finite state decoding graph in the memory that is accessible by at least the computing device (One manner in which the initial hypotheses can be determined is by using a word lattice, as shown in FIG. 4, see col. 14 lines 19-26).  
Regarding claim 13 Casado teaches the method of claim 11, further comprising: receiving, via one or more additional microphones of the additional computing device (Each device 102a-c may have an integrated microphone, external microphone, or other means for capturing spoken input from a user, see col. 4, lines 50-52), additional audio data corresponding to an additional spoken utterance of the user (second transcription request, see col. 2 lines 1-10); 
processing, using the ASR model, the additional audio data corresponding to the additional spoken utterance to generate one or more additional speech hypotheses based on additional values generated using the ASR model (The context of the computing device that defines both the first context information and the second context information may be unchanged between a time when the computing device submits the first request and a later time when the computing device submits the second request such that the first context information is equivalent to the second context information, see col. 2 lines 30-26); 
and modifying the given speech hypothesis, for one or more of the parts of the spoken utterance, incorporated as the portion of the transcription based on one or more of the additional speech hypotheses (For each of the first and second transcription requests, the method can include determining, based on the respective information that characterizes the respective spoken input, a plurality of possible textual transcriptions for the respective spoken input, and selecting, based on the respective context information, one or more of the plurality of possible textual transcriptions as likely intended textual transcriptions for the respective spoken input, see col. 3, lines 37-49).  
Regarding claim 14 Casado teaches the method of claim 13, wherein modifying the given speech hypothesis incorporated as the portion of the transcription based on one or more of the additional speech hypotheses comprises: adapting the finite state decoding graph based on one or more of the additional speech hypotheses to select an alternate speech hypothesis from among the plurality of speech hypotheses (Additional techniques for selecting one or more likely intended hypotheses based on current context can also be used. In some implementations, the transcription server 106 can identify which, if any, of the initial hypotheses are associated with the context information and re-score only these identified hypotheses, see col. 7 lines 29-34); 
and replacing the given speech hypothesis with the alternate speech hypothesis, for one or more of the parts of the spoken utterance, in the transcription (the context information can be used to exclude certain hypotheses from consideration as candidates for responding to a transcription request, see col. 7 lines 34-36).  
Regarding claim 16 Casado teaches the method of claim 13, further comprising: causing the computing device to visually render one or more graphical elements that indicate the given speech hypothesis, for one or more of the parts of the spoken utterance, was modified (At operation E (144), the transcription system 106 sends the one or more selected context-dependent hypotheses over the network and to the client device 102. Upon receiving the selected hypotheses, the client device 102 can take appropriate action. For example, because the user 122 provided the spoken input in a phone application, the device 102 can select "Cameron Callie" and automatically initiate a phone call to "Cameron Callie" using a known telephone number. The device 102 may also prompt the user to query whether the selected hypothesis is correct so that the user can confirm the transcription and whether to place a call to "Cameron Callie." In some examples, the transcription server 106 may return multiple hypotheses to the client device 102. The device 102 may then take action based on the highest-ranked result, or may present a list of the hypotheses to the user to enable the user to select one or more hypotheses from the list, see col. 8, lines 45-57).  
Regarding claim 17 Casado teaches the method of claim 1, wherein transmitting the plurality of speech hypotheses comprises: subsequent to causing the given speech hypothesis to be incorporated as the portion of the transcription associated with the software application: determining the transcription associated with the software application is accessed at the additional computing device (plurality of client computing devices 102a-c , each device may determine a context that may be relevant to the substance of the spoken input, see col. 4 lines 41-56); 
and causing the plurality of speech hypotheses, for one or more of the parts of the spoken utterance, to be transmitted to the additional computing device and from the memory that is accessible by at least the computing device (the transcription system 106 transmits the selected hypotheses to the client devices 102a-c, and in conjunction with transmitting the selected hypotheses, see col. 4 lines 51-64).  
Regarding claim 18 Casado teaches the method of claim 17, wherein the software application is associated with a third-party system, and wherein causing the plurality of speech hypotheses to be transmitted to the additional computing device comprises transmitting the plurality of speech hypotheses to the third-party system (the applications 222 may include third-party applications that are installed on the client device 202 (e.g., games, a preferred e-mail client or web browser, social networking applications), core applications that come pre-installed on the device 202 and that may be associated with an operating system on the client computing device 202 (e.g., phone or messaging applications, device contact managers, etc.), and web applications such as scripts or applets downloaded from a remote service, see col. 10, lines 61-67).  
Regarding claim 19 Casado teaches a method implemented by one or more processors, the method comprising: 
receiving, via one or more microphones of a computing device of a user, audio data corresponding to a spoken utterance of the user (the client device 212 is configured to receive spoken input. In some implementations, the input is received through a microphone 212 that is integrated in a body of the device 212, or that is otherwise connected to the device 212. When a user speaks, the device detects the speech as spoken input from the microphone 212, and stores information such as raw or compressed digital samples of the speech in speech buffer 214, see col. 9, lines 49-55); 
processing, using an automatic speech recognition (ASR) model, the audio data corresponding to the spoken utterance to generate, for one or more parts of the spoken utterance, a plurality of speech hypotheses based on values generated using the ASR model (the automatic speech recognizer ("ASR") 204 is configured to generate one or more transcription hypotheses in response to a request from the client computing device 202, see col. 11, lines 50-58); 
selecting, from among the plurality of speech hypotheses, a given speech hypothesis, the given speech hypothesis being predicted to correspond to one or more of the parts of the spoken utterance based on the values (The context information may then be used to limit the corpus of possible terms that the speech recognizer 204 uses to identify transcription hypotheses, for example, to commands, numbers, and times used by the alarm clock application, see col. 11, lines 28-33); 
causing the given speech hypothesis to be incorporated as a portion of a transcription, the transcription being visually rendered at a user interface of the computing device of the user (The device 102 may then take action based on the highest-ranked result, or may present a list of the hypotheses to the user to enable the user to select one or more hypotheses from the list, see col. 57-60);
determining that the spoken utterance is complete (upon receiving spoken input, see col. 4 lines 52-56); 
in response to determining that the spoken utterance is complete, storing one or more alternate speech hypotheses in memory that is accessible by the computing device, the one or more of alternate speech hypotheses including a subset of the plurality of speech hypotheses that excludes at least the given speech hypothesis (the context information can be used to exclude certain hypotheses from consideration as candidates for responding to a transcription request, see col. 7 lines 34-36); 
receiving, via one or more of the microphones of the computing device, additional audio data corresponding to an additional spoken utterance of the user (receiving, by the computer system at a later time, a second request to transcribe spoken input from the user of the computing device, see col. 2, lines 1-3); 
in response to receiving the additional audio data, loading one or more of the alternate speech hypotheses from the memory that is accessible by the computing device (the one or more hypotheses to be sent to the client device in response to the request, see col. 3 lines 52-55); 
causing an additional given speech hypothesis to be incorporated as an additional portion of the transcription, wherein the additional given speech hypothesis is selected, from among one or more additional speech hypotheses predicted to correspond to one or more additional parts of the additional spoken utterance (receiving, by a server system, a first transcription request and a second transcription request, each of the first and second transcription requests including (i) respective information that characterizes respective spoken input from a user of a computing device, and (ii) respective context information associated with the user or the computing device, present a list of the hypotheses to the user see col. 8 lines 45-50); 
and modifying, based on the additional given speech hypothesis, the portion of the transcription predicted to correspond to one or more of the parts of the spoken utterance to include a given alternate speech hypothesis, from among the one or more alternate speech hypotheses (Based on the second context information and to the exclusion of the first context information, the method can select one or more of the multiple hypotheses for the second spoken input as one or more likely intended hypotheses for the second spoken input, and the method can include sending the one or more likely intended hypotheses for the second spoken input to the computing device, see col. 2 lines 9-17).   
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 5, 6, 15, and 20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Casado, U.S. Patent No. 8,862,467 B1, in view of Roy, U.S. Patent No. 11,295,745.
Regarding claim 5 Casado teaches the method of claim 4, further comprising: graphically demarcating the portion of the transcription that includes the part of the spoken utterance corresponding to the given speech hypothesis, wherein graphically demarcating the portion of the transcription is in response to determining that the respective confidence level for each of the plurality of speech hypotheses, for the part of the spoken utterance, fail to satisfy a threshold confidence level.  
In the same field of endeavor Roy teaches the n-best list may only include entries for domains having a confidence score satisfying a minimum threshold confidence score. Alternatively, the shortlister component 350 may include entries for all domains associated with user enabled skills, even if one or more of the domains are associated with confidence scores that do not satisfy the minimum threshold confidence score, see col. 22 lines 5-13.
It would have been obvious to one of ordinary skill in the art to combine the Casado invention with the teachings of Roy for the benefit of improving human-machine interactions, see col. 1 lines 22-25.
Regarding claim 6 Casado teaches the method of claim 5, wherein graphically demarcating the portion of the transcription that includes the part of the spoken utterance corresponding to the given speech hypothesis comprises one or more of: highlighting the portion of the transcription, underlining the portion of the transcription, italicizing the portion of the transcription, or providing a selectable graphical element that, when selected, causes one or more additional speech hypotheses, from among the plurality of speech hypotheses, and that are in addition to the given speech hypothesis, to be visually rendered along with the portion of the transcription (To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having a display device for displaying information to the user and a keyboard and a pointing device by which the user can provide input to the computer, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input, see col. 18 lines 56-67).  
Regarding claim 15 Casado does not teach the method of claim 13, further comprising: selecting, from among one or more of the additional speech hypotheses, an additional given speech hypothesis, the additional given speech hypothesis being predicted to correspond to an additional portion of the additional spoken utterance; and causing the additional given speech hypothesis to be incorporated as an additional portion of the transcription, wherein the additional portion of the transcription positionally follows the portion of the transcription.  
In the same field of endeavor Roy teaches in FIG. 2B as illustrated shows specific components of the ASR component 250. As noted above, the ASR component 250 transcribes audio data into text data representing the words of the speech contained in the audio data. The text data may then be used by other components for various purposes, such as executing system commands, inputting data, etc., see col. 11, lines 47-55. Figure 8B is a signal flow diagram illustrating how a user request requiring pausing of a skill is processed according to embodiments of the present disclosure. The orchestrator 230 may send (822) audio data to the ASR component 250. The audio data may represent a subsequent utterance from the user, for example, “what is the time?” or “pause trivia.” The ASR component 250 may determine text data corresponding to the audio data, as described above, and may send (824) text data to the orchestrator 230. The orchestrator 230 may send (826) the text data and other related data to the NLU component 260, see col. 42 lines 53-63.
It would have been obvious to one of ordinary skill in the art to combine the Casado invention with the teachings of Roy for the benefit of improving human-machine interactions, see col. 1 lines 22-25.



Regarding claim 20 Casado teaches a method implemented by one or more processors, the method comprising: 
receiving, via one or more microphones of a computing device of a user, audio data corresponding to a spoken utterance of the user (the client device 212 is configured to receive spoken input. In some implementations, the input is received through a microphone 212 that is integrated in a body of the device 212, or that is otherwise connected to the device 212. When a user speaks, the device detects the speech as spoken input from the microphone 212, and stores information such as raw or compressed digital samples of the speech in speech buffer 214, see col. 9, lines 49-55); 
processing, using an automatic speech recognition (ASR) model, the audio data corresponding to the spoken utterance to generate, for one or more parts of the spoken utterance, a plurality of speech hypotheses based on values generated using the ASR model (the automatic speech recognizer ("ASR") 204 is configured to generate one or more transcription hypotheses in response to a request from the client computing device 202, see col. 11, lines 50-58); 
selecting, from among the plurality of speech hypotheses, a given speech hypothesis, the given speech hypothesis being predicted to correspond to one or more of the parts of the spoken utterance based on the values; causing the given speech hypothesis to be incorporated as a portion of a transcription, the transcription being associated with a software application that is accessible by at least the computing device, and the transcription being visually rendered at a user interface of the computing device of the user (receiving, by a server system, a first transcription request and a second transcription request, each of the first and second transcription requests including (i) respective information that characterizes respective spoken input from a user of a computing device, and (ii) respective context information associated with the user or the computing device, present a list of the hypotheses to the user see col. 8 lines 45-50).
However Casado does not teach storing the plurality of speech hypotheses in memory that is accessible by at least the computing device, wherein storing the plurality of speech hypotheses in memory that is accessible by at least the computing device causes, in response to the software application being deactivated and subsequently activated at the computing device of the user, the software application to load the plurality of speech hypotheses.
In the same field of endeavor Roy teaches the post-NLU ranker 265 may then cause the system to solicit the user for an indication that the system is permitted to cause the transactional skill 290 to execute the user input. The user-provided indication may be an audible indication or a tactile indication (e.g., activation of a virtual button or input of text via a virtual keyboard). In response to receiving the user-provided indication, the system may provide the transactional skill 290 with data corresponding to the indication. In response, the transactional skill 290 may execute the command (e.g., book a flight, book a train ticket, etc.). Thus, while the system may not further engage an informational skill 290 after the informational skill 290 provides the post-NLU ranker 265 with result data 430, the system may further engage a transactional skill 290 after the transactional skill 290 provides the post-NLU ranker 265 with result data 430 indicating the transactional skill 290 may execute the user input, see col. 35 lines 1-25.
It would have been obvious to one of ordinary skill in the art to combine the Casado invention with the teachings of Roy for the benefit of improving human-machine interactions, see col. 1 lines 22-25.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Pertinent prior art is available on form 892.
James ‘657 teaches biasing voice to text conversion using context sensitive data provided by a third party, see abstract.
Sadkin ‘646 teaches storing transcribed text in a memory of a computer, and using it to determine a match between it and a word hypothesis from input speech, see abstract.
Cooper ‘671 teaches systems for analyzing voice recognition results which lists the analyzed results and uses user information to make changes to the lists, see abstract.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Michael Ortiz-Sanchez whose telephone number is (571) 270-3711. The examiner can normally be reached Monday-Friday, 9AM-6PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached at 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/MICHAEL ORTIZ-SANCHEZ/
Primary Examiner, Art Unit 2656