Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

DETAILED ACTION
This office action is in response to application 16/712,567, which was filed 12/12/19. In a preliminary amendment filed 10/27/20, Applicant amended claims 1-3, 5-12, and 14-20. Claims 1-20 are pending in the application and have been considered.

Specification
The title of the invention is not descriptive. A new title is required that is clearly indicative of the invention to which the claims are directed.
The following title is suggested: Speech Processing using First and Second Embedding Data Representing Speech Characteristics.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1, 2, 4-6, and 12-15 are rejected under 35 U.S.C. 103 as being unpatentable over Stanton et al. (2021/0035551) in view of Flores et al. (2017/0244834). Stanton qualifies as prior art under 35 U.S.C. 102(a)(2) because it claims priority to provisional application 62/822,511, filed on 08/03/19, which is before Applicant’s effective filing date of 12/12/19.

Consider claim 1, Stanton discloses a computer-implemented method comprising: determining, using a user device, first audio data corresponding to an utterance (a reference audio signal 402, [0063]); processing, using a feature extraction component, the first audio data to determine first embedding data representing first vocal characteristics of a user who spoke the utterance (prosody embedding, [0063], Fig 4); processing, using a feature conversion component, the first embedding data to determine second embedding data representing second vocal characteristics representing a synthesized voice (the prosody embedding output from the reference encoder 400 is used for style embedding 550, Fig 5A, [0066]); sending, to a remote system, the second embedding data (to TTS model 650, Fig 5A).
Stanton does not specifically mention receiving, from the remote system, user identification data corresponding to the user; and processing the first audio data and the user identification data to determine a response to the utterance. 
Flores discloses receiving, from a remote system, user identification data corresponding to a user (receiving a call from and identifying a user, [0035]-[0037], noting that the IVR system can be a plurality of linked servers, [0071], and determining whether the user has a profile, [0041], [0043]); and processing the first audio data and the user identification data to determine a response to the utterance (generating a TTS response using the user input and profile information, [0049]).


Consider claim 4, Stanton discloses a computer-implemented method comprising: determining, using a user device, audio data corresponding to an utterance (a reference audio signal 402, [0063]); processing the audio data to determine first embedding data representing first audio characteristics of the utterance (prosody embedding, [0063], Fig 4); processing the first embedding data to determine second embedding data representing second audio characteristics corresponding to synthesized speech processing (the prosody embedding output from the reference encoder 400 is used for style embedding 550, Fig 5A, [0066]); sending, to a remote system, the second embedding data (to TTS model 650, Fig 5A).
Stanton does not specifically mention receiving, from the remote system, data corresponding to a user who spoke the utterance. 
Flores discloses receiving, from the remote system, data corresponding to a user who spoke the utterance (identifying and accessing the user profile 130 for the user 150, [0043]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Stanton by receiving, from the remote system, data corresponding to a user who spoke the utterance for reasons similar to those for claim 1.

Consider claim 13, Stanton discloses a system comprising: at least one processor (processor, [0090]); and at least one memory including instructions (memory, [0094]) that, when executed by the at 
Stanton does not specifically mention receiving, from the remote system, data corresponding to a user who spoke the utterance. 
Flores discloses receiving, from the remote system, data corresponding to a user who spoke the utterance (identifying and accessing the user profile 130 for the user 150, [0043]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Stanton by receiving, from the remote system, data corresponding to a user who spoke the utterance for reasons similar to those for claim 1.

Consider claim 2, Stanton does not, but Flores discloses: outputting, using the user device, a first prompt requesting that the user say something (“please say yes or no”, [0049-0051]); determining, using the user device, third audio data corresponding to a first representation of a word the user said (using speech recognition, [0040]); outputting, using the user device, a second prompt requesting that the user say the same thing again (repeatedly prompting the user “please say yes or no”, [0049-0051]); determining, using the user device, fourth audio data corresponding to a second representation of the word (using speech recognition, [0040]); and using the third audio data and the fourth audio data by a text-to-speech (TTS) component (generating output TTS content by repeatedly prompting the user “please say yes or no”, [0049-0051], the responses to which are identified using speech recognition, [0040]).

Consider claim 5, Stanton does not, but Flores discloses determining, using the user device, text data corresponding to the utterance (using speech recognition, [0040]); determining, using the text data and the data corresponding to the user, a response to the utterance (generating a TTS response using the user input and profile information, [0049]); and causing, using the user device, an output corresponding to the response (generating a TTS response using the user input and profile information, [0049]). 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Stanton by determining, using the user device, text data corresponding to the utterance; determining, using the text data and the data corresponding to the user, a response to the utterance; and causing, using the user device, an output corresponding to the response for reasons similar to those for claim 1.

Consider claim 6, Stanton discloses: selecting stored text data (input text 104 as received as input, stored, and processed, [0038-0039], [0099]); processing, using a text-to-speech neural network, the stored text data and the second embedding data to determine second audio data (sequence to 

Consider claim 12, Stanton does not, but Flores discloses determining, using the user device, text data corresponding to the utterance (using speech recognition, [0040]); sending, to the remote system, the text data (IVR system components can be a plurality of linked servers, [0071]); receiving, from the remote system, output data representing a response to the utterance (generating a TTS response using the user input and profile information, [0049]); and causing, using the user device, an output corresponding to the output data (generating a TTS response using the user input and profile information, [0049]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Stanton by determining, using the user device, text data corresponding to the utterance; sending, to the remote system, the text data; receiving, from the remote system, output data representing a response to the utterance; and causing, using the user device, an output corresponding to the output data for reasons similar to those for claim 1.

Consider claim 14, Stanton does not, but Flores discloses determining, using the user device, text data corresponding to the utterance (using speech recognition, [0040]); determining, using the text data and the data corresponding to the user, a response to the utterance (generating a TTS response using the user input and profile information, [0049]); and causing, using the user device, an output corresponding to the response (generating a TTS response using the user input and profile information, [0049]). 

Consider claim 15, Stanton discloses: selecting stored text data (input text 104 as received as input, stored, and processed, [0038-0039], [0099]); processing, using a text-to-speech neural network, the stored text data and the second embedding data to determine second audio data (sequence to sequence neural network, [0039]); and sending, to the remote system, the second audio data (user may input text, a remote search engine fetches a response to be synthesized into expressive speech for output from the computing device, [0088]).

Claims 7 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Stanton et al. (2021/0035551) in view of Flores et al. (2017/0244834), in further view of Roblek et al. (2015/0294670).

Consider claim 7, Stanton and Flores do not, but Roblek discloses: prior to processing the audio data, outputting, using the user device, a request to utter a word (prompting a user to speak multiple utterances, [0075]); determining, using the user device, second audio data corresponding to the word (feature scores, [0075]); and processing the second audio data to train a neural network (using the feature scores to train the neural network, [0075]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Stanton and Flores such that prior to processing the audio data, outputting, using the user device, a request to utter a word; determining, using the user 

Consider claim 16, Stanton and Flores do not, but Roblek discloses: prior to processing the audio data, outputting, using the user device, a request to utter a word (prompting a user to speak multiple utterances, [0075]); determining, using the user device, second audio data corresponding to the word (feature scores, [0075]); and processing the second audio data to train a neural network (using the feature scores to train the neural network, [0075]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Stanton and Flores such that prior to processing the audio data, outputting, using the user device, a request to utter a word; determining, using the user device, second audio data corresponding to the word; and processing the second audio data for reasons similar to those for claim 7.

Allowable Subject Matter
Claims 3, 8-11, and 17-20 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Stanton et al. (“Predicting Expressive Speaking Style from Text in End-To-End Speech Synthesis”. arXiv:1808.01410v1 [cs.CL] 4 Aug 2018) appears to be the NPL publication upon 
2021/0097427 Clark discloses training neural networks to generate structured embeddings
2020/0380980 Shum discloses voice identification in digital assistant systems
2019/0392842 Khoury discloses end-to-end speaker recognition using a deep neural network
10,140,973 Dalmia discloses text-to-speech processing using previously speech processed data
7,869,998 Di Fabbrizio discloses a voice-enabled dialog system

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Jesse Pullias whose telephone number is 571/270-5135. The examiner can normally be reached on M-F 8:00 AM - 4:30 PM. The examiner’s fax number is 571/270-6135.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Dan Washburn, can be reached on 571/272-5551.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available 


/Jesse S Pullias/
Primary Examiner, Art Unit 2657                                          07/26/21