DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Election/Restrictions
Claims 1 to 4 are withdrawn from further consideration pursuant to 37 CFR 1.142(b), as being drawn to a nonelected invention, there being no allowable generic or linking claim.  Applicants timely traversed the restriction (election) requirement in the reply filed on 11 December 2020.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 5, 10, 13, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Sakai (U.S. Patent Publication 2002/0055843) in view of Ye et al. (U.S. Patent Publication 2020/0012675).
Concerning independent claims 5 and 13, Sakai discloses a method and system for voice synthesis, comprising:
“receiving first input data representing a vocal characteristic and a request to output content corresponding to the vocal characteristic” – a service provider furnishes 
“processing the first input data to determine vocal characteristic data representing at least the vocal characteristic” – to generate voice synthesis data, voice synthesizer 61 extracts from contents DB 52 data indicating a speaker designated in the order received from customer 3 (“the first input data”), extracts the voice data, i.e., voice quality data D1 and prosody data D2, for this speaker from voice characteristic DB 62 (“to determine vocal characteristic data representing at least the vocal characteristic”), and extracts from contents DB 52 a sentence designated by customer 3 (¶[0054]: Figure 2); 

Concerning independent claims 5 and 13, Sakai discloses that a service provider furnishes a list of multiple speakers via a network to a remote user, and the customer transmits an identity of a speaker that is selected from the list.  (¶[0011])  A customer provides input data through a screen generated by screen data generator 13 from HTTP server 11.  (¶[0041]: Figure 1)  Sakai, then, generally discloses that a user selects a speaker from a list to identify voice characteristics of speech to be synthesized, e.g., via a mouse on a graphic user interface.  However, Sakai omits “performing natural language understanding (NLU) processing using the first input data to determine an intent to perform speech synthesis corresponding to the vocal characteristic represented in a portion of the first input data” and then performing speech synthesis “based at least in part on determining the intent to perform speech synthesis corresponding to the vocal characteristic represented in the portion of the first input data”.  That is, Sakai does not use “natural language understanding (NLU) processing” to “determine an intent” to perform speech synthesis with a selected voice characteristic selected by input data.  Instead, Sakai simply selects a voice characteristic for speech synthesis from a list on a graphical user interface without using “natural language 
Concerning independent claims 5 and 13, Ye et al. teaches an analogous art method and system for processing a voice request and for determining a target multimedia resource requested of a preset multimedia resource library to be played in a voice request.  (Abstract)  An analyzing unit is configured to perform an intent analysis on an acquired voice request to determine the target multimedia resource requested to be played in the voice request.  (¶[0014])  The user may send a request ‘playing a song of Chinese rock style’ or ‘I want to listen to the theme song of Titanic’.  The method of processing a voice request may include performing an intent analysis on the acquired voice request.  Specifically, a semantic analysis may be performed on the text corresponding to the voice request using a natural language processing technology, and the intent of the user sending the voice request is acquired.  (¶[0043] - ¶[0044])  An executing body may analyze the voice request received through a webpage.  The voice request may be converted into a text message, and then analyzed using a natural language processing technology to obtain the intent of the user.  (¶[0080])  Ye et al., then, teaches a known alternative in analogous prior art that a selection to play a multimedia resource can be performed by natural language processing of a voice request including an analysis of intent which is a known alternative to simply selecting a multimedia resource through a user interface of Sakai.  An objective is to provide a smart voice service with artificial intelligence technology to play music in a music resource library where it might be difficult for a voice server to provide a resource Sakai using natural language processing and analysis of an intent as taught by Ye et al. as an art recognized alternative way of selecting a multimedia resource with artificial intelligence when resources are limited.

Concerning claims 10 and 18, Sakai discloses attaching to the voice synthesis data verification data that verifies the contents of the voice synthesis data so that illegal generation or illegal copying of the voice synthesis data can be prevented (¶[0021]); a watermark engine 60 embeds an electronic watermark (verification data) in the voice synthesis data to verify that the voice synthesis data is authenticated, i.e., a permission is obtained from the holder of the voice source right (¶[0056]: Figure 3: Step S4).  Here, producing an electronic watermark verifying that the voice synthesis data is authenticated is “determining identification data that the synthesized data includes a representation of synthesized speech”, adding the watermark to the synthesized speech is “determining modified synthesized speech data by processing the identification data with the synthesized speech data”, and synthesizing the speech with the watermark data is “sending the modified synthesized speech data.”  That is, an electronic watermark verifying that the voice synthesis data is authenticated provides “identification data” of “a representation of synthesized speech.”  Applicants’ Specification, ¶[0045], describes this embodiment as placing a tone or tones outside the frequency range of human hearing to identify the audio as synthesized speech.  

Claims 6, 14, 21, 23, 24, and 26 are rejected under 35 U.S.C. 103 as being unpatentable over Sakai (U.S. Patent Publication 2002/0055843) in view of Ye et al. (U.S. Patent Publication 2020/0012675) as applied to claims 5 and 13 above, and further in view of McDuff et al. (U.S. Patent Publication 2020/0279533).
Concerning claims 6 and 14, Sakai discloses using prosody data to generate synthesized speech, where voice characteristic data comprises voice quality data D1 and prosody data D2.  (¶[0018]; ¶[0051] to ¶[0052]: Figures 2 to 3)  Sakai does not expressly disclose determining that “first input data comprises audio data”, “determining that the audio data represents an utterance”, and processing the prosody data “using a trained model” to determine a portion of the vocal characteristic data.  That is, Sakai omits that the input data comprises “audio data” in a request and a conventional voice activity detector to “determine that the audio data represents an utterance”, where the prosody data is processed using “a trained model”.  However, McDuff et al. teaches a linguistic style matching agent to match the speech and facial expressions of a user.  (Abstract)  Generally, Applicants’ claim language appears to encompass two distinct embodiments for selecting a vocal characteristic for speech synthesis with first input data: (1) an express request in the first input data to output content with a vocal characteristic, e.g., via a natural language description of how to synthesize the speech, or (2) an implied request that uses speech of the user as the first input data to select a vocal characteristic that is similar to or corresponds to a speech of the user.  Here, McDuff et al. teaches the latter embodiment.  Specifically, McDuff et al. teaches
“determining that the audio data represents an utterance” – voice activity recognizer 204 processes microphone input 202 to extract voiced segments (¶[0027]: 
“processing the audio data to determine prosody data” – output from voice activity recognizer 204 is provided to a prosody recognizer 208 that performs paralinguistic parameter recognition on audio segments containing voice activity (¶[0029]: Figure 2);
“processing the prosody data using a trained model to determine a portion of the vocal characteristic data” – microphone input 202 that corresponds to voice activity is passed to speech recognizer 206, which may use a deep feedforward neural network or a recurrent neural network (¶[0028]: Figure 2); paralinguistic parameters extracted by voice activity recognizer 204 may include speech rise, fundamental frequency (f0), which is perceived as pitch, root mean squared (RMS) energy, and speech energy (“the vocal characteristic data”) (¶[0029]: Figure 2); Figure 2 illustrates that prosody recognizer 208 is a neural network (“a trained model”) similar to a neural network of speech recognizer 206.  An objective is to provide a voice interface that exhibits similar social behavior to humans that is not robotic or unnatural so as to not disappoint users.  (¶[0003])  It would have been obvious to one having ordinary skill in the art to determine that audio data represents an utterance and to process prosody data using a trained model as taught by McDuff et al. to provide speech synthesis based on a voice characteristic chosen by a customer in Sakai for a purpose of producing a voice interface that exhibits similar social behavior to humans that is not robotic or unnatural.

 
McDuff et al. teaches linguistic style matching to match the speech expression of a user (Abstract); dialogue manager 216 attempts to adjust the content of an utterance in order to more closely match the conversational style of user 102 (“the vocal characteristics to be determined to match a characteristic of the speech”) (¶[0039]: Figure 2); a custom intent recognizer 214 recognizes intents in speech identified by speech recognizer 206 (“determining the intent”); intent recognition identifies one or more intents in natural language; an intent may be a ‘goal’ of user 102, e.g., booking a flight or finding out when a package will be delivered (¶[0035]: Figure 2).
Concerning claims 23 and 26, McDuff et al. teaches that a conversational agent may be embodied with a face that may match the facial expressions of a user (Abstract); a custom intent recognizer 214 recognizes intents in speech identified by speech recognizer 206; intent recognition identifies one or more intents in natural language; an intent may be a ‘goal’ of user 102, e.g., booking a flight or finding out when a package will be delivered (¶[0035]: Figure 2); camera 306 captures images including images of a user 102 (“wherein the portion of the first input data comprises image data associated with the vocal characteristics”) (¶[0050]: Figure 4); conversational agent 302 may ‘mimic’ the facial expression and head pose of user 102; understanding of user’s 102 facial expressions and head pose begins with video input 410 captured by camera 106 (¶[0072]: Figure 4); facial expression recognizer 416 may return probabilities for several possible emotions including anger, disgust, fear, joy, sadness, surprise, and neutral; an emotion identified by facial expression recognizer 416 may be provided to conversational style manager 402 to modify the utterance of conversational agent 302 (¶[0075] - ¶[0076]: Figure 4); here, “the intent indicates the . 

Claims 11, 19, 22, and 25 are rejected under 35 U.S.C. 103 as being unpatentable over Sakai (U.S. Patent Publication 2002/0055843) in view of Ye et al. (U.S. Patent Publication 2020/0012675) as applied to claims 5 and 13, and further in view of Hirai (U.S. Patent No. 6,334,104).
Concerning claims 11 and 19, Sakai does not expressly disclose that “the first input data comprises a description of the vocal characteristic” and “determine a second vocal characteristic that is different from the description, wherein the vocal characteristic data further represents the second vocal characteristic.”  Generally, Sakai presupposes that there is a database of voice characteristics for a plurality of speakers, e.g., a celebrity, politician, or character appearing on a television program.  Sakai, then, discloses “a second vocal characteristic” for first and second speakers.  However, Sakai does not provide “a description” of a vocal characteristic as “the first input data”.  
Concerning claims 11 and 19, Hirai teaches a sound effects affixing device which enables sound effects to be affixed in relation to inputted sentences automatically.  (Abstract)  A natural language processing unit 1040 analyzes sentences, a characters characteristics extraction unit 1060 extracts characteristics of the characters who appear in the inputted sentences, and a speech synthesizing unit 1090 synthesizes speech using characteristics of the characters.  (Column 1, Lines 19 to 34: Figure 1)  Onomatopoeias, sound source names, and subjective words of sentences are obtained to select sound effects corresponding thereto.  (Column 2, Lines 35 to 38)  A sound Hirai, then, teaches an analogous way of using natural language processing to determine descriptions of sounds to be synthesized.  An objective is to faithfully affix sound effects to sound representations within text documents that is capable of being processed in a short time.  (Column 2, Lines 20 to 31)  It would have been obvious to one having ordinary skill in the art to provide a plurality of voice characteristics in Sakai with natural language processing of input data comprising a description of a characteristic as taught by Hirai for a purpose of affixing effects to sound representations that is capable of being processed in a short time.
Concerning claims 22 and 25, Hirai teaches “wherein the portion of the first input data comprises a description of the [vocal] characteristic”, and using natural language processing.  (Column 12, Lines 13 to 39: Figures 4 and 7 to 8)  Ye et al. teaches using .  

Allowable Subject Matter
Claims 8 to 9, 12, 16 to 17, 20, and 27 to 28 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
 The following is a statement of reasons for the indication of allowable subject matter:  
Concerning claims 8 and 16, the prior art of record does not appear to disclose or reasonably suggest determining using a natural language understanding component that third input data lacks a description of a vocal characteristic, and causing an indication of a request for the vocal characteristic to be sent to a local device.  Here, the prior art of record does not appear to use a natural language understanding component to make a determination that third input data lacks a description of the vocal characteristic.  Applicants’ Specification, ¶[0017] - ¶[0021] and ¶[0048] - ¶[0051], describes an embodiment of natural language descriptions of vocal characteristics, e.g. coarseness or speed, or ‘sounds like a professor’ or ‘childlike’ that may be determined by a natural language understanding component.  The prior art of record does not appear to disclose or reasonably suggest using natural language understanding to analyze descriptive labels to cause a request for vocal characteristics to be sent to a local device.
McDuff et al., the second encoder does not appear to be directed to processing second input data corresponding to a speech synthesis task.  Applicants’ Specification, ¶[0044] and ¶[0059]: Figures 3B and 7: Steps 734 to 736, appears to describe receiving second input data corresponding to a speech synthesis task and processing second text data using an encoder, where an encoder can be represented by a semantic encoder.  
Concerning claims 27 to 28, the prior art of record does not appear to disclose or reasonably suggest processing using a trained model the vocal characteristic data to determine a model weight and processing second input data corresponding to a speech synthesis task using an encoder.  Even if first and second encoders with a trained model and model weights are disclosed by McDuff et al., the second encoder does not appear to be directed to processing second input data corresponding to a speech synthesis task.  Applicants’ Specification, ¶[0044] and ¶[0059]: Figures 3B and 7: Steps 734 to 736, appears to describe receiving second input data corresponding to a speech synthesis task and processing second text data using an encoder, where an encoder can be represented by a semantic encoder.                                                                                                                                                                                




Response to Arguments
Applicants’ arguments filed 12 October 2021 have been considered but are moot in view of new grounds of rejection, as necessitated by amendment.
Applicants provide some significant amendments to independent claims 5 and 13, where they delete limitations directed to “processing, using a trained model, the vocal characteristics to determine a model weight”, “receiving second input data corresponding to a speech synthesis task”, “processing, using an encoder, the second input data to determine encoded data”, and “processing, using a decoder and the model weight, the encoded data”.  Applicants add only a new limitation directed to receiving “a request to output content corresponding to the vocal characteristic”.  Applicants argue that this amendment overcomes the prior rejection of these independent claims as being obvious under 35 U.S.C. §103 over Chae (U.S. Patent Publication 2020/0005764) in view of McDuff et al. (U.S. Patent Publication 2020/0279533).  Specifically, Applicants note a statement in the Office Action that the claim language is not directed to first input data that is a command or first input data that is text data that is a command for an artificial agent to speak in a particular way.  Applicants point to the Specification as describing that first input data can be a request to ‘sound like a professor’ or that a vocal characteristic determined by natural language understanding can include terms of ‘distinguished’ or ‘generate speech that sounds like this: “Hasta la vista, baby”’ to include an accent of Arnold Schwarzenegger.  Applicants provide amendments to some of the dependent claims and add new dependent claims corresponding to some of the limitations deleted from the independent claims.
Sakai (U.S. Patent Publication 2002/0055843) in view of Ye et al. (U.S. Patent Publication 2020/0012675).  Generally, Applicants’ amendments have in some ways significantly broadened the scope of these independent claims.  However, “receiving first input data representing . . . a request to output content corresponding to the vocal characteristic” does refocus the invention in a direction that requires the first input data to include a ‘request’ instead of simply enabling speech synthesis to match a style of speech represented by speech of a user as first input data in McDuff et al.  The claim language, then, appears to expressly require that first input data includes a request to perform speech synthesis according to a vocal characteristic.  Applicants’ Specification, ¶[0018] and ¶[0048], does support this interpretation as a natural-language ‘description’ of a speaking style.  Still, this embodiment is not always completely consistent with what is required by at least some of the dependent claims, namely dependent claims 23 and 26, which are directed to the first input data being image data.  The rejection of some of the dependent claims continues to rely upon McDuff et al.  The rejection no longer relies upon Chae.  New grounds of rejection are set forth as to some of the dependent claims being obvious in view of Hirai (U.S. Patent No. 6,334,104).
Generally, Sakai and Ye et al. are maintained to render obvious a basic idea of the invention as set forth by the independent claims as amended.  Here, Sakai discloses that a user selects a voice characteristic of a speaker for voice synthesis by submitting a request for an identity of a speaker from a pre-determined list of speakers presented on a screen.  Specifically, Sakai is directed to submitting the request through Ye et al., then, teaches natural language processing to determine an intent in analogous art for a purpose of selecting a multimedia resource from a multimedia resource library according to a voice request.  One skilled in the art could see how an analogous way of selecting a multimedia resource of Ye et al. could instead be used to select a voice characteristic for speech synthesis in Sakai.  That is, it is only a matter of (A) Combining prior art elements according to known methods to yield predictable results; (C) Use of known technique to improve similar devices (methods, or products) in the same way; or (D) Applying a known technique to a known device (method, or product) ready for improvement to yield predictable results, to obtain the invention from what is disclosed and taught by Sakai and Ye et al. in accordance with KSR Int'l Co. v. Teleflex Inc., 550 U.S. 398, 415-421, 82 USPQ2d 1385, 1395-97 (2007).
New allowable subject matter is indicated for claims 8 to 9, 12, 16 to 17, 20, and 27 to 28.  Claims 1 to 4 remain withdrawn subject to a restriction requirement.
These new grounds of rejection are necessitated by amendment.  This Office Action is NON-FINAL.  

Conclusion
The prior art made of record and not relied upon is considered pertinent to Applicants’ disclosure.
Yun et al. and McCuller disclose related prior art. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to MARTIN LERNER whose telephone number is (571) 272-7608.  The examiner can normally be reached on Monday-Thursday 8:30 AM-6:00 PM.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel Washburn can be reached on (571) 272-5551.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair.  Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). 






/MARTIN LERNER/
Primary Examiner
Art Unit 2657
October 21, 2021