DETAILED ACTION
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 06/20/2022 has been entered.
This communication is in response to the Amendments and Arguments filed on 05/24/2022.
Claims 1, 3-8, 10-15, 17-23, and 25 are pending and have been examined.
All previous objections/rejections not mentioned in this Office Action have been withdrawn by the examiner. 
Notice of Pre-AIA or AIA Status
The present application is being examined under the pre-AIA first to invent provisions.
Response to Arguments
Applicant’s arguments filed 05/24/2022 regarding the 35 U.S.C. 101 rejections have been fully considered and are persuasive. The rejections have, therefore, been withdrawn.
Applicant's arguments with respect to claim(s) 1, 8, and 15 have been fully considered but they are not persuasive. Applicant asserts on page 10 that Nicolis does not suggest aligning phonetic units of a text prompt that do not correspond to paralinguistic features with an audio recording based on the paralinguistic features. The Examiner respectfully disagrees with this assertion. In response to applicant's argument that the references fail to show certain features of applicant’s invention, it is noted that the features upon which applicant relies (i.e., aligning the phonetic units based on the paralinguistic features) are not recited in the rejected claim(s). Although the claims are interpreted in light of the specification, limitations from the specification are not read into the claims. See In re Van Geuns, 988 F.2d 1181, 26 USPQ2d 1057 (Fed. Cir. 1993). Further, Nicolis teaches that the system determines portions of text that correspond to an inflection within audio data based on a location of the word and the time of the inflected portion of audio. Additionally, Nicolis teaches that paralinguistic features are recognized as qualities or features other than words spoken; therefore, the determined correspondence of text to audio based on words excludes phonetic units corresponding to paralinguistic components (see Nicolis (11:28-41),(23:40-24:3)).
Hence, Applicant’s arguments are not persuasive.
Applicant’s arguments with respect to claim(s) 22 and 23 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument. Please see the new mappings additionally citing Nicolis for further detail.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 1, 5-8, 12-15, 19-23, and 25 is/are rejected under 35 U.S.C. 103 as being unpatentable over Wang et al. (US PG Pub No. 2010/0312565), hereinafter Wang, in view of Nicolis et al. (U.S. Patent No. 10319365), hereinafter Nicolis.

Regarding claims 1, 8, and 15, Wang teaches
(claim 1) A processor-implemented method for customizing the rendering of a synthesized speech prompt (a computer-implemented process, i.e. processor-implemented method [0021:1-4]), the method comprising:
(claim 8) A computer system for customizing the rendering of a synthesized speech prompt (a computing system [0021]), the computer system comprising:
(claim 8) one or more processors, one or more computer-readable memories, one or more computer-readable tangible storage medium, and program instructions stored on at least one of the one or more tangible storage medium for execution by at least one of the one or more processors via at least one of the one or more memories (embodiment implemented as a computer storage medium implemented as one or more memory or hard drive, or comparable media, i.e. one or more computer-readable memories, one or more computer-readable tangible storage medium, where the storage medium encodes a computer program comprising instructions readable by a computer system, i.e. program instructions stored on at least one of the one or more tangible storage medium, for causing a computer or computing system to perform processes, where the system can include a multiprocessor system, i.e. one or more processors… execution by at least one of the one or more processors via at least one of the one or more memories [0020-1]), wherein the computer system is capable of performing a method comprising:
(claim 15) A computer program product for customizing the rendering of a synthesized speech prompt (a computer program product [0021]), the computer program product comprising:
(claim 15) one or more computer-readable tangible storage medium and program instructions stored on at least one of the one or more tangible storage medium, the program instructions executable by a processor to cause the processor to perform a method comprising (embodiment implemented as a computer storage medium implemented as one or more memory or hard drive, or comparable media, i.e. one or more computer-readable tangible storage medium, where the storage medium encodes a computer program comprising instructions readable by a computer system, i.e. program instructions stored on at least one of the one or more tangible storage medium, for causing a computer or computing system to perform processes, where the system can include a multiprocessor system, i.e. program instructions executable by a processor to cause the processor to perform a method [0020-1]):
extracting a plurality of prosodic ... from a received audio recording of a prompt (the user is enabled to speak a desired output, which can be received through a voice input device, i.e. received audio recording of a prompt [0037],[0060:22-25], where key acoustic information, such as pitch variation, duration, and energy of each phoneme, is extracted from the user’s own voice, i.e. extracting a plurality of prosodic information [0033],[0037]);
separating a received text transcript of the prompt into phonetic units (the user may be enabled to input the text for the prompt, i.e. received text transcript of the prompt [0039], and processing may be performed on the text, where the text is divided and marked into prosodic units, such as text-to-phoneme conversion, i.e. separating...text...into phonetic units [0025]);
aligning the phonetic units ... with the audio recording based on the prosodic information (acoustic units from the user’s own voice, i.e. audio recording...based on the prosodic information, where each word broken down into phonemes is linked to the corresponding acoustic units, i.e. aligning the phonetic units...with the audio recording Fig. 4,[0037-41],[0049]); and
synthesizing, by a text-to-speech engine, speech for the prompt based upon the aligning, the plurality of prosodic information, ... (the key acoustic information, i.e. plurality of prosodic information, of the desired output, i.e. the prompt, and the acoustic units linked to the words broken down into phonemes, i.e. the aligning, are used to guide the text-to-speech engine, i.e. by a text-to-speech engine, in generating the synthesized voice with similar prosody, i.e. synthesizing speech for the prompt [0032-3],[0037]).
While Wang provides the extraction of, and synthesis of speech using, key acoustic information, Wang does not specifically teach that the acoustic information can include paralinguistic components, and thus does not teach
extracting a plurality of prosodic information and at least one paralinguistic component from a received audio...;
aligning the phonetic units that do not correspond to the at least one paralinguistic component with the audio recording based on the prosodic information; and 
synthesizing, by a text-to-speech engine, speech for the prompt based upon the aligning, the plurality of prosodic information, and the at least one paralinguistic component.
Nicolis, however, teaches extracting a plurality of prosodic information and at least one paralinguistic component from a received audio... (the present system may detect speech qualities in an utterance in the speech, i.e. extracting...from a received audio, where the qualities may include paralinguistic features including acoustic features such as prosody, pitch, speed, and energy, i.e. a plurality of prosodic information, as well as whether the speech includes a cough, sneeze, laugh, or other non-speech articulation, i.e. at least one paralinguistic component (11:28-41));
aligning the phonetic units that do not correspond to the at least one paralinguistic component with the audio recording based on the prosodic information (the system may determine the portion of the post-ASR text, i.e. aligning the phonetic units, that corresponds to an inflection within the audio data, i.e. audio recording, based on the location of the word and the time of the inflected portion determined through prosodic analysis, i.e. based on the prosodic information, where the words of the text are tagged for emphasis based on the audio data matching the text, and paralinguistic features are recognized as qualities or features other than the words spoken, i.e. that do not correspond to the at least one paralinguistic component (11:28-41),(23:40-24:3)); and 
synthesizing, by a text-to-speech engine, speech for the prompt based upon the aligning, the plurality of prosodic information, and the at least one paralinguistic component (the output text, i.e. prompt, is to be processed by a TTS engine and output as synthesized speech, i.e. synthesizing, by a text-to-speech engine, speech for the prompt (11:15-17), where the speech quality module may process the incoming audio data to determine certain characteristics, to classify one or more qualities of the speech, i.e. plurality of prosodic information, and the at least one paralinguistic component, and then alter the output operation in response to the one or more qualities, such as changing the synthesized output to match the way the input was said, i.e. based upon the plurality of prosodic information and the at least one paralinguistic component, and using the tags from matching audio data to the text for synthesis of text inflection or emphasis, i.e. aligning (5:13-35),(11:28-63),(23:38-24:20)).
Wang and Nicolis are analogous art because they are from a similar field of endeavor in synthesizing realistic TTS voices. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the extraction of, and synthesis of speech using, key acoustic information teachings of Wang with the further recognition and use of paralinguistic components as taught by Nicolis. It would have been obvious to combine the references to enable the selection of TTS pronunciation using input speech, including non-speech articulation commonly ignored by ASR systems, to result in a more exciting and realistic audio output and, therefore, better user experience (Nicolis (5:31-35,62-65),(11:49-57)).

Regarding claims 5, 12, and 19, Wang in view of Nicolis teaches claims 1, 8, and 15, and Wang further teaches
associating the prosodic information with the prompt by means of a unique customization identification (prompts may be processed as sessions, which can be named and saved, i.e. unique customization identification [0038-9], and the system stores sessions and TTS related information, such as prosody information used by the TTS engine core, i.e. associating the prosodic information with the prompt Figs. 4 and 6,[0033-4],[0051],[0055]). 

Regarding claims 6, 13, and 20, Wang in view of Nicolis teaches claims 5, 12, and 19, and Wang further teaches
the unique identification further comprises a context of the received audio recording (prompts may be processed as sessions, which can be named and saved, including a prompt type, i.e. unique customization identification further comprises a context [0038-9], where the user is enabled to speak the desired output for recording by the tool, i.e. received audio recording [0037]).  

Regarding claims 7, 14, and 21, Wang in view of Nicolis teaches claims 1, 8, and 15, and Wang further teaches 
adapting the prosodic information to match a text-to-speech voice (the prosody information, i.e. prosodic information, may be received by the TTS engine core, and the wave synthesizer may use prosody adjustment to save a voice font file for use by the TTS engine core, i.e. adapting…to match a text-to-speech voice [0033-4]).  

Regarding claim 22, Wang teaches
A method for synthesizing speech for a customized prompt, the method comprising (a computer-implemented process, i.e. method [0021:1-4]):
extracting stored prosodic information for the customized prompt corresponding with a received customization identification (prompts may be processed as sessions, which can be named and saved, i.e. received unique customization identification [0038-9], and the system stores sessions and TTS related information, such as prosody information used by the TTS engine core, i.e. extracting stored prosodic information for the customized prompt Figs. 4 and 6,[0033-4],[0051],[0055]);
synthesizing, by a text-to-speech engine, speech for the prompt based on the extracted prosodic information (the TTS engine core, i.e. text-to-speech-engine, may use saved prosody information, such as binary data of a voice font, i.e. based on the extracted prosodic information, to generate the synthesized voice with similar prosody, i.e. synthesizing...speech for the prompt [0032-3],[0037]).  
While Wang provides a user customization of the prosody of the output, Wang does not specifically teach a user-selected voice, and thus does not teach
adapting the prosodic information to match a user-selected particular text-to-speech voice.
Nicolis, however, teaches adapting the prosodic information to match a user-selected particular text-to-speech voice (the TTS storage may be customized for an individual user based on their individualized desired speech output, such as a speech output having a specific gender or accent, i.e. user-selected text-to-speech voice, as well as tagging the text with other detected qualities such as energy, volume, or speed, i.e. adapting the prosodic information, where a filter may be used to alter the TTS output to match the desired speech qualities, i.e. adapting...to match (11:28-41),(21:34-44),(22:20-28)).
Wang and Nicolis are analogous art because they are from a similar field of endeavor in synthesizing realistic TTS voices. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the extraction of, and synthesis of speech using, key acoustic information teachings of Wang with the customization of a TTS voice based on the desired speech output of the user as taught by Nicolis. It would have been obvious to combine the references to enable speech synthesis models to account for user preferences (Nicolis (21:27-44)).

Regarding claim 23, Wang teaches
A method for extracting prosodic information from an audio recording of a prompt (a computer-implemented process, i.e. method [0021:1-4]), the method comprising:
parsing the plurality of received text corresponding with the prompt into one or more phonetic units (the user may be enabled to input the text for the prompt, i.e. plurality of received text corresponding with the prompt [0039], and processing may be performed on the text, where the text is divided and marked into prosodic units, such as text-to-phoneme conversion, i.e. parsing...into one or more phonetic units [0025]);
aligning the phonetic units ... with the audio recording (acoustic units from the user’s own voice, i.e. audio recording...based on the prosodic information, where each word broken down into phonemes is linked to the corresponding acoustic units, i.e. aligning the phonetic units...with the audio recording Fig. 4,[0037-41],[0049]); and
calculating, based on the alignment, one or more prosodic values for at least one of the plurality of phonetic units (the key acoustic information, such as pitch variation, duration, and energy, i.e. one or more prosodic values, is extracted from the recorded input speech for each phoneme, i.e. calculating, based on the alignment…for at least one of the plurality of phonetic units [0037-41]); and
converting, by a text to speech engine, the one or more prosodic values into speech (the key acoustic information, i.e. one or more prosodic values, of the desired output, and the acoustic units linked to the words broken down into phonemes, are used to guide the text-to-speech engine, i.e. by a text to speech engine, in generating the synthesized voice with similar prosody, i.e. converting...into speech [0032-3],[0037]).
While Wang provides the extraction of, and synthesis of speech using, key acoustic information, Wang does not specifically teach that the acoustic information can include paralinguistic components, and thus does not teach
aligning the phonetic units that do not correspond to at least one identified paralinguistic component with the audio recording;
calculating, based on the alignment, one or more prosodic values for at least one of the plurality of phonetic units.
Nicolis, however, teaches aligning the phonetic units that do not correspond to at least one identified paralinguistic component with the audio recording (the system may detect speech qualities in an utterance in the speech, where the qualities may include a cough, sneeze, laugh, or other non-speech articulation, i.e. at least one identified paralinguistic component (11:28-41), and the system may determine the portion of the post-ASR text, i.e. phonetic units, that corresponds to an inflection within the audio data, i.e. audio recording, based on the location of the word and the time of the inflected portion determined through prosodic analysis, where the words of the text are tagged for emphasis based on the audio data matching the text, i.e. aligning the phonetic units, and paralinguistic features are recognized as qualities or features other than the words spoken, i.e. that do not correspond to the at least one identified paralinguistic component (11:28-41),(23:40-24:3));
calculating, based on the alignment, one or more prosodic values for at least one of the plurality of phonetic units (the system may determine the portion of the post-ASR text that corresponds to an inflection within the audio data, based on the location of the word and the time of the inflected portion determined through prosodic analysis, i.e. calculating...one or more prosodic values for at least one of the plurality of phonetic units, where the words of the text are tagged for emphasis based on the audio data matching the text, i.e. based on the alignment (23:40-24:3)).
Wang and Nicolis are analogous art because they are from a similar field of endeavor in synthesizing realistic TTS voices. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the extraction of, and synthesis of speech using, key acoustic information teachings of Wang with the further recognition and use of paralinguistic components as taught by Nicolis. It would have been obvious to combine the references to enable the selection of TTS pronunciation using input speech, including non-speech articulation commonly ignored by ASR systems, to result in a more exciting and realistic audio output and, therefore, better user experience (Nicolis (5:31-35,62-65),(11:49-57)).

Regarding claim 25, Wang in view of Nicolis teaches claim 23, and Wang further teaches
wherein the one or more prosodic values enumerate one or more prosodic qualities of a phonetic unit selected from a list consisting of (the key acoustic information, i.e. one or more prosodic values, is extracted for each phoneme, i.e. enumerate one or more prosodic qualities of a phonetic unit [0037]):
a duration, a starting pitch, an ending pitch, a volume, and an additional speech feature (key acoustic information includes pitch variation, i.e. an additional speech feature, duration, and energy, i.e. volume, of each phoneme [0037]).

Claim(s) 3, 10, and 17 is/are rejected under 35 U.S.C. 103 as being unpatentable over Wang, in view of Nicolis, and further in view of Acker et al. (U.S. Patent No. 8918322), as found in the IDS, hereinafter Acker.

Regarding claims 3, 10, and 17, Wang in view of Nicolis teaches claims 1, 8, and 15. 
While Wang in view of Nicolis provides the recognition of a user speaking the prompt and receiving text input of the prompt to be synthesized, Wang in view of Nicolis does not specifically teach recognizing part of the prompt as fixed and part of the prompt as dynamic, and thus does not teach
identifying at least one subset of the prompt as dynamic and at least one subset of the prompt as fixed.  
Acker, however, teaches identifying at least one subset of the prompt as dynamic and at least one subset of the prompt as fixed (the data representing the textual message, i.e. prompt, may comprise a variable portion, i.e. identifying at least one subset…as dynamic, and may further include a fixed portion, i.e. identifying… at least one subset of the prompt as fixed (3:19-24)).  
Wang, Nicolis, and Acker are analogous art because they are from a similar field of endeavor in synthesizing realistic TTS voices. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the recognition of a user speaking the prompt and receiving text input of the prompt to be synthesized teachings of Wang, as modified by Nicolis, with the recognition of parts of the message as being variable and fixed as taught by Acker. It would have been obvious to combine the references to enable the use of a combination of stored speech information with speech data converted in real-time (Acker (3:24-30)).

Claim(s) 4, 11, and 18 is/are rejected under 35 U.S.C. 103 as being unpatentable over Wang, in view of Nicolis, in view of Acker, and further in view of Bakis et al. (U.S. PG Pub No. 2014/0058734), as found in the IDS, hereinafter Bakis.

Regarding claims 4, 11, and 18, Wang in view of Nicolis and Acker teaches claims 3, 10, and 17.
While Wang in view of Nicolis and Acker provides the recognition of fixed versus variable portions of a prompt, Wang in view of Nicolis and Acker does not specifically teach the tuning of only the fixed portion using prosody, and thus does not teach
extracting a plurality of prosodic information is performed only on the subset of the audio recording that corresponds to the fixed subset of the plurality of text.  
Bakis, however, teaches extracting a plurality of prosodic information is performed only on the subset of the audio recording that corresponds to the fixed subset of the plurality of text (a partial prompt, which can be in text format, such as the statement “your flight will be departing at”, i.e. fixed subset of the plurality of text [0031], can be tuned, where tuning can include the user specifying a sample recording to determine prosody of the synthesis, and the user can modify the prosodic targets of sections of audio that are of interest and specify speech segments that are not to be used, i.e. extracting a plurality of prosodic information is performed only on the subset of the audio recording [0028],[0033]).
Wang, Nicolis, Acker, and Bakis are analogous art because they are from a similar field of endeavor in synthesizing realistic TTS voices. Thus, it would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the recognition of fixed versus variable portions of a prompt teachings of Wang, as modified by Nicolis and Acker, with the use of specific sections of audio and tuning of only partial prompts as taught by Bakis. It would have been obvious to combine the references to enable users to have a greater control in how a prompt is synthesized (Bakis [0032]).
Conclusion
	
Any inquiry concerning this communication or earlier communications from the examiner should be directed to NICOLE A K SCHMIEDER whose telephone number is (571)270-1474. The examiner can normally be reached 8:00 - 5:00 M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre-Louis Desir, can be reached at (571) 272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/NICOLE A K SCHMIEDER/ Examiner, Art Unit 2659

/PIERRE LOUIS DESIR/ Supervisory Patent Examiner, Art Unit 2659