DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement
The information disclosure statements (IDS) submitted on November 14, 2019 and May 28, 2020 are being considered by the examiner.

Specification
The disclosure is objected to because of the following informalities: 
In paragraph [0176], “time limit 1210” should be amended to “constraint condition 1210”.  
In paragraph [0182], “time constraint 1210” should be amended to “constraint condition 1210”.
Appropriate correction is required.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –




(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claims 1, 11, and 20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Tachibana (U.S. Pat. App. Pub. No. 2009/0070115, hereinafter Tachibana).

Regarding claim 1, Tachibana discloses An electronic apparatus comprising: (Speech synthesis system including “CPU 204, a main storage (RAM) 206, [and] a hard disk drive (HDD) 208”; Tachibana, ¶ [0042]) a memory configured to store at least one instruction (“The HDD 208 further stores an operating system, a program for generating information related to a location detected by the GPS function or other text data to be speech-synthesized, and a speech synthesis program according to the present invention.”; Tachibana, ¶ [0044]); and a processor configured to execute the at least one instruction stored in the memory, which when executed causes the processor to control to (“the CPU 204 ... that enables the execution of an operating system”; Tachibana, ¶ [0043]): based on obtaining a text input, obtain prosody information of the text input (“a language processing unit 108… obtains the reading (phonemes), accents, and word classes of the input text.” and “in a text analysis result block 124, a reading and accent are assigned to each of the divided words,” where reading and accent is the prosody information of the input text (text input) and reading and accent information are based on each of the divided words of the input text {based on obtaining a text input}; Tachibana, ¶¶ [0026], [0033]), segment the text input into a plurality of segments (“a language processing unit 122 obtains the reading (phonemes), accents, and word classes of the input text {text input}” where the input text “is divided into words {a plurality of segments}” and can include “parsing techniques.”; Tachibana, ¶¶ [0032], [0026]), obtain speech segments in which the prosody information is reflected to each segment of the plurality of segments in parallel (“In a synthesis block 126 by the waveform editing and synthesis unit, typically the following processes are sequentially performed: Obtaining prosody modification values {prosody information} using the prosody model 118; Reading candidates of speech segments from the speech segment database 116; Getting a speech segment sequence; [and] Applying prosody modification appropriately {prosody information being reflected into each segment}” where “the speech segment prosody [can be] smoothed in adjacent speech segments to obtain the final prosody,” thus in parallel; Tachibana, ¶¶ [0034]-[0038], [0053]) by inputting the plurality of segments and the prosody information to a text-to-speech (TTS) module (“a text analysis result block 124, a reading and accent are assigned to each of the divided words,” where the divided words along with the assigned reading and accent {prosody information} are provided to the synthesis block 126 to read “candidates of speech segments from the speech segment database 116”, where the synthesis block 126 is the TTS module of the “text-to-speech synthesis system.”; Tachibana, ¶¶ [0033]-[0034], [0036], [0041]), and obtain a speech for the text input by merging the speech segments. (“Generating synthesized speech by concatenating speech segments,” where generating synthesized speech is obtaining a speech for the text input {speech segments are produced from the input text} and concatenating speech is merging the speech segments; Tachibana, ¶ [0039]).

Regarding claim 11, Tachibana discloses A method of controlling an electronic apparatus, the method comprising (The method described with reference to the speech synthesis system; Tachibana, ¶ [0042]): based on obtaining a text input, obtaining prosody information of the text input (“a language processing unit 108… obtains the reading (phonemes), accents, and word classes of the input text.” and “in a text analysis result block 124, a reading and accent are assigned to each of the divided words,” where reading and accent is the prosody information of the input text (text input) and reading and accent information are based on each of the divided words of the input text {based on obtaining a text input}; Tachibana, ¶¶ [0026], [0033]), segmenting the text input into a plurality of segments (“a language processing unit 122 obtains the reading (phonemes), accents, and word classes of the input text {text input}” where the input text “is divided into words {a plurality of segments}” and can include “parsing techniques.”; Tachibana, ¶¶ [0032], [0026]), obtaining speech segments in which the prosody information is reflected to each segment of the plurality of segments in parallel (“In a synthesis block 126 by the waveform editing and synthesis unit, typically the following processes are sequentially performed: Obtaining prosody modification values {prosody information} using the prosody model 118; Reading candidates of speech segments from the speech segment database 116; Getting a speech segment sequence; [and] Applying prosody modification appropriately {prosody information being reflected into each segment}” where “the speech segment prosody [can be] smoothed in adjacent speech segments to obtain the final prosody,” thus in parallel; Tachibana, ¶¶ [0034]-[0038], [0053]) by inputting the plurality of segments and the prosody information to a text-to-speech (TTS) module (“a text analysis result block 124, a reading and accent are assigned to each of the divided words,” where the divided words along with the assigned reading and accent {prosody information} are provided to the synthesis block 126 to read “candidates of speech segments from the speech segment database 116”, where the synthesis block 126 is the TTS module of the “text-to-speech synthesis system.”; Tachibana, ¶¶ [0033]-[0034], [0036], [0041]); and obtaining a speech for the text input by merging the speech segments (“Generating synthesized speech by concatenating speech segments,” where generating synthesized speech is obtaining a speech for the text input {speech segments are produced from the input text} and concatenating speech is merging the speech segments; Tachibana, ¶ [0039]).

Regarding claim 20, Tachibana discloses A non-transitory computer readable medium having stored thereon a program which when executed causes an electronic apparatus to perform a method of controlling the electronic apparatus, the method comprising (Speech synthesis system including “CPU 204, a main storage (RAM) 206, [and] a hard disk drive (HDD) 208” where “The HDD 208 further stores an operating system, a program for generating information related to a location detected by the GPS function or other text data to be speech-synthesized, and a speech synthesis program according to the present invention” and where “the CPU 204 ... that enables the execution of an operating system”; Tachibana, ¶¶ [0042]-[0044]): based on obtaining a text input, obtaining prosody information of the text input (“a language processing unit 108… obtains the reading (phonemes), accents, and word classes of the input text.” and “in a text analysis result block 124, a reading and accent are assigned to each of the divided words,” where reading and accent is the prosody information of the input text (text input) and reading and accent information are based on each of the divided words of the input text {based on obtaining a text input}; Tachibana, ¶¶ [0026], [0033]), segmenting the text input into a plurality of segments (“a language processing unit 122 obtains the reading (phonemes), accents, and word classes of the input text {text input}” where the input text “is divided into words {a plurality of segments}” and can include “parsing techniques.”; Tachibana, ¶¶ [0032], [0026]), obtaining speech segments in which the prosody information is reflected to each segment of the plurality of segments in parallel (“In a synthesis block 126 by the waveform editing and synthesis unit, typically the following processes are sequentially performed: Obtaining prosody modification values {prosody information} using the prosody model 118; Reading candidates of speech segments from the speech segment database 116; Getting a speech segment sequence; [and] Applying prosody modification appropriately {prosody information being reflected into each segment}” where “the speech segment prosody [can be] smoothed in adjacent speech segments to obtain the final prosody,” thus in parallel; Tachibana, ¶¶ [0034]-[0038], [0053]) by inputting the plurality of segments and the prosody information to a text-to-speech (TTS) module (“a text analysis result block 124, a reading and accent are assigned to each of the divided words,” where the divided words along with the assigned reading and accent {prosody information} are provided to the synthesis block 126 to read “candidates of speech segments from the speech segment database 116”, where the synthesis block 126 is the TTS module of the “text-to-speech synthesis system.”; Tachibana, ¶¶ [0033]-[0034], [0036], [0041]); and obtaining a speech for the text input by merging the speech segments (“Generating synthesized speech by concatenating speech segments,” where generating synthesized speech is obtaining a speech for the text input {speech segments are produced from the input text} and concatenating speech is merging the speech segments; Tachibana, ¶ [0039]).

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 4 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Tachibana in view of Eller (U.S. Pat. App. Pub. No. 2013/0289998, hereinafter Eller).

Regarding claim 4, the rejection of claim 1 is incorporated. Tachibana discloses all of the elements of the current invention as stated above. However, Tachibana fails to expressly recite wherein the prosody information comprises intonation information, and accent information of the text input based on at least one of a format, a syntactic structure, and a context of the text input.

Eller teaches a “synthetic speech system” which incorporates contextual input. (Eller, ¶ [0003]). Regarding claim 4, Eller discloses wherein the prosody information comprises intonation information, and accent information of the text input (“Prosody handling subsystem 200 receives the output from text processing subsystem 100 as well as scenario data 020 and generates the rhythm, stress, and intonation of the speech,”; Eller, ¶¶ [0062], [0060]) based on at least one of a format, a syntactic structure, and a context of the text input. (“the context for the selected text input will define input parameters for the prosody selection subsystem,” thus including context of the text input. In addition, “when the role of the speaker is important to how the speech signal may need to be modified … additional lexical, syntax, prosodic or articulatory control input” may be incorporated into the domain/context selection, thus including syntactic structure; Eller, ¶¶ [0076], [0078], [0081]).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the concatenative speech synthesis system of Tachibana to incorporate the teachings of Eller to include wherein the prosody information comprises intonation information, and accent information of the text input based on at least one of a format, a syntactic structure, and a context of the text input. The systems and methods described in Eller can “expand a speaker's speech inventory so the system has the resources to synthesize speech from any given text in a realistic, natural-sounding way.” (Eller, ¶ [0007]).

Regarding claim 14, the rejection of claim 11 is incorporated. Claim 14 is substantially the same as claim 4 and is therefore rejected under the same rationale as above.

Claims 5 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Tachibana in view of Walker (U.S. Pat. App. Pub. No. 2001/0047260, hereinafter Walker).

Regarding claim 5, the rejection of claim 1 is incorporated. Tachibana discloses all of the elements of the current invention as stated above. However, Tachibana fails to expressly recite wherein each segment of the plurality of segments comprises index information that is related to an order in the text input, and wherein the processor when executing the at least one instruction is further configured to obtain the speech for the text input by merging the speech segments based on the index information.

Walker teaches a “method and system for delivering text-to-speech in a real time telephony environment.” (Walker, ¶ [0002]). Regarding claim 5, Walker discloses wherein each segment of the plurality of segments comprises index information that is related to an order in the text input (“A text-to-speech (TTS) resource manager is operable for dividing the text document into text document segments and associating a sequence number with each text document segment” where the text document segments {plurality of segments} comprise the sequence number {index information} where “TTS resource manager places the text document segments and the corresponding sequence numbers in a sequential order within a queue” {thus being related to an order in the text input}; Walker, ¶ [0010]), and wherein the processor when executing the at least one instruction is further configured to obtain the speech for the text input by merging the speech segments based on the index information (the system uses “the queue Walker, ¶ [0010]).

It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the concatenative speech synthesis system of Tachibana to incorporate the teachings of Walker to include wherein each segment of the plurality of segments comprises index information that is related to an order in the text input, and wherein the processor when executing the at least one instruction is further configured to obtain the speech for the text input by merging the speech segments based on the index information. The systems and methods of Walker allow for “continuous playing of an audio stream while not overloading the voice application with unnecessary buffers which the voice application would need to manage.” (Walker, ¶ [0018]).

Regarding claim 15, the rejection of claim 11 is incorporated. Claim 15 is substantially the same as claim 5 and is therefore rejected under the same rationale as above.

Claim 6 is rejected under 35 U.S.C. 103 as being unpatentable over Tachibana in view of Non-patent literature to Wu et al. (Z. Wu, C. Valentini-Botinhao, O. Watts, and S. King, “Deep neural networks employing multi-task learning and stacked bottleneck features for speech synthesis,” in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 4460–4464, hereinafter Wu).

Regarding claim 6, the rejection of claim 1 is incorporated. Tachibana discloses all of the elements of the current invention as stated above. However, Tachibana fails to expressly recite wherein the TTS module is a deep neural network text-to-speech (DNN TTS) module.

Wu teaches “multi-task learning (MTL) in a DNN” for speech synthesis. (Wu, pg. 4460, Col. 2, lines 24-25). Regarding claim 6, Wu discloses wherein the TTS module is a deep neural network text-to-speech (DNN TTS) module. (Discloses “DNN-based speech synthesis... [where] phone or state durations are predicted and the linguistic features for each frame are mapped to vocoder parameters, which are then passed to a synthesis filter to reconstruct the speech.”; Wu, pg. 4461, Col. 1, lines 1 and 7-10).  

It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the concatenative speech synthesis system of Tachibana to incorporate the teachings of Wu to include wherein the TTS module is a deep neural network text-to-speech (DNN TTS) module. “Deep neural networks (DNNs) have the potential to address” two weaknesses in HMM-based speech synthesis: “the density function over the acoustic features (usually a Gaussian) and the decision-tree driven parameterisation of the model,” as recognized by Wu. (Wu, pg. 4460, Col. 1, lines 22-27).

Claims 7 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Tachibana in view of Christian (U.S. Pat. App. Pub. No. 2017/0256252, hereinafter Christian) and Erickson (U.S. Pat. App. Pub. No. 2017/0116187, hereinafter Erickson).

Regarding claim 7, the rejection of claim 1 is incorporated. Tachibana discloses all of the elements of the current invention as stated above. However, Tachibana fails to expressly recite further comprising: a speaker, wherein the processor when executing the at least one instruction is further configured to control the speaker to output a rejoinder speech and the speech for the text input after the rejoinder speech.

Christian teaches systems and methods for “providing non-lexical cues in text-to-speech output.” (Christian, ¶ [0002]). Regarding claim 7, Christian discloses further comprising: a speaker, (The system includes an “audio output 106 [which] may be a speaker or an output port to transmit a signal including audio output to another system.”; Christian, ¶ [0010]) wherein the processor when executing the at least one instruction is further configured to control the speaker to output a rejoinder speech … [to fill a silent pause]. (“the user adaptive dialogue system 120 may include … a non-lexical cue insertion engine 130 [and] a speech synthesizer 126” where “word-like non-lexical cues ... may be selected to fill a {silent} pause.”; Christian, ¶ [0017]). 

It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the concatenative speech synthesis system of Tachibana to incorporate the teachings of Christian to include wherein the processor when executing the at least one instruction is further configured to control the speaker to output a rejoinder speech … [to fill a silent pause]. “Incorporating non-lexical cues can infuse added meaning to the output and improve the ability of a hearer to comprehend the output… [as well as] Christian. (Christian, ¶ [0008]). However, Tachibana and Christian fail to expressly recite wherein the processor when executing the at least one instruction is further configured to control the speaker to output a rejoinder speech and the speech for the text input after the rejoinder speech.

Erickson teaches “natural language dialogue systems” including “natural language generation (NLG).” (Erickson, ¶ [0001]). Regarding claim 7, Erickson discloses wherein the processor when executing the at least one instruction is further configured to control the speaker to output a rejoinder speech and the speech for the text input after the rejoinder speech. (The system includes “inserting non-lexical utterances like 'um' and 'er' before a low confidence word or phrase or sentence” or “adding an expression or gesture indicating uncertainty just before or during the production of the low confidence word or phrase or sentence” where the non-lexical utterance and the expression or gesture indicating uncertainty are the rejoinder speech, and where the non-lexical utterance and the expression or gesture indicating uncertainty {rejoinder speech} are input prior to the word or phrase or sentence {speech for the text input}; Erickson, ¶ [0092]).  

It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the concatenative speech synthesis system of Tachibana as modified by the text-to-speech output system including non-lexical cues of Christian to incorporate the teachings of Erickson to include wherein the processor when executing the at least one instruction is further configured to control the speaker to output a rejoinder speech and the speech for the text input after the rejoinder speech. The systems and methods described in Erickson can “effectively and efficiently control the manner in which human users experience errors in the NL outputs of NLP systems.” (Erickson, ¶ [0006]).

Regarding claim 16, the rejection of claim 11 is incorporated. Claim 16 is substantially the same as claim 7 and is therefore rejected under the same rationale as above.

Claims 9 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Tachibana in view of Osowski (U.S. Pat. App. Pub. No. 2014/0200894, hereinafter Osowski).

Regarding claim 9, the rejection of claim 1 is incorporated. Tachibana discloses all of the elements of the current invention as stated above. However, Tachibana fails to expressly recite further comprising: a communicator; and a speaker, wherein the processor when executing the at least one instruction is further configured to: obtain first segments of the plurality of segments in which the prosody information is reflected to each of the first segments in parallel by inputting the first segments and the prosody information to the text-to-speech (TTS) module, transmit, to an external device for speech synthesis, a remaining segment among the plurality of segments and the prosody information through the communicator, obtain a remaining speech for the remaining segment from the external device performing the speech synthesis on the remaining segment through the communicator, and output the speech by merging the first segments obtained in parallel and the remaining speech received from the external device through the speaker.

Osowski teaches a “system and method to perform certain TTS processing on local devices.” (Osowski, ¶ [0010]). Regarding claim 9, Osowski discloses further comprising: a communicator (“input/output device 206 may also include a network connection”; Osowski, ¶ [0017], FIG. 2); and a speaker (“audio output device 204”; Osowski, ¶ [0015], FIG. 2), wherein the processor when executing the at least one instruction is further configured to: obtain first segments of the plurality of segments (“the local device may receive text data for Osowski, ¶ [0054]) in which the prosody information is reflected to each of the first segments in parallel (“Text input into a TTS module 214 may be sent to the FE 216 for processing. The front-end may include modules for performing text normalization, linguistic analysis, and prosody generation.”; Osowski, ¶ [0020]) by inputting the first segments and the prosody information to the text-to-speech (TTS) module (“As shown in block 508, the local device may then perform speech synthesis using units available in the local unit database.”; Osowski, ¶ [0054]), transmit, to an external device for speech synthesis, a remaining segment among the plurality of segments and the prosody information through the communicator (“In one aspect, local TTS processing may also be combined with distributed TTS processing. Where a portion of text to be converted uses units available in a local database, that portion of text may be processed locally. Where a portion of text to be converted uses units not available in a local database, the local device may obtain the units from a remote device” and the “selection of units from input text may be performed by a remote device where the remote device is aware of what units are available on a local device. The remote device may determine the desired units to use in synthesizing the text and send the local device the unit sequence, along with the unit speech segments that are unavailable on the local device.”; Osowski, ¶¶ [0040]-[0041]), obtain a remaining speech for the remaining segment from the external device performing the speech synthesis on the remaining segment through the communicator (“For units which are not available in the local unit database, or for units where other unit examples are desired, the local device may obtain audio segments corresponding to other units from a remote device, as shown in block 510,”; Osowski, ¶ [0054]), and output the speech by merging the first segments obtained in parallel and the remaining speech received from the external device through the speaker. (“The units from the remote device {remaining speech received from the external device} may then concatenated with the local units {first segments obtained in parallel} for construction of the audio Osowski, ¶¶ [0040], [0015], FIG. 2).
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the concatenative speech synthesis system of Tachibana to incorporate the teachings of Osowski to include further comprising: a communicator; and a speaker, wherein the processor when executing the at least one instruction is further configured to: obtain first segments of the plurality of segments in which the prosody information is reflected to each of the first segments in parallel by inputting the first segments and the prosody information to the text-to-speech (TTS) module, transmit, to an external device for speech synthesis, a remaining segment among the plurality of segments and the prosody information through the communicator, obtain a remaining speech for the remaining segment from the external device performing the speech synthesis on the remaining segment through the communicator, and output the speech by merging the first segments obtained in parallel and the remaining speech received from the external device through the speaker. Distributed TTS, where some tasks are performed by a local device, can overcome latency and network unavailability problems found in prior art TTS solutions, as recognized by Osowski. (Osowski, ¶ [0010]).

Regarding claim 18, the rejection of claim 11 is incorporated. Claim 18 is substantially the same as claim 9 and is therefore rejected under the same rationale as above.

Claims 10 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Tachibana and Osowski as applied to claims 9 and 18 above, and further in view of Walker.

Regarding claim 10, the rejection of claim 9 is incorporated. Tachibana and Osowski disclose all of the elements of the current invention as stated above. However, Tachibana and Osowski fail to expressly recite wherein the first segments correspond to a beginning part of the text input, and wherein the processor when executing the at least one instruction is further configured to output the speech by outputting the first segments and outputting the remaining speech received from the external device after outputting the first segments through the speaker.

The relevance of Walker is disclosed above with reference to claim 5. Regarding claim 10, Walker discloses wherein the first segments correspond to a beginning part of the text input (“TTS engine 22 a converts the first text segment into a first speech segment and associates sequence identifier #1 with the first speech segment” where “Text-to-speech (TTS) engines are computing devices which convert written text into audible computer generated speech,” thus separate computing devices; Walker, ¶¶ [0028], [0003]), and wherein the processor when executing the at least one instruction is further configured to output the speech by outputting the first segments (“After a TTS engine 22 Walker, ¶¶ [0030]-[0031]) and outputting the remaining speech received from the external device after outputting the first segments through the speaker. (“TTS engine 22 b converts the second text segment into a second speech segment and associates sequence identifier #2 with the second speech segment (remaining speech)” where “streaming buffer 24 uses the corresponding sequence identifiers to reassemble the first speech segment before the second speech segment in the proper sequential order,” where the TTS engine 22 a and the TTS engine 22 b are separate computing devices. Thus, the second speech segment (remaining speech) is output after the first speech segment (first segments); Walker, ¶ [0030]).

It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the concatenative speech synthesis system of Tachibana as modified by the systems and methods for local TTS of Osowski to incorporate the teachings of Walker to include wherein the first segments correspond to a beginning part of the text input, and wherein the processor when executing the at least one instruction is further configured to output the speech by outputting the first segments and outputting the remaining speech received from the external device after outputting the first segments through the speaker. The systems and methods of Walker allow for “continuous playing of an audio stream while not overloading the voice application with unnecessary buffers which the voice application would need to manage.” (Walker, ¶ [0018]).

Regarding claim 19, the rejection of claim 18 is incorporated. Claim 19 is substantially the same as claim 10 and is therefore rejected under the same rationale as above.


Allowable Subject Matter
Claims 2-3, 8, 12-13, and 17 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
The following is a statement of reasons for the indication of allowable subject matter: The closest prior art has been identified and made of record.
Regarding claims 2-3, and similarly regarding claims 12-13, the prior art made of record fails to expressly disclose “wherein the processor when executing the at least one instruction is further configured to: obtain a plurality of first segments by segmenting the text input based on a first criterion, and based on a first processing time for converting the plurality of first segments to the speech segments being less than a predetermined time, input the plurality of first segments to the TTS module, based on the first processing time for converting at least one first segment of the plurality of first segments to the speech segments being greater than or equal to the predetermined time, obtain a plurality of second segments by segmenting the at least one first segment based on a second criterion, and based on a second processing time for converting the plurality of second segments to the speech segments being less than the predetermined time, input the plurality of second segments the TTS module.” At best, Neeracher et al. (U.S. Pat. App. Pub. No. 2007/0192105, hereinafter Neeracher) discloses speech synthesis including parsing text into units, where the “unit matching engine 230… matches units from a text string to audio Neeracher, ¶ [0034]). 
However, Neeracher, at least, fails to teach or suggest “based on a first processing time for converting the plurality of first segments to the speech segments being less than a predetermined time, input the plurality of first segments to the TTS module, [and] based on the first processing time for converting at least one first segment of the plurality of first segments to the speech segments being greater than or equal to the predetermined time, obtain a plurality of second segments by segmenting the at least one first segment based on a second criterion, and based on a second processing time for converting the plurality of second segments to the speech segments being less than the predetermined time, input the plurality of second segments the TTS module.”
Regarding claim 8, and similarly regarding claim 17, the prior art made of record fails to expressly disclose “wherein the memory is further configured to store a plurality of rejoinder speech, and wherein the processor when executing the at least one instruction is further configured to: identify the rejoinder speech from among the plurality of rejoinder speech based on a processing time to obtain the speech for the text input.” At best, Silverman (U.S. Pat. App. Pub. No. 2008/0071529, hereinafter Silverman) discloses selection of non-speech sounds for use in synthesized speech based on unit and concatenation costs. (Silverman, ¶¶ [0007], [0040], [0049]). However, Silverman, at least, fails to disclose “based on a second processing time for converting the plurality of second segments to the speech segments being less than the predetermined time, input[ting] the plurality of second segments the TTS module.”

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Sean E. Serraguard, whose telephone number is (313) 446-6627. The examiner can normally be reached Monday through Friday, 07:00-17:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is 
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel C. Washburn, can be reached at (571) 272-5551. The fax number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/SES/Patent Examiner, Art Unit 2657
/Paras D Shah/Primary Examiner, Art Unit 2659

06/16/2021