DETAILED ACTION
This office action is in response to Applicant’s submission filed on 9/29/2022. Claims 1 – 2, 7, and 15 are pending in the application. Claims 1 and 15 are amended. As such, claims 1 – 2, 7, and 15 are examined. Please see below for more detail.


Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments
Applicant’s arguments in the RCE Amendment filed 9/29/2022 (herein “Amendment”) with respect to the 35 USC §103 rejection of claims 1 and 15, raised in the previous office action over Isobe in view of Eagleman, Cengiz, Traupman, and Kang, have been considered, but are persuasive only to the extent that the amendments have changed the broadest reasonable interpretation, thus necessitating a new ground of rejection in view of newly cited references Lai et al. (US 20170193533 A1), Chen et al. (US 7092870 B1), and Chicote et al. (US 10706837 B1).
Therefore, while all of the Applicant’s arguments and amendments filed in the Amendment have been fully considered, they are otherwise not persuasive. Please see below for more detail, including updated citations and obviousness rationale.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1, 2, and 15 are rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter without significantly more. The claims as a whole, considering all claim elements both individually and in combination, do not amount to significantly more than an abstract idea.
	Independent claims 1 and 15 recite: “An artificial intelligence (AI) apparatus for mutually converting a text and a speech, comprising: a memory configured to store a plurality of Text-To-Speech (TTS) engines; and a processor configured to: obtain image data containing a text, determine a speech style corresponding to the text, generate a speech corresponding to the text by using a TTS engine corresponding to the determined speech style among the plurality of TTS engines, and output the generated speech, wherein the processor is further configured to: extract at least one text style feature from the text included the image data, if the text is handwritten, determine a text creator corresponding to an age and a gender based on the at least one text style feature, and determine a speech style corresponding to the determined text creator.”
	The limitations of “obtaining image data containing a text”, “determining a speech style corresponding to the text”, “outputting the generated speech”, “extracting at least one text style”, “determining text creator corresponding to an age and a gender if text is handwritten”, and “determining speech style of the text creator”, as drafted, cover mental processes; as such, they all point to an abstract idea.
Obtaining image data containing a text can be carried out by a human, simply by looking over the available image data and deciding which item contains text. Determining a speech style corresponding to the text can also be accomplished by a human: an individual can look at the text, recognize that it is, for example, a news item, and recite it as a news announcer would, or, if it is a portion of a children’s story, recite it as a child would. Outputting the generated speech can also be carried out by a human by verbalizing the content of the text. Extracting at least one text style can likewise be performed by a human by looking at the text and identifying the pertinent style associated with it. As to determining a text creator corresponding to an age and a gender if the text is handwritten, a human can look at handwritten material and suggest which age group wrote it, or whether a child, an educated person, or an uneducated individual wrote it, based on the text structure, any obvious typographical mistakes, etc. How firmly the handwritten text has been written can also suggest whether a male or a female wrote it. Finally, as to determining the speech style of the text creator, a human can infer a speech style of the author (for example, whether the material was written by a child or an adult, as mentioned above) and imitate that speech style accordingly.
This judicial exception is not integrated into a practical application. Even though claims 1 and 15 recite reliance on a processor, a storage device, or programs, these are not considered additional elements due to lack of specificity, and are considered a generic computer (or processor); see par. 0106 of the Applicant’s Specification. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. Beyond the added generic processor, the claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception, because they do not impose any meaningful limits on practicing the abstract idea; see MPEP 2106.05(f), 2106.04(d). The claims are directed to an abstract idea.
Likewise, the claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the integration of the abstract idea into a practical application, the additional element of using a computer is, due to lack of specificity, considered a general-purpose computer (or processor). Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. Moreover, the limitations in the claims noted above, taken individually or as an ordered set, do not amount to significantly more than the judicial exception. As such, they are directed to an abstract idea (mental process) as discussed. Thus, neither the additional elements nor the limitations, taken individually or as an ordered set, amount to significantly more than the judicial exception. The claims are not patent eligible.
Similarly, dependent claim 2 includes additional limitations directed to the plurality of TTS engines, wherein the speech style feature includes a tone, a pitch, etc. These are considered “insignificant extra-solution activity to the judicial exception” because they fail to provide meaningful limits that go beyond generally linking the use of an abstract idea to a particular technological environment. Therefore, this claim is also not patent eligible.
Therefore, claims 1 – 2 and 15 are not patent eligible under 35 USC 101.



Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Isobe (US20110093272A1) in view of Eagleman (US20180233163A1), Lai et al. (US20170193533A1) (herein "Lai"), Chen et al. (US7092870B1) (herein "Chen"), and Chicote et al. (US10706837B1) (herein "Chicote").

Isobe and Eagleman were applied in the previous Office Action.
Regarding claims 1 and 15, Isobe teaches [An artificial intelligence (AI) apparatus for mutually converting a text and a speech - claim 1] and [a method for mutually converting a text and a speech, the method – claim 15] comprising: a memory configured to store a plurality of Text-To-Speech (TTS) engines; and (Isobe, Par. 0064:"The pieces of speech data synthesized respectively by each emotion are finally output as a speech message of one sentence.", and Par. 0065:" Data stored in speech synthesis data storage device 305 is used by speech data synthesizer 303 to generate speech synthesis data. That is, speech synthesis data storage device 305 supplies data for speech synthesis and parameters to speech data synthesizer 303.”),
generate a speech corresponding to the text by using a TTS engine corresponding to the determined speech style among the plurality of TTS engines, and (Isobe, Par. 0014:” In still another preferred embodiment of the present invention, the speech synthesis data storage device may additionally store a parameter for setting, for each emotion class, the characteristics of a speech pattern for each user of the plural communication terminals, and the speech data synthesizer may adjust the synthesized speech data based on the parameter [style]. In the present embodiment, because speech data is adjusted by using a parameter depending on a type of emotion stored for each user, speech data that matches the characteristics of the speech pattern of a user are generated. Therefore, it is possible to generate a speech message that reflects the individual characteristics of voice of a user who is a transmitter.”),
output the generated speech, (Isobe, Par. 0064:” The pieces of speech data synthesized respectively by each emotion are finally output as a speech message of one sentence.").
Isobe fails to explicitly disclose, however, Eagleman teaches a processor configured to: obtain image data containing a text (Eagleman, Par. 0049:” In one variant, the input information includes information related to a user's surroundings and/or the surroundings of a system component [e.g., sensor], such as information associated with nearby objects and/or people [e.g., wherein Block S110 includes extracting information, such as text, semantic meaning, conceptual information, and/or any other suitable information, from the input information]. In a first example of this variant, the system includes an image sensor [e.g., camera], and the text input includes text recognized in images captured by the image sensor [e.g., automatically detected, such as by performing image segmentation, optical character recognition, etc.], and/or other extracted information [e.g., conceptual information] includes information discerned from the images [e.g., as described above].”).
determine a speech style corresponding to the text. (Eagleman, Par. 0055:” In one variation, Block S120 can include implementing a text-to-speech [TTS] engine that extracts a set of acoustic components [e.g., a closed set of acoustic components] from the communication data of Block S110. The TTS engine can implement a synthesizer that converts language text into speech and/or renders symbolic linguistic representations [e.g., phonetic transcriptions, phonemic transcriptions, morphological transcriptions, etc.] into speech components without generating sound. The acoustic components can include phonemes [or sounds at the resolution of phonemes], or finer-time-scale acoustic components used to construct phonemes. As such, the acoustic components can be phonemes, sub-phoneme components, and/or super-phoneme assemblies, and can include aspects of tone, stress, or any other suitable phoneme feature.”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Isobe in view of Eagleman to obtain image data containing a text and determine a speech style corresponding to the text, in order to allow a user to haptically receive communications originating in a textual format, as evidenced by Eagleman (See Par. 0019).
Isobe and Eagleman fail to explicitly disclose, however, Lai teaches wherein the processor is further configured to: extract at least one text style feature from the text included the image data, (Lai, ABS:” Embodiments are directed to a computer implemented method of analyzing image data. The method includes receiving, using a processor system, image data of one or more images and associated text data that have been posted by a user. The method further includes analyzing the image and text data to extract one or more image and one or more text features, and analyzing the one or more image and one or more text features to predict personality traits, needs and values of the user.”)
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Isobe and Eagleman in view of Lai to extract at least one text style feature from the text included the image data, in order to provide an efficient method for gathering personality traits of the users, as evidenced by Lai (See Par. 0045).
Isobe, Eagleman and Lai fail to explicitly disclose, however, Chen teaches if the text is handwritten, determine a text creator corresponding to an age and a gender based on the at least one text style feature, and (Chen, Col. 10, lines 41 – 45:” The stored textual data can be also associated with handwriting biometrics that provide additional information about speakers [for example, a conventional method known in the art is used to relate handwriting manner [style] to a social user status, age, sex, etc.].”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Isobe, Eagleman and Lai in view of Chen to determine, if the text is handwritten, a text creator corresponding to an age and a gender based on the at least one text style feature, in order to enable automatic indexing of handwriting data, as evidenced by Chen (See Col. 1, Lines 20-21).
Isobe, Eagleman, Lai and Chen fail to explicitly disclose, however, Chicote teaches determine a speech style corresponding to the determined text creator. (Chicote, Col. 16, lines 10- 42:” Instead of or in addition to the re-training of one or more of the various sub-models described above, with reference also to FIG. 8A, the text metadata 215 may be changed or replaced to change an attribute or style of the output audio data 290. As discussed above with reference to FIGS. 9 and 10, the text metadata 215 may be generated from the input text data 210; in other embodiments, however, the text metadata 215 may be wholly or partially generated from different voice and/or text data by, for example, building a system using the different voice and/or text data, as described above, and using components from that system to generate the text metadata 1115. For example, the input text data 210, training audio 902, and/or training text 904 may represent speech in a neutral tone or style such that the speech model 222 generates output audio data 290 in a corresponding neutral tone or style. The text metadata 215 may, however, be generated using training data in a different tone or style, and may be input to the speech model 222 to thereby change a vocal attribute of the output audio data 290. For example, the text metadata 215 may correspond to the tone or style of speech of a television newscaster, actor, child, or other such style; the text metadata 215 may also correspond to an accent associated with a particular language and/or region. The text metadata 215 may further correspond to a particular person, such as a celebrity. The text metadata 215 associated with a particular person may be generated using audio and text data associated with that person; the text metadata 215 associated with a style may be generated using audio and text data associated with a person exemplifying that style or from a blend or mix of persons exemplifying that style. 
The resultant output audio data 290 may be recognizable to a listener as belonging to the original speaker but modified by the various tones or styles.”)
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Isobe, Eagleman, Lai and Chen in view of Chicote to determine a speech style corresponding to the determined text creator, in order to convert the input text into high-quality natural-sounding speech in an efficient manner, as evidenced by Chicote (See Col. 5, lines 39 - 40).


Claim 2 is rejected under 35 U.S.C. 103 as being unpatentable over Isobe, Eagleman, Cengiz, Traupman, and Kang as applied to claim 1, and in further view of Pollet (US20160093289A1) (hereinafter "Pollet").

Pollet was applied in the previous Office Action.
Regarding claim 2, Isobe, Eagleman, Cengiz, Traupman, and Kang fail to explicitly disclose, however, Pollet teaches wherein each of the plurality of TTS engines includes at least one speech style feature, and (Pollet, Par. 0016:” The conventional approach to enabling a concatenative TTS system to render input text as speech in any one of multiple speech styles involves creating, for each speaking style, a database of speech segments by segmenting recordings of speech spoken in that style. In response to a user request to render input text as speech having a specified style, the conventional TTS system renders the input text as speech by using speech segments from the database of speech segments corresponding to the specified style. As a result, to render text as speech having a specified style, conventional TTS systems use only those speech segments that were obtained from recordings of speech spoken in the specified style. However, the inventors have recognized that obtaining a speech database having an adequate number of speech segments to allow for high-quality synthesis....”).
wherein the speech style feature includes at least one of a tone, a pitch, a speed, an accent, a speech volume, or a pronunciation (Pollet, Par. 0015:” Some embodiments are directed to multi-style synthesis techniques for rendering text as speech in any one multiple different styles. For example, text may be rendered as speech having a style that expresses an emotion, non-limiting examples of which include happiness, excitement, hesitation, anger, sadness, and nervousness. As another example, text may be rendered as speech having a style of speech spoken for a broadcast [e.g., newscast speech, sports commentary speech, speech during a debate, etc.]. As yet another example, text may be rendered as speech having a style of speech spoken in a dialogue among two or more people [e.g., speech from a conversation among friends, speech from an interview, etc.]. As yet another example, text may be rendered as speech having a style of speech spoken by a reader reading content aloud. As yet another example, text may be rendered as speech having a particular dialect or accent. As yet another example, text may be rendered as speech spoken by a particular type of speaker [e.g., a child/adult/elderly male or female speaker]. The above-described examples of speech styles are illustrative and not limiting, as the TTS synthesis techniques described herein may be used to generate speech having any other suitable style.”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Isobe, Eagleman, Cengiz, Traupman, and Kang in view of Pollet such that each of the plurality of TTS engines includes at least one speech style feature, wherein the speech style feature includes at least one of a tone, a pitch, a speed, an accent, a speech volume, or a pronunciation, in order to render input text as speech via concatenative synthesis, as evidenced by Pollet (see par 0013).

Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over Isobe, Eagleman, Cengiz, Traupman, and Kang as applied to claim 1, and in further view of Chen (US20140025382A1).

Chen was applied in the previous Office Action.
Regarding claim 7, Isobe, Eagleman, Cengiz, Traupman, and Kang fail to explicitly disclose, however, Chen teaches wherein at least one of the plurality of TTS engines is learned using a machine learning algorithm or a deep learning algorithm. (Chen, Par. 0019:”In an embodiment a text to speech method is provided, the method comprising:”, and Par. 0020:”receiving input text”, and Par. 0021:”dividing said inputted text into a sequence of acoustic units;”, Par. 0022:”converting said sequence of acoustic units to a sequence of speech vectors using an acoustic model, wherein said model has a plurality of model parameters describing probability distributions which relate an acoustic unit to a speech vector; and”, and Par. 0023:”outputting said sequence of speech vectors as audio,”, and Par. 0024:”the method further comprising determining at least some of said model parameters by:”, and Par. 0025:”extracting expressive features from said input text to form an expressive linguistic feature vector constructed in a first space; and”, and Par. 0026:’mapping said expressive linguistic feature vector to an expressive synthesis feature vector which is constructed in a second space.”, and Par. 0027:”In an embodiment, mapping the expressive linguistic feature vector to an expressive synthesis feature vector comprises using a machine learning algorithm, for example, a neural network.”, and Par. 0028:”The second space may be a multi-dimensional continuous space. This allows a smooth change of expression in the outputted audio.”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Isobe, Eagleman, Cengiz, Traupman, and Kang in view of Chen such that at least one of the plurality of TTS engines is learned using a machine learning algorithm or a deep learning algorithm, in order to improve training transformation since the number of patterns which need to be learnt is reduced, as evidenced by Chen (See Par. 0119).


Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure. Bakis et al. (U.S. Patent Application Publication No. US20050096909A1) teaches (Par. 0005) “a method which includes identifying text to convert to speech, selecting a speech style sheet from a set of available speech style sheets, the speech style sheet defining desired speech characteristics, marking the text to associate the text with the selected speech style sheet, and converting the text to speech having the desired speech characteristics by applying a low level markup associated with the speech style sheet.”
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DARIOUSH AGAHI whose telephone number is (408)918-7689. The examiner can normally be reached Monday - Thursday and alternate Fridays, 7:30-4:30 PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached at 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.




/DARIOUSH AGAHI/ Examiner, Art Unit 2656
/BHAVESH M MEHTA/ Supervisory Patent Examiner, Art Unit 2656