DETAILED ACTION
This communication is in response to the Application filed on 16 August 2019. Claims 1-18 are pending and have been examined.
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement
The information disclosure statement (IDS) submitted on 16 August 2019 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

Compact Prosecution
In the interest of compact prosecution, the examiner suggests that the applicant incorporate more detail into the independent claims so as to distinguish the applicant’s figure 10 from the prior art.

Claim Objections
Claim 11 is objected to because of the following informalities: a period is missing at the end of the claim. Appropriate correction is required.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claim 5 recites the limitation "borrowing language". There is insufficient antecedent basis for this limitation in the claim.
Claim 6 recites the limitation "placing". There is insufficient antecedent basis for this limitation in the claim.

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claims 1 and 3-4 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by US 20180268806, hereinafter referred to as Chun et al.

Regarding claim 1, Chun et al. discloses a method of learning a phonetic embedding space (“Through training, the linguistic encoder 114 learns to produce a speech unit representation or "embedding" for a linguistic unit. The linguistic encoder 114 receives data indicating a linguistic unit, such as a phoneme, and provides an embedding representing acoustic characteristics that express the linguistic unit,” Chun et al., para [0027].), the method comprising: 
(“During stage (B), the TTS system 102 obtains data indicating linguistic units 134a-134c corresponding to the obtained text 146. For example, the TTS system 102 may access a lexicon to identify a sequence of linguistic units, such as phones, in a phonetic representation of the text 146. The linguistic units can be selected from a set of context-dependent phones used to train the linguistic encoder 114. The same set of linguistic units used for training can be used during speech synthesis for consistency,” Chun et al., para [0065]. See also Chun et al., fig. 1B.); 
extracting a plurality of acoustic features from each phoneme segment, each acoustic feature corresponding to a dimension of the embedding space (“During stage (D), the linguistic encoder 114 outputs an embedding 118a in response to the linguistic unit identifier 108. The acoustic encoder 116 outputs an embedding 118b in response to the acoustic feature vectors 110. Embeddings 118a and 118b can be the same size as each other, and can be the same size for all linguistic units and lengths of audio data. For example, the embeddings 118a and 118b may be 32-bit vectors,” Chun et al., para [0039]. Each vector representing an acoustic feature corresponds to a dimension.); and 

training a model, using a multiplicity of speech audio features, to define contiguous regions within the embedding space, each region corresponding to a label (Chun et al., FIG. 4, is a flow diagram that illustrates an example of a process for training an autoencoder. Chun et al., para [0073], explains how diphone embeddings 132b may join in sequence, thereby defining contiguous regions within the embedding space.), 

whereby a given new segment of speech audio of a phoneme can be mapped to a single region within the embedding space (As noted above, each bit sequence represents a single region within the embedding space.).  


Regarding claim 3, Chun et al. discloses a method of recognizing a phoneme, the method comprising: 

extracting a segment of speech audio (“In the illustrated example, the TTS system 102 obtains a training example 106, which includes a linguistic label 106a and associated audio data 106b. For example, the label 106a indicates that the audio data 106b represents an "/e/" phone. In some implementations, the TTS system 102 may extract examples representing individual linguistic units from longer audio segments,” Chun et al., para [0031].); 

extracting a plurality of acoustic features from the segment of speech audio (“Through training, the linguistic encoder 114 learns to produce a speech unit representation or "embedding" for a linguistic unit. The linguistic encoder 114 receives data indicating a linguistic unit, such as a phoneme, and provides an embedding representing acoustic characteristics that express the linguistic unit,” Chun et al., para [0027].); and  

applying a model to the acoustic features to map the segment to a region representing a phoneme, wherein the phoneme represents the pronunciation of the segment of speech audio (“In the illustrated example, the TTS system 102 obtains the text 146 of the word "hello" to be synthesized. The TTS system 102 determines the sequence of linguistic units 134a-134d that represent the pronunciation of the text 146. Specifically, the linguistic units include linguistic unit 134a "/h/", linguistic unit 134b "/e/", and linguistic unit 134c "/l/," and linguistic unit 134d "/o/.",” Chun et al., para [0066]. And, as noted in Chun et al., para [0027], the linguistic units include phonemes.).  

Regarding claim 4, Chun et al. discloses the method of claim 3 further comprising: 

repeating for a string of consecutive speech audio segments, applying the model to acoustic feature of the string of speech audio segments, to yield a string of phonemes (“The TTS system 102 may access stored feature vectors for the audio data 106b from the data storage 104 or perform feature extraction on the audio data 106b. For example, the TTS system 102 analyzes different segments or analysis windows of the audio data 106b. These windows are shown as w.sub.0, . . . w.sub.n, and can be referred to as frames of the audio. In some implementations, each window or frame represents the same fixed-size amount of audio, e.g., 5 milliseconds (ms) of audio. The windows may partially overlap or may not overlap. For the audio data 106, a first frame w.sub.0 may represent the segment from 0 ms to 5 ms, a second window w.sub.1 may represent a segment from 5 ms to 10 ms, and so on,” Chun et al., para [0035].); and 

adding the string of phonemes to a dictionary as a new word (“In some implementations, the TTS system 102 may extract examples representing individual linguistic units from longer audio segments. For example, the data storage 104 can include audio data for utterances and corresponding text transcriptions of the utterances. The TTS system 102 can use a lexicon to identify a sequence of linguistic units, such as phones, for each text transcription,” Chun et al., para [0031].).  
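
By way of illustration only, the fixed-size framing described in the passage quoted above from Chun et al., para [0035] (5 ms windows that may or may not overlap), can be sketched as follows. This sketch is not taken from Chun et al.; the function name `frame_audio`, the 16 kHz sample rate, the `hop_ms` parameter, and the choice to drop a trailing partial frame are all assumptions made for the example.

```python
from typing import List, Sequence


def frame_audio(samples: Sequence[float], sample_rate: int = 16000,
                frame_ms: float = 5.0, hop_ms: float = 5.0) -> List[List[float]]:
    """Split samples into fixed-size frames w_0 ... w_n.

    hop_ms == frame_ms gives non-overlapping frames; a smaller hop_ms
    gives partially overlapping frames. Trailing samples that do not
    fill a whole frame are dropped (an assumption of this sketch).
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    if frame_len <= 0 or hop_len <= 0:
        raise ValueError("frame and hop must each span at least one sample")
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(list(samples[start:start + frame_len]))
        start += hop_len
    return frames


# 20 ms of audio at the assumed 16 kHz rate (hypothetical silent signal).
audio = [0.0] * 320
frames = frame_audio(audio)                    # non-overlapping 5 ms frames
overlapping = frame_audio(audio, hop_ms=2.5)   # frames overlapping by half
```

With these assumed values, each frame holds 80 samples, so the first frame covers 0 ms to 5 ms and the second covers 5 ms to 10 ms, matching the example in the quoted passage.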

Claim(s) 13 is/are rejected under 35 U.S.C. 102(a)(2) as being anticipated by US 7369993, hereinafter referred to as Atal.

Regarding claim 13, Atal discloses a method of analyzing pronunciations (“The lexical constraints embodied in the pronunciation of words make it possible to recognize words in the presence of mis-recognized phonemes,” Atal, col. 10, lines 59-62.), the method comprising:  

determining a midpoint value for each of the regions of a phonetic embedding space (“The comparison of the first distance with the second distance is illustrated in FIG. 13. This figure shows geometrically the comparison of distances from 5 stored phonemes to a received phoneme (260) in a hypersphere,” Atal, col. 13, lines 36-39. Here, the center of each phoneme circle represents the midpoint value of the phoneme in space. The examples are in the English language.); 

receiving a segment of speech audio (“A speech-receiving device receives audio signals and converts the analog audio signals into digital signals,” Atal, col. 3, lines 64-66.); 

extracting a plurality of acoustic features from the segment of speech audio to determine a phoneme vector, each acoustic feature corresponding to a dimension of the embedding space (“Essentially, as will be discussed herein, the present invention provides a system and a method for representing acoustic signals in a high-dimensional, hyperspherical space that sharpens the boundaries between different speech pattern clusters. Using clusters with sharp boundaries improves the likelihood of correctly recognizing correct speech patterns,” Atal, col. 3, lines 52-58. And, “The computer converts the audio digital signals into a plurality of vectors in n-dimensional space. Each vector is transformed using singular value decomposition into a spherical shape. The computer compares a first distance from a center of the n-dimensional space to a point associated with a stored speech phoneme with a second distance from the center of the n-dimensional space to a point associated with the received speech phoneme,” Atal, col. 3, line 67 – col. 4, line 7.); and

determining in which phoneme region of an embedding space the phoneme vector exists (“The computer compares a first distance from a center of the n-dimensional space to a point associated with a stored speech phoneme with a second distance from the center of the n-dimensional space to a point associated with the received speech phoneme,” Atal, col. 4, lines 2-7.); and 
determining a distance between the phoneme vector and the midpoint value of the phoneme region, wherein the distance indicates a degree of incorrectness of a pronunciation of the phoneme (“The lexical constraints embodied in the pronunciation of words make it possible to recognize words in the presence of mis-recognized phonemes. For example, the word "lessons" with /l eh s n z/as the pronunciation could be recognized as /l ah s ah z/ with two errors, the phonemes /eh/ and /n/ mis-recognized as /ah/ and /ah/, respectively. Accurate word recognition can be achieved by finding 4 closest phonemes, not just the closest one in comparing distances,” Atal, col. 10, lines 59-67. Here, mis-recognition is interpreted as a mis-pronunciation.).
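
For illustration only, the distance comparison relied upon above (distances from stored phoneme points to a received phoneme point, with the option of keeping several closest phonemes rather than only one) can be sketched as follows. This is a minimal sketch and not Atal's implementation: Euclidean distance is an assumption, the singular value decomposition into a hyperspherical shape is not reproduced, and the function name `closest_phonemes` and the 2-D midpoint values are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def closest_phonemes(received: Sequence[float],
                     stored: Dict[str, Sequence[float]],
                     k: int = 1) -> List[Tuple[str, float]]:
    """Return the k stored phonemes nearest to the received vector,
    as (label, distance) pairs sorted by increasing distance."""
    ranked = sorted(
        ((label, math.dist(received, point)) for label, point in stored.items()),
        key=lambda pair: pair[1],
    )
    return ranked[:k]


# Hypothetical 2-D midpoints for three phoneme regions.
midpoints = {"eh": (0.0, 0.0), "ah": (3.0, 4.0), "n": (10.0, 0.0)}
received_vector = (0.0, 1.0)

# Nearest region and the distance of the received vector from its midpoint.
label, distance = closest_phonemes(received_vector, midpoints)[0]

# Keeping more than one candidate, as in the "closest phonemes" passage.
top_two = closest_phonemes(received_vector, midpoints, k=2)
```

In this sketch, the returned distance from the nearest midpoint plays the role of the claimed degree of incorrectness: a larger value means the received vector lies farther from the center of its phoneme region.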
 
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim 2 is rejected under 35 U.S.C. 103 as being unpatentable over US 20180268806, hereinafter referred to as Chun et al., in view of US 20040230432, hereinafter referred to as Liu et al.
Regarding claim 2, Chun et al. discloses the method of claim 1, but not further comprising training a separate model for vowels in the phonetic alphabet and a separate model for consonants in the phonetic alphabet. Liu et al. is cited to disclose training a separate model for vowels in the phonetic alphabet and a separate model for consonants in the phonetic alphabet (“A second aspect of the invention is directed to a method of training audio classification models. The method comprises receiving a training audio signal and receiving phoneme classes corresponding to the training audio signal. A first Hidden Markov Model (HMM) is trained based on the training audio signal and the phoneme classes. The first HMM classifies speech as belonging to a vowel class when the first HMM determines that the speech corresponds to a sound represented by a set of phonemes that define vowels. A second HMM is trained based on the training audio signal and the phoneme classes. The second HMM classifies speech as belonging to a fricative class when the second HMM determines that the speech corresponds to a sound represented by a set of phonemes that define consonants,” Liu et al., para [0015].). Liu et al. benefits Chun et al. by more efficiently classifying segments of an audio signal such that the computational burden associated with a complete phoneme is reduced when attempting to process speech in real-time (Liu et al., para [0011]-[0012]). Therefore, it would be obvious for one skilled in the art to combine the teachings of Chun et al. with those of Liu et al. to improve the speech autoencoder of Chun et al. 

Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over US 20180268806, hereinafter referred to as Chun et al., in view of US 7369993, hereinafter referred to as Atal.

Regarding claim 5, Chun et al. discloses the method of claim 4, but not wherein applying the model comprises: 

determining (a) for each phoneme within the word, (b) for the average location of that phoneme in the space, (c) what region of the borrowing language contains that average phoneme location (“The comparison of the first distance with the second distance is illustrated in FIG. 13. This figure shows geometrically the comparison of distances from 5 stored phonemes to a received phoneme (260) in a hypersphere,” Atal, col. 13, lines 36-39.). Atal benefits Chun et al. by representing both stored and received phoneme segments in high-dimensional space and transforming the phoneme representation into a hyperspherical shape, thereby improving the likelihood of correctly recognizing correct speech patterns (Atal, col. 3, lines 42-58). Therefore, it would be obvious for one skilled in the art to combine the teachings of Chun et al. with those of Atal to improve the speech recognition of Chun et al.


Claims 6-7 and 9-11 are rejected under 35 U.S.C. 103 as being unpatentable over US 20180268806, hereinafter referred to as Chun et al., in view of US 20130130212, hereinafter referred to as Dohring et al.

Regarding claim 6, Chun et al. discloses a method of rendering speech (“Methods, systems, and computer-readable media for text-to-speech synthesis using an autoencoder,” Chun et al., Abstract.) 

extracting a segment of speech audio (“In the illustrated example, the TTS system 102 obtains a training example 106, which includes a linguistic label 106a and associated audio data 106b. For example, the label 106a indicates that the audio data 106b represents an "/e/" phone. In some implementations, the TTS system 102 may extract examples representing individual linguistic units from longer audio segments,” Chun et al., para [0031].); and

(“During stage (D), the linguistic encoder 114 outputs an embedding 118a in response to the linguistic unit identifier 108. The acoustic encoder 116 outputs an embedding 118b in response to the acoustic feature vectors 110. Embeddings 118a and 118b can be the same size as each other, and can be the same size for all linguistic units and lengths of audio data. For example, the embeddings 118a and 118b may be 32-bit vectors,” Chun et al., para [0039]. Each vector representing an acoustic feature corresponds to a dimension.).
Chun et al., though, does not disclose rendering the speech on a 2D display; determining a location of the phoneme on a language-specific user interface; and projecting the phoneme vector in the embedding space to a display vector in the 2D space.
Dohring et al. is cited to disclose rendering the speech on a 2D display (Dohring et al., fig. 3. The user may play the phonetic sound via the 2D display.);   
determining a location of the phoneme on a language-specific user interface (Dohring et al., fig. 3. The location of the phoneme on the language-specific user interface is highlighted (thereby determined).); and 
projecting the phoneme vector in the embedding space to a display vector in the 2D space (Dohring et al., fig. 3.). Dohring et al. benefits Chun et al. by providing a language learner with an interface for practicing phoneme pronunciation (Dohring et al., para [0006]), thereby extending the speech synthesis application of Chun et al. Therefore, it would be obvious for one skilled in the art to combine the teachings of Chun et al. with those of Dohring et al. to enhance the speech autoencoder of Chun et al.

Regarding claim 7, Chun et al., as modified by Dohring et al., discloses the method of claim 6 further comprising: 

repeating for a string of consecutive speech audio segments to extract a string of phoneme vectors (“In still further embodiments, interacting with a button activates an audio representation of the sound of the phoneme in the form of a model pronunciation voiceover. Continuing to refer to FIG. 1, in a particular embodiment, the software module for practicing each phoneme further includes access to a software module for practicing each phoneme in the context of the beginning, middle, and end of words of the target language,” Dohring et al., para [0043].); and 

animating the string of phoneme vectors on a 2D display (“In further embodiments, the software module for practicing phonemes and the software module for recording and comparing pronunciations are in communication such that the model pronunciations for comparison are coordinated with the phonemes and words currently practiced,” Dohring et al., para [0050]. Pronouncing the string of the displayed phonemes is a form of animation.).  

Regarding claim 9, Chun et al., as modified by Dohring et al., discloses the method of claim 6, wherein the language-specific user interface is for English (Dohring et al., para [0068]).  

Regarding claim 10, Chun et al., as modified by Dohring et al., discloses the method of claim 6, wherein the language-specific user interface is for Italian (Dohring et al., para [0068]).  

Regarding claim 11, Chun et al., as modified by Dohring et al., discloses the method of claim 6, wherein the placing is done in real-time (Dohring et al., para [0050].).

Claims 8 and 12 are rejected under 35 U.S.C. 103 as being unpatentable over US 20180268806, hereinafter referred to as Chun et al., in view of US 20130130212, hereinafter referred to as Dohring et al., and further in view of US 6632094, hereinafter referred to as Falcon et al.
Regarding claim 8, Chun et al., as modified by Dohring et al., discloses the method of claim 7, but not wherein each word in the string of phoneme vectors representing a word is animated together so that phonemes in a word are highlighted at the same time. Falcon et al. is cited to disclose wherein each word in the string of phoneme vectors representing a word is animated together so that phonemes in a word are highlighted at the same time (“At any time, except while a synchronized narration/text highlighting is in progress, a child can click on individual words in the story text displayed in text box 5 and hear them pronounced. The way in which the words are pronounced is controllable to be set in one of two modes. In the first mode, a word that has been selected by clicking the text is pronounced only once in whole word form. In the second mode, the selected word is first broken down at the phonetic level (individual phonic elements are highlighted and pronounced in sequence), and then presented as a whole word (highlighted and spoken synchronously),” Falcon et al., col. 8, lines 23-33.). Falcon et al. benefits Chun et al. by providing a reading readiness system for prereaders, thereby increasing early readers’ understanding of the relationship between written and oral language (Falcon et al., col. 3, lines 37-41) and extending the speech synthesis application of Chun et al. Therefore, it would be obvious for one skilled in the art to combine the teachings of Chun et al. with those of Falcon et al. to enhance the speech autoencoder of Chun et al.

Regarding claim 12, Chun et al., as modified by Dohring et al., discloses the method of claim 6, but not wherein the phonemes are highlighted for different durations, in accordance with their duration in the speech audio. Falcon et al. is cited to disclose wherein the phonemes are highlighted for different durations, in accordance with their duration in the speech audio (“At any time, except while a synchronized narration/text highlighting is in progress, a child can click on individual words in the story text displayed in text box 5 and hear them pronounced. The way in which the words are pronounced is controllable to be set in one of two modes. In the first mode, a word that has been selected by clicking the text is pronounced only once in whole word form. In the second mode, the selected word is first broken down at the phonetic level (individual phonic elements are highlighted and pronounced in sequence), and then presented as a whole word (highlighted and spoken synchronously),” Falcon et al., col. 8, lines 23-33.). Falcon et al. benefits Chun et al. by providing a reading readiness system for prereaders, thereby increasing early readers’ understanding of the relationship between written and oral language (Falcon et al., col. 3, lines 37-41) and extending the speech synthesis application of Chun et al. Therefore, it would be obvious for one skilled in the art to combine the teachings of Chun et al. with those of Falcon et al. to enhance the speech autoencoder of Chun et al.

Claims 14-18 are rejected under 35 U.S.C. 103 as being unpatentable over US 7369993, hereinafter referred to as Atal, in view of US 20130130212, hereinafter referred to as Dohring et al.
Regarding claim 14, Atal, as modified by Dohring et al., discloses the method of claim 13, but not wherein the determining is done in real time. Dohring et al. is cited to disclose wherein the determining is done in real time (Dohring et al., para [0050].). Dohring et al. benefits Atal by providing a language learner with an interface for practicing phoneme pronunciation (Dohring et al., para [0006]), thereby extending the speech recognition application of Atal. Therefore, it would be obvious for one skilled in the art to combine the teachings of Atal with those of Dohring et al. to enhance the speech recognition of Atal.

Regarding claim 15, Atal, as modified by Dohring et al., discloses the method of claim 13, but not wherein the segment of audio speech is received by way of an app accessing a microphone in a mobile device. Dohring et al. is cited to disclose wherein the segment of audio speech is received by way of an app accessing a microphone in a mobile device (“In some embodiments, the software module for recording a language learner's pronunciation accesses a microphone associated with the digital processing device. In further embodiments, the microphone is integrated with the processing device. In other embodiments, the microphone is reversibly, but operably connected to the processing device. In still further embodiments, the software module uses APIs of the operating system, a web browser, or another software application to communicate with a microphone associated with the processing device,” Dohring et al., para [0061].). Dohring et al. benefits Atal by providing a language learner with an interface for practicing phoneme pronunciation (Dohring et al., para [0006]), thereby extending the speech recognition application of Atal. Therefore, it would be obvious for one skilled in the art to combine the teachings of Atal with those of Dohring et al. to enhance the speech recognition of Atal.

Regarding claim 16, Atal, as modified by Dohring et al., discloses the method of claim 13, but not further comprising: displaying a user interface comprising a plurality of phonemes representing the speech audio and indicating incorrect pronunciation of the speech audio.
Dohring et al. is cited to disclose displaying a user interface comprising a plurality of phonemes representing the speech audio and indicating incorrect pronunciation of the speech audio (“In further embodiments, spoken word or voiceover audio is used to alert the learner of an incorrect response and provide an example of a more correct response. In some embodiments, spoken word or voiceover audio is used to encourage a learner,” Dohring et al., para [0058]. And, Dohring et al., para [0056], describes display.). Dohring et al. benefits Atal by providing a language learner with an interface for practicing phoneme pronunciation (Dohring et al., para [0006]), thereby extending the speech recognition application of Atal. Therefore, it would be obvious for one skilled in the art to combine the teachings of Atal with those of Dohring et al. to enhance the speech recognition of Atal.

Regarding claim 17, Atal, as modified by Dohring et al., discloses the method of claim 13, but not wherein a human speaker is prompted to enter the segment of audio speech in a freestyle mode. Dohring et al. is cited to disclose wherein a human speaker is prompted to enter the segment of audio speech in a freestyle mode (“In further embodiments, an auditory representation includes, by way of non-limiting examples, a recorded model pronunciation of a phoneme, a word, a sentence, or a conversation, and a computer generated pronunciation of a phoneme, a word, a sentence, or a conversation,” Dohring et al., para [0041].). Dohring et al. benefits Atal by providing a language learner with an interface for practicing phoneme pronunciation (Dohring et al., para [0006]), thereby extending the speech recognition application of Atal. Therefore, it would be obvious for one skilled in the art to combine the teachings of Atal with those of Dohring et al. to enhance the speech recognition of Atal.
 
Regarding claim 18, Atal, as modified by Dohring et al., discloses the method of claim 13, but not further comprising: displaying a user interface comprising a transcription of the speech audio and indicating incorrect pronunciation of the speech audio.

Dohring et al. is cited to disclose displaying a user interface comprising a transcription of the speech audio and indicating incorrect pronunciation of the speech audio (“In further embodiments, spoken word or voiceover audio is used to alert the learner of an incorrect response and provide an example of a more correct response. In some embodiments, spoken word or voiceover audio is used to encourage a learner,” Dohring et al., para [0058]. And, Dohring et al., para [0056], describes display of text.). Dohring et al. benefits Atal by providing a language learner with an interface for practicing phoneme pronunciation (Dohring et al., para [0006]), thereby extending the speech recognition application of Atal. Therefore, it would be obvious for one skilled in the art to combine the teachings of Atal with those of Dohring et al. to enhance the speech recognition of Atal.


Conclusion
Other related prior art is listed in the attached PTO-892.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ANNE L THOMAS-HOMESCU whose telephone number is (571)272-0899.  The examiner can normally be reached on Mon-Fri 8-6.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached at (571)272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.



/ANNE L THOMAS-HOMESCU/
Primary Examiner, Art Unit 2659