Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Specification

The title of the invention is not descriptive.  A new title is required that is clearly indicative of the invention to which the claims are directed.  The title should reflect the particular speech modeling in the claims, and not generic ‘analysis’ of a speech signal. 

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

Claim Rejections - 35 USC § 102

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.


Claim(s) 8, 15, 17-20, 22, 24-26 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Chun et al (20180268806).

As per claim 8, Chun et al (20180268806) teaches a computer-implemented method for estimating aspects of voice data (as estimating acoustic and linguistic voice parameters – abstract, Fig.1a), the method comprising: 
learning a deep generation model, wherein the deep generation model (as using neural networks as part of an autoencoder – para 0011) comprises: 
an encoder estimating a first parameter (as using an autoencoder for linguistic and acoustic estimation – para 0079) included in a first fundamental frequency pattern of speech signal in a first voice data as a latent variable (as analyzing and matching fundamental frequencies in a first candidate voice for diphones – para 0073; and operating on acoustic features derived from stored utterance/speech signals – para 0031, wherein the acoustic features are derived from stored speech signals) of the deep generation model (as part of the consecutive layers – para 0073, wherein the layers are part of a neural network/deep generation model – para 0077) using parallel data between the first fundamental frequency pattern of speech signal in the first voice data and the first parameter included in the first fundamental frequency pattern of speech signal in the first voice data (as parallel data of the fundamental frequency for the diphone pairs – para 0073, see also Fig. 1b, subblocks 132a/b showing the parallel layers and lattice structure in 138a/b/c/n; and operating on acoustic features derived from stored utterance/speech signals – para 0031, wherein the acoustic features are derived from stored speech signals), 
and a decoder reconstructing the first fundamental frequency pattern of speech signal in the first voice data based on the latent variable of the deep generation model (as decoding – para 0078, and para 0051 showing the decoder performing on the same variables, ie fundamental frequency, as the encoder); 
estimating, based on a second fundamental frequency pattern of speech signal in a second voice data, a second parameter included in the second fundamental frequency pattern of speech signal using the encoder of the deep generation model; and estimating, based on a third parameter included in a third fundamental frequency pattern in a third voice data, the third fundamental frequency pattern using the decoder of the deep generation model (Examiner notes that applicant's specification details a single fundamental frequency throughout, and that the claimed ‘second fundamental frequency’ and ‘third fundamental frequency’ are essentially the fundamental frequency for a second and a third voice data, and NOT a second/third fundamental frequency of a singular set of voice data; under this definition (via applicant's specification), Chun et al (20180268806) teaches repeating the finding of a fundamental frequency for every phone label/acoustic unit – fig. 1a, subblocks 106a + 106b, tracing through the processing D,E,F,G, as detailed partially in para 0073, fundamental frequencies F0).


As per claim 15, Chun et al (20180268806) teaches a system for estimating aspects of voice data (as estimating acoustic and linguistic voice parameters – abstract, Fig.1a), the system comprises: 
a processor; and a memory storing computer-executable instructions that when executed by the processor cause the system to (as processor, memory, and computer instructions – para 0134):
learn a deep generation model, wherein the deep generation model (as using neural networks as part of an autoencoder – para 0011) comprises: an encoder estimating a first parameter (as using an autoencoder for linguistic and acoustic estimation – para 0079) included in a first fundamental frequency pattern of speech signal in a first voice data as a latent variable (as analyzing and matching fundamental frequencies in a first candidate voice for diphones – para 0073) of the deep generation model (as part of the consecutive layers – para 0073, wherein the layers are part of a neural network/deep generation model – para 0077) using parallel data between the first fundamental frequency pattern of speech signal in the first voice data and the first parameter included in the first fundamental frequency pattern of speech signal in the first voice data (as, parallel data of the fundamental frequency for the diphones pairs – para 0073, see also Fig. 1b, subblocks 132a/b showing the parallel layers and lattice structure in 138a/b/c/n), 
and a decoder reconstructing the first fundamental frequency pattern of speech signal in the first voice data based on the latent variable of the deep generation model (as decoding – para 0078, and para 0051 showing the decoder performing on the same variables, ie fundamental frequency, as the encoder); 
estimate, based on a second fundamental frequency pattern in a second voice data, a second parameter included in the second fundamental frequency pattern using the encoder of the deep generation model; and estimate, based on a third parameter included in a third fundamental frequency pattern in a third voice data, the third fundamental frequency pattern using the decoder of the deep generation model (Examiner notes that applicant's specification details a single fundamental frequency throughout, and that the claimed ‘second fundamental frequency’ and ‘third fundamental frequency’ are essentially the fundamental frequency for a second and a third voice data, and NOT a second/third fundamental frequency of a singular set of voice data; under this definition (via applicant's specification), Chun et al (20180268806) teaches repeating the finding of a fundamental frequency for every phone label/acoustic unit – fig. 1a, subblocks 106a + 106b, tracing through the processing D,E,F,G, as detailed partially in para 0073, fundamental frequencies F0).

As per claim 17, Chun et al (20180268806) teaches the system of claim 15, wherein each of the encoder and the decoder is configured using a convolutional neural network (as the use of general neural networks – para 0010, 0011 – examiner notes that it is old and notoriously well known in the art of neural networks to use convolutional neural networks for applications).

As per claim 18, Chun et al (20180268806) teaches the system of claim 15, wherein the first voice data is a learning data, wherein the first voice data and the second voice data are distinct, and wherein the first voice data and the third voice data are distinct (see Chun et al (20180268806), teaches repeating the finding of a fundamental frequency for every phone label/acoustic unit – fig. 1a, subblocks 106a + 106b, tracing through the processing D,E,F,G, as detailed partially in para 0073, fundamental frequencies F0).

As per claim 19, Chun et al (20180268806) teaches the system of claim 15, wherein the first fundamental frequency pattern of speech signal of the first voice data relates to one or more of: an interrogative sentence based on the ending of an utterance sentence, an intention of a speaker represented by the first voice data, a melody of a singer represented by the first voice data, and an emotion of the singer represented by the first voice data (examiner notes that the claim language is in the alternative, ie “one or more of…”, and hence only one element needs to be met; Chun et al (20180268806) teaches “intention of a speaker” – as, the linguistic processor chooses phones that are context-dependent – ie, phones representing the context/intent of what is spoken – para 0107).

As per claim 20, Chun et al (20180268806) teaches the system of claim 15, wherein the first parameter in the first fundamental frequency pattern of the first voice data represents one or more of: an accent of voice in the first voice data, and vibrato and overshoot in the first voice data as a singing voice (examiner notes that the claim language is in the alternative, ie “one or more of…”, and hence only one element needs to be met; Chun et al (20180268806) teaches voice accent, as the input can be a command – which would be a declarative (para 0141) or a query –which would be a question-type accent – para 0106 –e.g., where is….?).

Claims 22, 24-26 are computer readable medium claims whose structure and steps are performed by system claims 15, 17-20 above and as such, claims 22, 24-26 are similar in scope and content to system claims 15, 17-20; therefore, claims 22, 24-26 are rejected under similar rationale as presented against claims 15, 17-20 above.

Claim Rejections - 35 USC § 103

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 9-13, 16, 23 are rejected under 35 U.S.C. 103 as being unpatentable over Chun et al (20180268806) in view of Kameoka et al (“Generative Modeling of Voice Fundamental Frequency Contours”, IEEE/ACM transactions on audio/speech/language processing, pp 1042-1053, 2015).

As per claim 9, Chun et al (20180268806) teaches the computer-implemented method of claim 8 (as mapped above to claim 8), and teaches an output of the encoder having the latent variable as an input of the decoder (as decoding – para 0078, and para 0051 showing the decoder performing on the same variables, ie fundamental frequency, as the encoder), but does not explicitly teach the use of a path restricted HMM in calculating the distance between the desired output and the initial F0 (as claimed, the method further comprising: maximizing an output of an objective function for learning the deep generation model, wherein the objective function is based at least on: a distance between an output of the decoder having the first fundamental frequency pattern of the first voice data as an input and a prior distribution of the first parameter represented using a state sequence of a path-restricted hidden Markov model (HMM)); however, Kameoka et al (“Generative Modeling of Voice…”) teaches the use of a Markov Model path restricting (Fig. 3, reflecting back on Fig. 2), operating on the F0 fundamental frequency (see pp 1043, second column, starting with “Generative Model of Voice F0 contours”; and pp 1044).  Therefore, it would have been obvious to one of ordinary skill in the art of regenerative models/neural networks operating on fundamental frequency data of voice information to enhance the output of the acoustic model of Chun et al (20180268806) with an additional step of HMM path restriction, as taught by Kameoka et al (“Generative Modeling of Voice…”), because it would advantageously improve upon the accuracy and efficiency of estimating the F0 parameter (see Kameoka et al (“Generative Modeling of Voice…”), pp 1052, conclusion paragraph).

As per claim 10, Chun et al (20180268806) in view of Kameoka et al (“Generative Modeling of Voice…”) teaches the computer-implemented method of claim 9, wherein each of the encoder and the decoder is configured using a convolutional neural network (Chun et al (20180268806), as the use of general neural networks – para 0010, 0011 – examiner notes that it is old and notoriously well known in the art of neural networks to use convolutional neural networks for applications).

As per claim 11, the combination of Chun et al (20180268806) in view of Kameoka et al (“Generative Modeling of Voice…”) teaches the computer-implemented method of claim 9, wherein the first voice data is a learning data, wherein the first voice data and the second voice data are distinct, and wherein the first voice data and the third voice data are distinct (see Chun et al (20180268806), teaches repeating the finding of a fundamental frequency for every phone label/acoustic unit – fig. 1a, subblocks 106a + 106b, tracing through the processing D,E,F,G, as detailed partially in para 0073, fundamental frequencies F0).

As per claim 12, the combination of Chun et al (20180268806) in view of Kameoka et al (“Generative Modeling of Voice…”) teaches the computer-implemented method of claim 9, wherein the first fundamental frequency pattern of speech signal of the first voice data relates to one or more of: an interrogative sentence based on the ending of an utterance sentence, an intention of a speaker represented by the first voice data, a melody of a singer represented by the first voice data, and an emotion of the singer represented by the first voice data (examiner notes that the claim language is in the alternative, ie “one or more of…”, and hence only one element needs to be met; Chun et al (20180268806) teaches “intention of a speaker” – as, the linguistic processor chooses phones that are context-dependent – ie, phones representing the context/intent of what is spoken – para 0107).

As per claim 13, the combination of Chun et al (20180268806) in view of Kameoka et al (“Generative Modeling of Voice…”) teaches the computer-implemented method of claim 9, wherein the first parameter in the first fundamental frequency pattern of speech signal of the first voice data represents one or more of: an accent of voice in the first voice data, and vibrato and overshoot in the first voice data as a singing voice (examiner notes that the claim language is in the alternative, ie “one or more of…”, and hence only one element needs to be met; Chun et al (20180268806) teaches voice accent, as the input can be a command – which would be a declarative (para 0141) or a query –which would be a question-type accent – para 0106 –e.g., where is….?).

	Claims 16 and 23 are computer readable instructions claims performing the steps of method claim 9 and as such, claims 16 and 23 are similar in scope and content to claim 9 above and therefore, claims 16 and 23 are rejected under similar rationale as presented against claim 9 above.  Furthermore, Chun et al (20180268806) teaches processor, memory, and computer instructions – para 0134. 


Claims 14, 21, 27 are rejected under 35 U.S.C. 103 as being unpatentable over Chun et al (20180268806) in view of Kameoka et al (“Generative Modeling of Voice Fundamental Frequency Contours”, IEEE/ACM transactions on audio/speech/language processing, pp 1042-1053, 2015), in further view of Kato (20090204395).

As per claim 14, the combination of Chun et al (20180268806) in view of Kameoka et al (“Generative Modeling of Voice…”) teaches the computer-implemented method of claim 9, the method further comprising: receiving the first voice data; receiving the first fundamental frequency pattern of speech signal based on the received providing the synthesized voice data (as generating decoded synthesized voice data – Chun et al, para 0063, 0066).  Chun et al (20180268806), as part of the combination of Chun et al (20180268806) in view of Kameoka et al (“Generative Modeling of Voice…”), teaches the operations on a fundamental frequency F0 among other acoustic features – para 0041, on a plethora of speech units – para 0030, as noted above; but does not explicitly teach operating on a singing voice with parameters relating to vibrato/overshoot; however, Kato (20090204395) teaches generating voice characteristics different from normal utterances, such as singing voices in the vibrato range (para 0001) and measuring fundamental frequencies in these types of utterances – para 0208.  Therefore, it would have been obvious to one of ordinary skill in the art of synthesizing voices to further specify the system of Chun et al as part of the combination of Chun et al (20180268806) in view of Kameoka et al (“Generative Modeling of Voice…”) with operational functions to handle singing voices and accompanying vibrato information, as taught by Kato (20090204395), because it would advantageously allow for the user to specify a type of desired voice output (Kato, para 0210).
	Claims 21, 27 are computer readable executable instruction claims that perform the steps of claim 14 above and as such, claims 21, 27 are similar in scope and content to claim 14 above and therefore, claims 21, 27 are rejected under similar rationale as presented against claim 14 above.  Furthermore, Chun et al teaches a processor, memory, and computer instructions – para 0134.

Response to Arguments

Applicant's arguments filed 7/28/2022 have been fully considered but they are not persuasive.  On pp 10 to the middle of pp 11 of the response, applicant emphasizes certain elements of the claim scope, including the amended “speech signal” claim elements; examiner notes the new recitations to the Chun reference, explaining how the acoustic features, including fundamental frequencies, are extracted from audio data, said audio data originating from speech/voice utterances/signals.  From the middle of pp 11 of the response to the bottom of pp 12, applicant explains the workings of the Chun reference.  Included in this explanation, applicant proffers “Chun does not estimate a parameter included in a fundamental frequency of speech signal…first fundamental frequency pattern” – pp 11; and “Indeed, matching characteristics…has nothing to do with estimating the first parameter…frequency pattern of speech signal…deep generation model” – pp 12.  This is followed, on the top half of pp 13, by a generalized statement that “the systems and methods disclosed by Chun are fundamentally different from the present systems.  Chun does not anticipate claim 8”.  Examiner disagrees, and notes that 1) the mappings of the Chun reference are repeated above, namely the showing of Chun operating on fundamental frequency acoustic features derived from utterances/speech signals; 2) Chun teaches cost minimization by matching acoustic features, such as fundamental frequencies, of utterance derived diphones (para 0073); and 3) applicant's arguments do not comply with 37 CFR 1.111(c) because, although applicant's arguments repeat the claim scope and the cited portions of Chun, there is no explanation of any possible difference, other than “has nothing to do with”, “fundamentally different”, “does not anticipate”.  A similar pattern is repeated in the arguments presented on pp 13-15 of the response.
Examiner disagrees with the general allegations, and repeats the mappings presented to the prior art references, as noted above.  

Conclusion

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Representative prior art toward Autoencoders Using Neural Networks:
Bellegarda (9,842,105)
(113) As described, each of the phrases provided by the concatenation unit 504 may have a relatively high dimensionality as a result of concatenation of word representations. Accordingly, in some instances, the autoencoder 506 adjusts (e.g., reduces) dimensionality of each of the phrases to provide the plurality of encoded phrases. Adjusting dimensionality of a phrase in this manner may include reducing a dimensionality of the phrase and reconstructing the phrase. In some examples, the autoencoder 506 may be implemented using a neural network, such as a recurrent neural network or a feedforward neural network. Accordingly, the autoencoder 506 may reduce the dimensionality of a phrase using a first weight factor (e.g., a weight matrix) and reconstruct the phrase using a second weight factor. In some examples, the autoencoder 506 may reduce dimensionality of the phrase using multiple weight factors and/or may reconstruct the phrase using multiple weight factors. By reducing dimensionality and reconstructing in this manner, the autoencoder may provide encoded phrases with minimal reconstruction errors.
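For illustration only, and not as part of the Bellegarda disclosure, the two-weight-factor arrangement described in the quoted paragraph can be sketched as a linear autoencoder: a first weight matrix reduces a 4-dimensional input to a 2-dimensional code, and a second weight matrix reconstructs the input from that code. The names W_ENC, W_DEC, encode, reconstruct, and reconstruction_error, and all numeric values, are assumptions chosen for this sketch; a trained network would learn its weights rather than have them fixed by hand.

```python
# Hypothetical linear autoencoder; weights are hand-picked, not learned.

def matvec(matrix, vector):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

# First weight factor: reduces a 4-dimensional input to a 2-dimensional code
# by averaging each pair of neighbouring components.
W_ENC = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]

# Second weight factor: expands the 2-dimensional code back to 4 dimensions
# by copying each code component into both positions of its pair.
W_DEC = [
    [1.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [0.0, 1.0],
]

def encode(phrase):
    """Reduce dimensionality using the first weight factor."""
    return matvec(W_ENC, phrase)

def reconstruct(code):
    """Rebuild the full-size vector using the second weight factor."""
    return matvec(W_DEC, code)

def reconstruction_error(phrase):
    """Squared error between an input and its reconstruction."""
    rebuilt = reconstruct(encode(phrase))
    return sum((a - b) ** 2 for a, b in zip(phrase, rebuilt))

if __name__ == "__main__":
    print(encode([2.0, 2.0, 6.0, 6.0]))
    print(reconstruction_error([2.0, 2.0, 6.0, 6.0]))
    print(reconstruction_error([1.0, 3.0, 5.0, 5.0]))
```

In this sketch an input whose paired components are equal is reconstructed exactly, while an input that varies within a pair incurs a nonzero reconstruction error, which is the quantity an autoencoder of this kind would be trained to keep small.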

Representative prior art toward neural networks with HMM restrictive paths – 
Mishra (20180144746), para 0110-0113 – showing neural networks and markov chains implemented in unison.
Parthasarathi et al (20170270919), (34) Thus, the wakeword detection module 220 may compare audio data to stored models or data to detect a wakeword. One approach for wakeword detection applies general large vocabulary continuous speech recognition (LVCSR) systems to decode the audio signals, with wakeword searching conducted in the resulting lattices or confusion networks. LVCSR decoding may require relatively high computational resources. Another approach for wakeword spotting builds hidden Markov models (HMM) for each key wakeword word and non-wakeword speech signals respectively. The non-wakeword speech includes other spoken words, background noise etc. There can be one or more HMMs built to model the non-wakeword speech characteristics, which are named filler models. Viterbi decoding is used to search the best path in the decoding graph, and the decoding output is further processed to make the decision on keyword presence. This approach can be extended to include discriminative information by incorporating hybrid DNN-HMM decoding framework. In another embodiment the wakeword spotting system may be built on deep neural network (DNN)/recurrent neural network (RNN) structures directly, without HMM involved. Such a system may estimate the posteriors of wakewords with context information, either by stacking frames within a context window for DNN, or using RNN. Following-on posterior threshold tuning or smoothing is applied for decision making. Other techniques for wakeword detection, such as those known in the art, may also be used.
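For illustration only, and not as part of the Parthasarathi disclosure, the Viterbi best-path search mentioned in the quoted paragraph can be sketched over a toy two-state HMM with one "wakeword" state and one "filler" state. The state names, the observation symbols "a" and "b", and every probability below are assumptions invented for this sketch; a real wakeword spotter would operate on acoustic frames with trained models.

```python
# Hypothetical two-state HMM; all names and probabilities are illustrative.
STATES = ("filler", "wakeword")
START_P = {"filler": 0.8, "wakeword": 0.2}
TRANS_P = {
    "filler": {"filler": 0.7, "wakeword": 0.3},
    "wakeword": {"filler": 0.4, "wakeword": 0.6},
}
EMIT_P = {
    "filler": {"a": 0.9, "b": 0.1},
    "wakeword": {"a": 0.2, "b": 0.8},
}

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for the observations."""
    # Each column maps a state to (probability of the best path ending in
    # that state, predecessor state on that path).
    first = observations[0]
    columns = [{s: (start_p[s] * emit_p[s][first], None) for s in states}]
    for obs in observations[1:]:
        previous = columns[-1]
        column = {}
        for s in states:
            # Pick the predecessor that maximizes the path probability.
            column[s] = max(
                (previous[p][0] * trans_p[p][s] * emit_p[s][obs], p)
                for p in states
            )
        columns.append(column)
    # Backtrack from the most probable final state.
    path = [max(states, key=lambda s: columns[-1][s][0])]
    for column in reversed(columns[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path

if __name__ == "__main__":
    print(viterbi(["a", "b", "b"], STATES, START_P, TRANS_P, EMIT_P))
```

With these assumed values, the decoded path for the observation sequence a, b, b starts in the filler state and then stays in the wakeword state; a downstream decision step, as the quoted paragraph notes, would then decide on keyword presence.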



Any inquiry concerning this communication or earlier communications from the examiner should be directed to Michael Opsasnick, telephone number (571)272-7623, who is available Monday-Friday, 9am-5pm. 
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Mr. Richemond Dorvil, can be reached at (571)272-7602.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).

/Michael N Opsasnick/
Primary Examiner, Art Unit 2658
11/02/2022