DETAILED ACTION

This communication is in response to the Application filed on 29 January 2020. Claims 1-20 are pending and have been examined.
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement
The three information disclosure statements (IDS), two submitted on 28 June 2020 and one submitted on 03 December 2019, are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.


Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-2, 4, 6-7, and 15-16 are rejected under 35 U.S.C. 103 as being unpatentable over US 20200258496, hereinafter referred to as Yang et al., in view of US 20200250794, hereinafter referred to as Zimmer et al.

Regarding claim 1, Yang et al. discloses a speech conversion system (“A method of performing speech synthesis, includes encoding character embeddings, using any one or any combination of convolutional neural networks (CNNs) and recurrent neural networks (RNNs), applying a relative-position-aware self attention function to each of the character embeddings and an input mel-scale spectrogram, and encoding the character embeddings to which the relative-position-aware self attention function is applied,” Yang et al., Abstract.), comprising: 

a processor (Yang et al., para [0007]); and 

memory storing instructions executable by the processor (Yang et al., para [0007]), the instructions comprising, to: 

using a second recurrent neural network (RNN) (GRU1) and a first set of encoder vectors derived from a spectrogram as input to the second RNN (“Character embeddings 450 related to at least one linguistic feature are used as input to predict a mel-scale spectrogram 460 related to at least one acoustic feature that is output,” Yang et al., para [0041]. And, “In operation 520, the method 500 includes applying a relative-position-aware self attention function to each of the character embeddings and an input mel-scale spectrogram,” Yang et al., para [0046].), determine a second concatenated sequence (“In operation 540, the method 500 includes concatenating the encoded character embeddings and the encoded character embeddings to which the relative-position-aware self attention function is applied, to generate an encoder output,” Yang et al., para [0051].); 

using the second set of encoder vectors, determine a third set of encoder vectors (“The program code includes first encoding code configured to cause the at least one processor to encode character embeddings, using any one or any combination of convolutional neural networks (CNNs) and recurrent neural (RNNs), first applying code configured to cause the at least one processor to apply a relative-position-aware self attention function to each of the character embeddings and an input mel-scale spectrogram, and second encoding code configured to cause the at least one processor to encode the character embeddings to which the relative-position-aware self attention function is applied,” Yang et al., para [0007]. And, “FIG. 7 is a diagram of an apparatus 700 for performing speech synthesis, according to embodiments. As shown in FIG. 7, the apparatus 700 includes first encoding code 710, first applying code 720, second encoding code 730, concatenating code 740, second applying code 750 and predicting code 760,” Yang et al., para [0075].); and 

decode the third set of encoder vectors using an attention block (“The hybrid architecture includes a multi-tower hybrid encoder 420, an N-block self-attention-based decoder 430, and a multi-head attention 440 to connect the encoder 420 and the decoder 430,” Yang et al., para [0041]. And, “The method 500 may further include applying a layer normalization, a sequence of feed forward transformations, and a learned linear transformation to the encoder output and the input mel-scale spectrogram to which the multi-head attention function is applied, to generate a decoder output, and the predicting the output mel-scale spectrogram may include predicting the output mel-scale spectrogram, based on the decoder output,” Yang et al., para [0070].).
  
Yang et al., though, does not disclose determining a second set of encoder vectors by doubling a stack height and halving a length of the second concatenated sequence.

Zimmer et al. is cited to disclose determining a second set of encoder vectors by doubling a stack height and halving a length of the second concatenated sequence (“The encoder network contains multiple convolutional layers (denoted Conv and having the same types of parameters as the ones described in reference to FIG. 5), alternating with down-sampling (max-pooling, e.g. 2.times.2 max-pooling with a stride of 2) layers, stacked on top of each other, resulting in feature maps of halving size and doubling number,” Zimmer et al., para [0267].). Zimmer et al. benefits Yang et al. by improving processing time (Zimmer et al., para [0016]). Therefore, it would be obvious for one skilled in the art to combine the teachings of Yang et al. with those of Zimmer et al. to make the speech synthesis of Yang et al. more efficient. 
As to claim 15, method claim 15 and system claim 1 are related as a system and a method of using the same, with each claimed element’s function corresponding to the method step. Accordingly, claim 15 is rejected under the same rationale as applied above with respect to system claim 1.
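
For illustration only, the "doubling a stack height and halving a length" limitation discussed above for claim 1 can be read as pairing adjacent vectors of a sequence. The following is a minimal sketch under that assumption; the function name `stack_and_halve`, the pairing of adjacent time steps, and the dropping of an odd trailing vector are hypothetical choices and are not taken from the application, Yang et al., or Zimmer et al.

```python
from typing import List

Vector = List[float]


def stack_and_halve(sequence: List[Vector]) -> List[Vector]:
    """Concatenate each pair of adjacent vectors in the sequence.

    A sequence of T vectors of height D becomes T // 2 vectors of height 2 * D.
    Assumption: an odd trailing vector is dropped, since the handling of odd
    lengths is not specified in the text above.
    """
    usable = len(sequence) - (len(sequence) % 2)
    # List concatenation (+) stacks the second vector under the first.
    return [sequence[i] + sequence[i + 1] for i in range(0, usable, 2)]


# Hypothetical concatenated sequence: length T = 4, stack height D = 2.
concatenated = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
encoder_vectors = stack_and_halve(concatenated)
# encoder_vectors == [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
```

Under this reading the total number of values is preserved (for even T) while the sequence length seen by the next layer is halved.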

Regarding claim 2, Yang et al., as modified by Zimmer et al., discloses the system of claim 1, wherein the instructions further comprise to, prior to determining the second concatenated sequence: 

using a first RNN (GRU0) and a plurality of preprocessed encoder vectors as input to the first RNN, determine a first concatenated sequence (“The method further includes concatenating the encoded character embeddings and the encoded character embeddings to which the relative-position-aware self attention function is applied, to generate an encoder output,” Yang et al., para [0006]. Also, “In operation 540, the method 500 includes concatenating the encoded character embeddings and the encoded character embeddings to which the relative-position-aware self attention function is applied, to generate an encoder output,” Yang et al., para [0051]. And, “FIG. 7 is a diagram of an apparatus 700 for performing speech synthesis, according to embodiments. As shown in FIG. 7, the apparatus 700 includes first encoding code 710, first applying code 720, second encoding code 730, concatenating code 740, second applying code 750 and predicting code 760,” Yang et al., para [0075].); and 

determine the first set of encoder vectors by doubling a stack height and halving a length of the first concatenated sequence (“The encoder network contains multiple convolutional layers (denoted Conv and having the same types of parameters as the ones described in reference to FIG. 5), alternating with down-sampling (max-pooling, e.g. 2.times.2 max-pooling with a stride of 2) layers, stacked on top of each other, resulting in feature maps of halving size and doubling number,” Zimmer et al., para [0267].).  
As to claim 16, method claim 16 and system claim 2 are related as a system and a method of using the same, with each claimed element’s function corresponding to the method step. Accordingly, claim 16 is rejected under the same rationale as applied above with respect to system claim 2.
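
For comparison, the down-sampling step quoted from Zimmer et al., para [0267] (2x2 max-pooling with a stride of 2) can be sketched as below. This is a minimal illustration of the pooling operation only, assuming a single 2-D feature map stored as nested lists; the name `max_pool_2x2` is hypothetical, and the doubling of the number of feature maps, which would presumably come from the convolutional layers, is not shown.

```python
from typing import List

FeatureMap = List[List[float]]


def max_pool_2x2(feature_map: FeatureMap) -> FeatureMap:
    """Apply 2x2 max-pooling with a stride of 2 to one feature map.

    Each output value is the maximum of a non-overlapping 2x2 window, so the
    height and the width are both halved. Assumption: odd trailing rows or
    columns are ignored.
    """
    rows = len(feature_map) - (len(feature_map) % 2)
    cols = len(feature_map[0]) - (len(feature_map[0]) % 2) if feature_map else 0
    return [
        [
            max(
                feature_map[r][c],
                feature_map[r][c + 1],
                feature_map[r + 1][c],
                feature_map[r + 1][c + 1],
            )
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]


# Hypothetical 4x4 feature map.
feature_map = [
    [1.0, 3.0, 2.0, 0.0],
    [4.0, 2.0, 1.0, 5.0],
    [0.0, 1.0, 7.0, 2.0],
    [6.0, 2.0, 3.0, 4.0],
]
pooled = max_pool_2x2(feature_map)
# pooled == [[4.0, 5.0], [6.0, 7.0]]
```

Unlike a pairwise stacking of vectors, max-pooling discards values: each 2x2 window contributes only its largest element to the output.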

Regarding claim 4, Yang et al., as modified by Zimmer et al., discloses the system of claim 1, wherein the processor further uses a third RNN, wherein the third RNN receives, as input, the second set of encoder vectors and provides, as output, the third set of encoder vectors (“The program code includes first encoding code configured to cause the at least one processor to encode character embeddings, using any one or any combination of convolutional neural networks (CNNs) and recurrent neural (RNNs), first applying code configured to cause the at least one processor to apply a relative-position-aware self attention function to each of the character embeddings and an input mel-scale spectrogram, and second encoding code configured to cause the at least one processor to encode the character embeddings to which the relative-position-aware self attention function is applied,” Yang et al., para [0007]. And, “FIG. 7 is a diagram of an apparatus 700 for performing speech synthesis, according to embodiments. As shown in FIG. 7, the apparatus 700 includes first encoding code 710, first applying code 720, second encoding code 730, concatenating code 740, second applying code 750 and predicting code 760,” Yang et al., para [0075].).  


Regarding claim 6, Yang et al., as modified by Zimmer et al., discloses the system of claim 1, wherein the spectrogram is a mel-spectrogram (“Character embeddings 450 related to at least one linguistic feature are used as input to predict a mel-scale spectrogram 460 related to at least one acoustic feature that is output,” Yang et al., para [0041].).  


Regarding claim 7, Yang et al., as modified by Zimmer et al., discloses the system of claim 1, wherein the spectrogram comprises a plurality of concatenated vectors, wherein the spectrogram is a visual representation of a speech utterance (“Character embeddings 450 related to at least one linguistic feature are used as input to predict a mel-scale spectrogram 460 related to at least one acoustic feature that is output,” Yang et al., para [0041].).  


Claims 3, 5, 8-9, 11, 13-14, 17, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over US 20200258496, hereinafter referred to as Yang et al., in view of US 20200250794, hereinafter referred to as Zimmer et al., and further in view of “Tacotron: Towards End-to-End Speech Synthesis”, hereinafter referred to as Wang et al.

Regarding claim 3, Yang et al., as modified by Zimmer et al., discloses the system of claim 2, but not wherein the first and second RNNs are gated recurrent unit (GRUs) and each are bidirectional pass. Wang et al. is cited to disclose wherein the first and second RNNs are gated recurrent unit (GRUs) and each are bidirectional pass (“CBHG consists of a bank of 1-D convolutional filters, followed by highway networks (Srivastava et al., 2015) and a bidirectional gated recurrent unit (GRU) (Chung et al., 2014) recurrent neural net (RNN)…Finally, we stack a bidirectional GRU RNN on top to extract sequential features from both forward and backward context,” Wang et al., section 3.1.). Wang et al. benefits Yang et al. by using a bottleneck layer with dropout as a pre-net (i.e., a set of non-linear transformations) which helps convergence and improves generalization (Wang et al., section 3.2). Therefore, it would be obvious for one skilled in the art to combine the teachings of Yang et al. with those of Wang et al. to improve the speech synthesis of Yang et al. 

Regarding claim 5, Yang et al., as modified by Zimmer et al., discloses the system of claim 4, but not wherein the third RNN is a gated recurrent unit (GRU) and is bidirectional pass. Wang et al. is cited to disclose wherein the third RNN is a gated recurrent unit (GRU) and is bidirectional pass (“Finally, we stack a bidirectional GRU RNN on top to extract sequential features from both forward and backward context,” Wang et al., section 3.1.). Wang et al. benefits Yang et al. by using a bottleneck layer with dropout as a pre-net (i.e., a set of non-linear transformations) which helps convergence and improves generalization (Wang et al., section 3.2). Therefore, it would be obvious for one skilled in the art to combine the teachings of Yang et al. with those of Wang et al. to improve the speech synthesis of Yang et al. 


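Regarding claim 8, Yang et al., as modified by Zimmer et al., discloses the system of claim 1, but not wherein the instructions further comprise to, prior to determining the second set of encoded vectors: 

based on the input and using an encoder preprocessing neural network (PRENET) and a convolutional filter-banks and highways (CFBH) layer, determine a plurality of preprocessed encoder vectors; and

using a first RNN (GRU0) and the plurality of preprocessed encoder vectors as input to the first RNN, determine the first set of encoder vectors.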
 
Wang et al. is cited to disclose based on the input and using an encoder preprocessing neural network (PRENET) and a convolutional filter-banks and highways (CFBH) layer, determine a plurality of preprocessed encoder vectors (“The convolution outputs are fed into a multi-layer highway network to extract high-level features,” Wang et al., section 3.1. See also Wang et al., fig. 2. “We then apply a set of non-linear transformations, collectively called a "pre-net", to each embedding. We use a bottleneck layer with dropout as the pre-net in this work, which helps convergence and improves generalization. A CBHG module transforms the prenet outputs into the final encoder representation used by the attention module,” Wang et al., section 3.2.); and 

using a first RNN (GRU0) and the plurality of preprocessed encoder vectors as input to the first RNN, determine the first set of encoder vectors (Wang et al., fig. 2.). Wang et al. benefits Yang et al. by using a bottleneck layer with dropout as a pre-net (i.e., a set of non-linear transformations) which helps convergence and improves generalization (Wang et al., section 3.2). Therefore, it would be obvious for one skilled in the art to combine the teachings of Yang et al. with those of Wang et al. to improve the speech synthesis of Yang et al. 

As to claim 17, method claim 17 and system claim 8 are related as a system and a method of using the same, with each claimed element’s function corresponding to the method step. Accordingly, claim 17 is rejected under the same rationale as applied above with respect to system claim 8.

Regarding claim 9, Yang et al., as modified by Zimmer et al., discloses the system of claim 1, but not wherein the instructions further comprise to: at the attention block, iteratively generate an attention context vector; and provide the attention context vector.

Wang et al. is cited to disclose wherein the instructions further comprise to: at the attention block, iteratively generate an attention context vector (Wang et al., fig. 1 shows how the attention block is continuously updated.); and 

provide the attention context vector (Wang et al., fig. 1 shows output of Attention block.). Wang et al. benefits Yang et al. by using a bottleneck layer with dropout as a pre-net (i.e., a set of non-linear transformations) which helps convergence and improves generalization (Wang et al., section 3.2). Therefore, it would be obvious for one skilled in the art to combine the teachings of Yang et al. with those of Wang et al. to improve the speech synthesis of Yang et al. 

Regarding claim 11, Yang et al., as modified by Zimmer et al., discloses the system of claim 1, but not wherein the instructions further comprise to: 

at the attention block: receive as input one of the third set of encoded vectors;

at the attention block: receive as input at least one of a set of decoder hidden vectors;

at the attention block: determine an attention context vector; and

provide the attention context vector.

Wang et al. is cited to disclose at the attention block: receive as input one of the third set of encoded vectors (Wang et al., fig. 1.); 

at the attention block: receive as input at least one of a set of decoder hidden vectors (Wang et al., fig. 1.); 

at the attention block: determine an attention context vector (Wang et al., fig. 1.); and 

provide the attention context vector (Wang et al., fig. 1). Wang et al. benefits Yang et al. by using a bottleneck layer with dropout as a pre-net (i.e., a set of non-linear transformations) which helps convergence and improves generalization (Wang et al., section 3.2). Therefore, it would be obvious for one skilled in the art to combine the teachings of Yang et al. with those of Wang et al. to improve the speech synthesis of Yang et al. 
As to claim 20, method claim 20 and system claim 11 are related as a system and a method of using the same, with each claimed element’s function corresponding to the method step. Accordingly, claim 20 is rejected under the same rationale as applied above with respect to system claim 11.
 
Regarding claim 13, Yang et al., as modified by Zimmer et al., discloses the system of claim 1, but not wherein the instruction to decode further comprises to: 

determine a set of hidden decoder vectors by receiving as input, at an attention recurrent neural network (RNN), a first set of decoder vectors, wherein at least one of the first set of decoder vectors comprises a concatenation of an attention context vector and at least one of a plurality of preprocessed decoder vectors;

using a residual decoder stack and the set of hidden decoder vectors, determine a set of decoder output vectors;

feedback at least one of the set of decoder output vectors as input to a decoder preprocessing neural network (PRENET); and

use the decoder PRENET to determine and update the plurality of preprocessed decoder vectors.

Wang et al. is cited to disclose determining a set of hidden decoder vectors by receiving as input, at an attention recurrent neural network (RNN), a first set of decoder vectors (Wang et al., fig. 1 and table 1, Attention RNN block. And, “We use a content-based tanh attention decoder (see e.g. Vinyals et al. (2015)), where a stateful recurrent layer produces the attention query at each decoder time step. We concatenate the context vector and the attention RNN cell output to form the input to the decoder RNNs,” Wang et al., section 3.3.), wherein at least one of the first set of decoder vectors comprises a concatenation of an attention context vector and at least one of a plurality of preprocessed decoder vectors (Wang et al., table 1, Decoder pre-net.);

using a residual decoder stack and the set of hidden decoder vectors, determine a set of decoder output vectors (Wang et al., table 1, Decoder RNN. And, “We use a content-based tanh attention decoder (see e.g. Vinyals et al. (2015)), where a stateful recurrent layer produces the attention query at each decoder time step. We concatenate the context vector and the attention RNN cell output to form the input to the decoder RNNs,” Wang et al., section 3.3.); 

feedback at least one of the set of decoder output vectors as input to a decoder preprocessing neural network (PRENET) (Wang et al., fig. 1 shows feedback from the decoder output to the pre-net (i.e., dotted line).); and 

use the decoder PRENET to determine and update the plurality of preprocessed decoder vectors (Wang et al., fig. 1 shows feedback from the decoder output to the pre-net (i.e., dotted line). The feedback to the pre-net updates the preprocessed decoder vectors.). Wang et al. benefits Yang et al. by using a bottleneck layer with dropout as a pre-net (i.e., a set of non-linear transformations) which helps convergence and improves generalization (Wang et al., section 3.2). Therefore, it would be obvious for one skilled in the art to combine the teachings of Yang et al. with those of Wang et al. to improve the speech synthesis of Yang et al. 
 
Regarding claim 14, Yang et al., as modified by Zimmer et al. and Wang et al., discloses the system of claim 13, wherein the instruction to decode further comprises to: in response to receiving an updated attention context vector, provide an updated at least one of the set of decoder output vectors to the decoder .  


Claims 10 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over US 20200258496, hereinafter referred to as Yang et al., in view of US 20200250794, hereinafter referred to as Zimmer et al., further in view of “Tacotron: Towards End-to-End Speech Synthesis”, hereinafter referred to as Wang et al., and further in view of US 20200012953, hereinafter referred to as Sun et al.

Regarding claim 10, Yang et al., as modified by Zimmer et al. and Wang et al., discloses the system of claim 9, but not wherein the instructions further comprise to: 

determine a best match vector from among the third set of encoder vectors by comparing the third set of encoder vectors to a previous-best match vector; and

provide the attention block with the best match vector in order to determine an updated attention context vector.

Sun et al. is cited to disclose determining a best match vector from among the third set of encoder vectors by comparing the third set of encoder vectors to a previous-best match vector (“calculating matching degrees between input hidden states in the input hidden state sequence and a prediction state of the target position in the to-be-generated prediction state sequence based on the current hidden state,” Sun et al., para [0095].); and 

providing the attention block with the best match vector in order to determine an updated attention context vector (“calculating an attention weight of each of the input hidden states on the prediction state of the target position based on the matching degrees; performing a weighted sum of the input hidden states according to the attention weights to obtain a context vector,” Sun et al., para [0095].). Sun et al. benefits Yang et al. by providing a means for providing the best match encoded vector to the attention block (Sun et al., para [0095]). Therefore, it would be obvious for one skilled in the art to combine the teachings of Yang et al. with those of Sun et al. to improve the speech synthesis method of Yang et al. 
As to claim 19, method claim 19 and system claim 10 are related as a system and a method of using the same, with each claimed element’s function corresponding to the method step. Accordingly, claim 19 is rejected under the same rationale as applied above with respect to system claim 10.
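
For illustration only, the sequence of steps quoted from Sun et al., para [0095] (matching degrees, attention weights, weighted sum yielding a context vector) can be sketched as follows. This is a minimal sketch and not Sun et al.'s implementation: the dot product used as the matching degree, the softmax used to turn matching degrees into weights, and the names `attention_context`, `encoder_states`, and `decoder_state` are all assumptions made for this example.

```python
import math
from typing import List, Tuple

Vector = List[float]


def attention_context(
    encoder_states: List[Vector], decoder_state: Vector
) -> Tuple[List[float], Vector]:
    """Return (attention weights, context vector) for one decoder step."""
    # Matching degree between each input hidden state and the decoder state.
    # Assumption: a plain dot product stands in for the matching function.
    scores = [
        sum(h * s for h, s in zip(state, decoder_state))
        for state in encoder_states
    ]
    # Attention weights: softmax over the matching degrees (assumption),
    # shifted by the maximum score for numerical stability.
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: weighted sum of the input hidden states.
    dim = len(encoder_states[0])
    context = [
        sum(w * state[i] for w, state in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context


# Hypothetical inputs: two encoder hidden states and one decoder state.
encoder_states = [[1.0, 0.0], [0.0, 1.0]]
weights, context = attention_context(encoder_states, [2.0, 0.0])
```

In this example the first encoder state has the higher matching degree with the decoder state, so it receives the larger weight and dominates the context vector.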

Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over US 20200258496, hereinafter referred to as Yang et al., in view of US 20200250794, hereinafter referred to as Zimmer et al., and further in view of US 20200012953, hereinafter referred to as Sun et al.

Regarding claim 12, Yang et al., as modified by Zimmer et al., discloses the system of claim 1, but not specifically stating wherein the third set of encoded vectors are a set of hidden encoder vectors.

Sun et al. is cited to disclose wherein the third set of encoded vectors are a set of hidden encoder vectors (“constructing an input sequence based on the sample sentence; mapping the input sequence to an input hidden state sequence using the encoder,” Sun et al., para [0044].). Sun et al. benefits Yang et al. by providing a means for providing the best match encoded vector to the attention block (Sun et al., para [0095]). Therefore, it would be obvious for one skilled in the art to combine the teachings of Yang et al. with those of Sun et al. to improve the speech synthesis method of Yang et al. 


Conclusion
Other related prior art is listed in the attached PTO-892. Of particular interest are Lee et al. and Bahdanau et al., both of which describe methods of neural machine translation.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ANNE L THOMAS-HOMESCU whose telephone number is (571)272-0899.  The examiner can normally be reached on Mon-Fri 8-6.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached at (571)272-7453.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.



/ANNE L THOMAS-HOMESCU/Primary Examiner, Art Unit 2659