Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Allowable Subject Matter
Claim 7 is objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Response to Arguments
Applicant’s arguments with respect to claim(s) 1-20 have been considered but are not persuasive. 
Regarding claims 1, 10 and 16: Applicant argues that the combination of the cited references fails to disclose obtaining predicted character probabilities for the utterance from the transcription neural network. Applicant bases this argument on the assertion that the combination of Yu and Graves does not disclose an audio utterance.
In reply, in view of the cited references, Examiner respectfully disagrees because:
Yu discloses in Section 0018, lines 1-12, that the input to the speech recognition system can be spoken words from an individual over a particular amount of time, captured using a microphone. This means that Yu discloses an audio utterance. The secondary reference (Graves) also mentions on page 2, lines 22-25, that LSTM handles a variety of sequence processing tasks, including speech (audio utterance) and handwriting recognition. Graves therefore discloses or suggests a neural network for speech transcription.
Regarding the limitation “obtaining predicted character probabilities for the utterance from the transcription neural network”: Yu mentions in Section 0044, lines 3-6, that the output sequence (predicted posterior probabilities) of the final layer is in senones, phonemes, words, etc.
This means Yu discloses obtaining predicted posterior probabilities for the utterance from the transcription neural network, but does not clearly disclose that the obtained predicted posterior probabilities are characters.
Graves discloses prediction using character-level language modelling with recurrent neural networks (page 6, lines 20-24, under Section 3, Text Prediction). Graves thus discloses character-level language modelling with neural networks (LSTM), where it is clearly disclosed that predicting one character at a time is more interesting from the perspective of sequence generation because it allows the network to invent novel words and strings.
Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the teaching of predicting one character at a time in training a neural network. The motivation is that it allows the system to predict at the finest granularity found in the data and also allows the network to invent novel words and strings.
Applicant argues on page 12 of the remarks filed on 11/26/2021 that the combination of Yu et al. (US2012/0072215) and Graves is improper because:
Graves does not disclose processing speech (audio utterance) but instead discloses text processing, and therefore it would not be appropriate for one of ordinary skill in the art to combine the two references under obviousness.
In reply, Examiner respectfully disagrees because Graves mentions on page 2, lines 22-25, that LSTM handles a variety of sequence processing tasks, including speech (audio utterance) and handwriting recognition. This clearly indicates that the neural network (LSTM) taught by Graves processes speech (audio utterance). Therefore it would have been obvious to one of ordinary skill in the art to modify the teaching of Yu to include the teaching of predicting on the character level instead of on the word level. This is practical because in pronunciation lessons (classes) students are taught to pronounce words on the basis of characters, not on the level of words. Examiner’s position is that modifying or enhancing the technology of pronunciation to the character level is a proper practice by one of ordinary skill in the art because Yu and Graves are analogous art.
Applicant argues that Yu and Graves would not be combinable as a technical matter because Yu recognizes text while Graves synthesizes new text.
In reply, Examiner respectfully disagrees because, based on Applicant’s own admission (Yu recognizes text while Graves synthesizes text), the two references are analogous prior art (solving the same problem and in the same field of endeavor, namely text processing). See MPEP 2141.01(a).
Another reason why the two references are combinable is that Yu clearly discloses a speech recognition system that receives spoken words from an individual using a microphone (see Section 0018). Graves also discloses that the LSTM discussed in its system deals with a variety of sequence processing tasks, including speech and handwriting recognition. Even though the Graves disclosure goes into detail on synthesizing text, it is clearly disclosed that the system can also be used for speech processing- see Page 
Examiner’s position is that Yu and Graves are combinable because they are analogous prior art, since both references are in the same field of endeavor. As explained above, both references deal with speech processing/recognition.
Applicant argues that a prima facie case of obviousness has not been made because one skilled in the art would not be motivated to combine Yu and Graves. Applicant makes this argument on the assertion that Graves would lead away from the claimed invention because Graves performed several experiments and concluded that “the word level RNN performed better than the character-level network”.
In reply, Examiner respectfully disagrees because Applicant is pointing to just one analysis from one of the multiple experiments discussed in Graves to assert that Graves’ conclusion leads away from using character prediction. On page 24, under Section 5, Handwriting Synthesis, lines 1-4, Graves clearly explains that the networks explained above (the “word level RNN network”) are unable to predict text at the character level, and therefore discloses a new synthesis network architecture in Fig. 12 that deals with character-level transcription: “Character sequence c is presented to the hidden layers” (see page 27, lines 4-6 of Graves). This means that Graves teaches in favor of networks used for character-level transcription of text. As noted, Graves has described that the LSTM network can be used for speech recognition (thus taking an utterance and outputting text).

On page 29, under Section 5.2, Experiments, lines 1-5, Graves states: “The Character-level transcriptions from the IAM-OnDB were now used to define the character sequences c.” This means character-level transcription is used to predict character-level sequences/probabilities.
Clearly, Fig. 12 shows a synthesis network architecture that deals with character-level transcription.

[Image: media_image1.png]

Figure 1. This screenshot illustrates that the network architecture deals with character-level transcription.
Figure 14 below shows predicting text at the character level for transcription (“end of strokes” means words are transcribed at the character level).
[Image: media_image2.png]

 
Examiner takes the position that one of ordinary skill in the art would combine Yu with the character-level transcription taught by Graves because both references are analogous prior art (same field of endeavor) in the field of data processing. See MPEP 2141.01(a).

From the above findings, Examiner maintains the position that Graves teaches and supports character-level prediction with transcription, and therefore it would have been obvious to one of ordinary skill in the art to combine it with the teaching of Yu such that the posterior probabilities are output at the character level rather than the word level.

Regarding claims 10 and 16, Applicant argues that the combination of the cited references fails to disclose the limitation “decoding a predicted transcription of the input audio using the predicted character probabilities outputs from the transcription neural network constrained by a language model that interprets a string of characters from the predicted character probabilities outputs as a word or words.”
In reply, Examiner respectfully disagrees because Applicant’s argument is based on the same arguments addressed above, and therefore Examiner maintains the position that the combination of Yu in view of Graves clearly discloses the limitation above.
As to how the combination of Yu in view of Graves meets the limitation, refer to the explanation below:
     decoding a predicted transcription of the input audio using the predicted probabilities outputs (Section 0035, lines 8-10- Phoneme senone sequence means the system taught by Yu interprets inputs on the word level) from the transcription neural network  by a language model that interprets a string of characters from the predicted character probabilities (Graves: Page 6, lines 20-24) outputs as a word or words. (Yu: Section 0019, lines 10-15- thus output state posterior probabilities and the fact that the probability is in a senone means the transcription processing was in the word level-See NB) 
 (NB- Section 0014- thus the input signal is processed at the phonemes  level. Thus the “language model converts the series of phonemes into a sequence of words” thus from phonemes  level to word level of transcription) 

Regarding dependent claims 2-9, 11-15 and 17-20, Applicant argues they are allowable since the independent claims are allowable over the cited references based on the arguments filed by Applicant.
In reply, Examiner respectfully disagrees because, based on the response above, it is clear that Applicant’s arguments are not persuasive.



                           Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 4-5, 8, 10-11 and 13-19 are rejected under 35 U.S.C. 103 as being unpatentable over Yu (20120072215) in view of “Generating Sequences with Recurrent Neural Networks” by Alex Graves, referred to as Graves hereinafter in this correspondence.
               Claim 1, Yu discloses a computer-implemented method for training a transcription neural network, (Section 0020, lines 2-5- thus training the multiple output layer RNN models)  the method comprising: 
(Section 0030, lines 4-7 time-frame-t (or frame-block) of an input sample reads on the spectrogram of the input sample) into a first layer of the transcription neural network that evaluates, for each step of a set of steps, (Section 0027, lines 1-4- thus the beginning of the many layers or layer by layer  of the neural network reads on the first layer) 
                        (regarding spectrogram frames covering time steps Yu addresses this limitation by describing in section 0018, lines 4-6 that the utterance is digitized over a particular amount of time)
                  a spectrogram frame from the set of spectrogram frames and an associated context of one or more spectrogram frames; (Section 0030, lines 1-7- thus labels given the current input both at frame t (which may be a fixed local block of frames)- this means a frame at time (t) (spectrogram) is obtained from a local block of frames)  
               (also in section 0039, lines 1-4- thus the audio mask is generated on a frame by frame basis which means a frame at time t is generated on a frame by frame basis-(one frame from a plurality of frames))
               obtaining predicted probabilities (state posterior probabilities-Section 0019, lines 10-12)  for the utterance from the transcription neural network; (Section 0019, lines 10-15- thus output state posterior probabilities are output or obtained from the received sample 104)
 (Understand sample 104 is an utterance that was received by the speech recognition system- Section 0021) 
(Section 0028, lines 2-8- thus mapped probabilistic relationships of an input onto a set of appropriate outputs) and a corresponding ground truth transcription for the utterance (Section 0028, line 9- thus true frame level or utterance level) to determine a loss in predicting the corresponding ground truth transcription for the utterance; (Section 0028, lines 8-10- thus the cross entropy between the true and the predicted probability distributions over class labels is the difference between the true transcription and the predicted probability, and that reads on the loss) and updating one or more parameters of the transcription neural network (Section 0047, lines 6-10- thus node parameters are updated in training the neural network) using a gradient based upon the loss in predicting the utterance. (Section 0026, lines 1-3- thus gradient is used to update the rule of the weights when replacing the model (loss) with running a Gibbs (gradient)).
(The determined loss is also addressed by the secondary reference Graves-2014-see page 4)
Yu does not disclose obtaining predicted character probabilities (thus Yu does not clearly disclose that the predicted probabilities are characters).
Graves discloses prediction using character-level language modelling with recurrent neural networks. (Page 6, lines 20-24 under Section 3, Text Prediction- character-level language modelling with neural networks, where it is clearly disclosed that predicting one character at a time is more interesting from the perspective of sequence generation because it allows the network to invent novel words (original words), i.e., strings of characters; this means it is advantageous).
Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the teaching of predicting one character at a time in training a neural network. The motivation is that it allows the system to predict at the finest granularity found in the data and also allows the network to invent novel words and strings.
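For illustration only, and not as a characterization of the implementation in Yu, Graves, or the application, the claim 1 limitations of obtaining predicted character probabilities, determining a loss against the ground truth, and updating parameters using a gradient can be sketched as follows. The character set, the function names, and the simplification of treating the output scores themselves as the trainable parameters are all assumptions made for this sketch.

```python
import math

# Hypothetical character set; a real system would use a full alphabet.
ALPHABET = ["a", "b", "c", " "]

def softmax(logits):
    """Turn raw scores into a probability distribution over ALPHABET."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    """Loss for one time step: negative log-probability of the true character."""
    return -math.log(probs[target_index])

def gradient_step(logits, target_index, learning_rate=0.5):
    """One gradient-descent update; d(loss)/d(logit_i) = p_i - 1[i == target]."""
    probs = softmax(logits)
    grads = [p - (1.0 if i == target_index else 0.0) for i, p in enumerate(probs)]
    return [x - learning_rate * g for x, g in zip(logits, grads)]

# Assumption for brevity: the scores are updated directly, standing in for
# the network weights that would produce them.
logits = [0.0, 0.0, 0.0, 0.0]          # uniform prediction before training
target = ALPHABET.index("b")           # ground-truth character
loss_before = cross_entropy(softmax(logits), target)
logits = gradient_step(logits, target)
loss_after = cross_entropy(softmax(logits), target)
```

In this sketch the loss for the ground-truth character decreases after the single update, which is the behavior the gradient-based limitation describes.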

Claim 4, Yu in view of Graves discloses wherein the computer implemented method further comprising generating one or more utterances for a set of training data for use in training the transcription neural network. (Yu: Section 0020, lines 2-5- thus training the multiple output layer RNN models in fig. 3 Data A, B and C shows the average of the input utterance of the input layer)
Claim 5, Yu in view of Graves discloses wherein generating one or more utterances for a set of training data for use in training the transcription neural network (Yu: Section 0020, lines 2-5- thus training the multiple output layer RNN models) comprises:
 having a person wear headphones as the person records an utterance; intentionally inducing a Lombard effect during data collection of the utterance by playing background noise through the headphones worn by the person; (Yu, Section 0013, lines 1-4- thus the multiple audio signals includes background noise which is recorded by a microphone)  and capturing the Lombard-effected utterance of the person via a microphone without capturing the background noise. (Yu: Section 0015, lines 1-4- thus the technology disclosed by Yu teaches Lombard effect since speech or noise from different people are captured) 
Claim 7: Please see item 1 for details. 
Claim 8, Yu in view of Graves discloses further comprising using data parallelism by performing the steps comprising: using several copies of the transcription neural network across multiple processing units with each processing unit processing a separate minibatch of utterances; (Yu: Section 0055- thus the separate speaker signals within a multi-speaker audio signal read on the minibatch utterances) and combining a computed gradient from a processing unit with its peers during each iteration. (Yu: Section 0035- thus the sum over the speaker reads on combining the peers’ voices)
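For illustration only, the data parallelism recited in claim 8 (several copies of the network, each processing its own minibatch, with gradients combined across peers each iteration) can be sketched as below. Combining by simple averaging, and the names `average_gradients` and `apply_update`, are assumptions of this sketch and are not taken from the claim or the cited references.

```python
def average_gradients(per_unit_gradients):
    """Combine the gradients computed by each processing unit (assumed: mean)."""
    n_units = len(per_unit_gradients)
    n_params = len(per_unit_gradients[0])
    return [sum(g[i] for g in per_unit_gradients) / n_units
            for i in range(n_params)]

def apply_update(params, gradient, learning_rate):
    """Apply the same combined gradient to every copy of the parameters."""
    return [p - learning_rate * g for p, g in zip(params, gradient)]

# Hypothetical gradients from four processing units, one per minibatch.
replica_grads = [[1.0, -2.0], [3.0, 0.0], [2.0, 2.0], [2.0, 4.0]]
combined = average_gradients(replica_grads)
params = apply_update([0.0, 0.0], combined, learning_rate=0.5)
```

Because every copy applies the same combined gradient, the copies stay identical from one iteration to the next.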
Claim 10, Yu discloses a computer-implemented method for transcribing speech comprising: generating a set of spectrogram frames for an input audio; (Section 0028, lines 2-4- thus the frame level data from the input utterance) inputting the set of spectrogram frames into a transcription neural network; (Section 0030, lines 4-7 frame (or frame-block) of an input sample to predict the class labels)
obtaining predicted probabilities outputs from the transcription neural network; (Section 0028, lines 2-8- thus mapped probabilistic relationships of an input onto a set of appropriate outputs) and decoding a predicted transcription (Section 0030, lines 4-7- prediction of the class labels) of the input audio using the predicted probabilities outputs from the transcription neural network constrained by a language model that interprets a string of senone  from the predicted probabilities outputs as a word or words. (Section 0019, lines 10-15- thus output state posterior probabilities and the fact that the probability is in a senone means the transcription processing was in the character level not word level-See NB) 
 (NB- Section 0014- thus the input signal is processed at the phonemes (sound) level. Thus the “language model converts the series of phonemes into a sequence of words” thus from phonemes (sound) level to word level of transcription) 
(The determined loss is also addressed by the secondary reference Graves-2014-see page 4)
Yu does not disclose obtaining predicted character probabilities (thus Yu does not clearly disclose that the predicted probabilities are characters).
Graves discloses prediction using character-level language modelling with recurrent neural networks. (Page 6, lines 20-24 under Section 3, Text Prediction- character-level language modelling with neural networks, where it is clearly disclosed that predicting one character at a time is more interesting from the perspective of sequence generation because it allows the network to invent novel words (original words), i.e., strings of characters; this means it is advantageous).
Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the teaching of predicting one character at a time in training a neural network. The motivation is that it allows the system to predict at the finest granularity found in the data and also allows the network to invent novel or original words and character strings.
Claim 11, Yu in view of Graves discloses that the method further comprising using the predicted transcription for the input audio and a corresponding ground truth transcription for the input audio to determine a loss in predicting the corresponding ground truth transcription for the input audio; (Yu: Section 0026, lines 1-3- thus gradient is used to update the rule of the weights when replacing the model (loss) with running a Gibbs (gradient))
and updating one or more parameters of the transcription neural network using the loss in predicting the corresponding ground truth transcription for the input audio. (Yu: Section 0047, lines 6-10 thus node parameter are updated in training the neural network)
Claim 13, Yu in view of Graves (Page 6, lines 20-24 under section 3 Text prediction-Character-level language modelling with neural networks) discloses wherein the step of inputting the set of spectrogram frames into a transcription neural network comprises: inputting the set of spectrogram frames into the transcription neural network (Yu: Section 0030, lines 4-7) in which at least one layer of the transcription neural network operates on a context of spectrogram frames from the set of spectrogram frames. (Yu: Section 0046: Thus the output layer outputs the average results of the neural network based on the input utterance) 
Claim 14, Yu in view of Graves discloses wherein the step of decoding a predicted transcription of the input audio using the predicted character probabilities outputs from the transcription neural network constrained by a language model that interprets a string of characters from the predicted character probabilities outputs as a word or words comprises: (Yu: Section 0019, lines 10-15- thus output state posterior probabilities and the fact that the probability is in a senone means the transcription processing was in the character level not word level-See NB) 
 (NB- Section 0014- thus the input signal is processed at the phonemes  level. Thus the “language model converts the series of phonemes into a sequence of words” thus from phonemes  level to word level of transcription) 
given the predicted character probabilities outputs from the transcription neural network, performing a search to find a sequence of characters that is most probable (Yu: Section 0028, lines 8-10- “predicted probability distribution”) according to both the predicted character probabilities outputs  and an N-gram language model output that interprets a string of characters from the predicted character probabilities outputs as a word or words. (Yu: Section 0035, lines 8-11- thus the state sequence is generated and map to the phoneme/senone sequence which reads on the string of characters) 
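For illustration only, the search recited in claim 14 (finding the character sequence that is most probable under both the network outputs and an N-gram language model) can be sketched as a small beam search. The toy bigram table, the weight `alpha`, the floor probability for unseen word pairs, and the omission of any blank or repeat-collapsing step are assumptions of this sketch and do not describe Yu, Graves, or the application.

```python
import math

def lm_log_prob(text, bigram_probs, floor=1e-6):
    """Log-probability of the words in `text` under a toy bigram model."""
    words = ["<s>"] + text.split()
    return sum(math.log(bigram_probs.get(pair, floor))
               for pair in zip(words, words[1:]))

def decode(char_probs, bigram_probs, alpha=1.0, beam_width=8):
    """Beam search over per-step character probabilities, rescored by the LM."""
    beams = [("", 0.0)]  # (characters so far, network log-probability)
    for step in char_probs:
        candidates = [(prefix + ch, score + math.log(p))
                      for prefix, score in beams
                      for ch, p in step.items()]
        # Rank by network score plus weighted language-model score.
        candidates.sort(
            key=lambda c: c[1] + alpha * lm_log_prob(c[0], bigram_probs),
            reverse=True)
        beams = candidates[:beam_width]
    return beams[0][0]

# Hypothetical network outputs for three time steps.
char_probs = [{"c": 0.6, "b": 0.4}, {"a": 0.9, "o": 0.1}, {"p": 0.55, "t": 0.45}]
bigram_probs = {("<s>", "cat"): 0.5, ("<s>", "bat"): 0.1, ("<s>", "cap"): 0.01}

with_lm = decode(char_probs, bigram_probs)                 # "cat"
without_lm = decode(char_probs, bigram_probs, alpha=0.0)   # "cap"
```

With the language model ignored, the character probabilities alone favor "cap"; with the language model applied, the string is interpreted as the more probable word "cat".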
Claim 15, Yu in view of Graves discloses wherein the transcription neural network comprises: a first set of three layers that are non-recurrent; a fourth layer that is a bi-directional recurrent network which includes two sets of hidden units comprising a set with forward recurrence (Yu: Section 0036, lines 8-12: Yu discloses multiple layers of stochastic hidden units) and a set with backward recurrence; and a fifth layer that is a non-recurrent layer, which takes forward and backward units from the fourth layer as inputs and outputs the predicted character probabilities. (Yu Section 0047, lines 16-20- thus the phone recognition task to increase the probability reads on the character probabilities) 
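For illustration only, the five-layer arrangement recited in claim 15 (three non-recurrent layers, a bi-directional recurrent fourth layer, and a non-recurrent fifth layer that outputs character probabilities) can be sketched as a forward pass. The layer sizes, constant weights, ReLU activation, and all function names are assumptions of this sketch; the per-frame context window of the first layer is omitted for brevity.

```python
import math

def relu_layer(x, weights, bias):
    """Non-recurrent layer: affine transform followed by ReLU."""
    return [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, bias)]

def recurrent_pass(inputs, w_in, w_rec, reverse=False):
    """One direction of the recurrent layer over all time steps."""
    steps = range(len(inputs))
    if reverse:
        steps = reversed(steps)
    hidden = [0.0] * len(w_in)
    outputs = [None] * len(inputs)
    for t in steps:
        hidden = [max(0.0,
                      sum(w * v for w, v in zip(w_in[j], inputs[t]))
                      + sum(r * h for r, h in zip(w_rec[j], hidden)))
                  for j in range(len(w_in))]
        outputs[t] = hidden
    return outputs

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def forward(frames, params):
    h = frames
    for weights, bias in params["dense"]:          # layers 1-3, non-recurrent
        h = [relu_layer(x, weights, bias) for x in h]
    fwd = recurrent_pass(h, params["in_f"], params["rec_f"])                # layer 4, forward
    bwd = recurrent_pass(h, params["in_b"], params["rec_b"], reverse=True)  # layer 4, backward
    out_w, out_b = params["out"]                   # layer 5, non-recurrent
    return [softmax([sum(w * v for w, v in zip(row, f + b)) + c
                     for row, c in zip(out_w, out_b)])
            for f, b in zip(fwd, bwd)]

def filled(rows, cols, value):
    return [[value] * cols for _ in range(rows)]

# Hypothetical sizes: 2 inputs per frame, 2 hidden units, 3 output characters.
params = {
    "dense": [(filled(2, 2, 0.5), [0.0, 0.0]) for _ in range(3)],
    "in_f": filled(2, 2, 0.3), "rec_f": filled(2, 2, 0.1),
    "in_b": filled(2, 2, 0.2), "rec_b": filled(2, 2, 0.1),
    "out": (filled(3, 4, 0.25), [0.0, 0.1, 0.2]),
}
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
char_probs = forward(frames, params)
```

The fifth layer receives the concatenated forward and backward hidden units for each time step, so each output distribution depends on both earlier and later frames.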
Claim 16, Yu discloses a non transitory computer-readable medium or media comprising one or more sequences of instructions which, when executed by one or more processors, (Section 0052 lines 14-18 Processor access memory) 
causes steps to be performed comprising: generating a set of spectrogram frames for an input audio; (Section 0030, lines 4-7 frame (or frame-block) of an input sample to predict the class labels)
inputting the set of spectrogram frames into a transcription neural network; obtaining predicted probabilities outputs from the transcription neural network; (Section 0027, lines 1-4- thus the beginning of the many layers or layer by layer  of the neural network reads on the first layer)  and decoding a predicted transcription of the input audio using the predicted probabilities outputs (Section 0035, lines 8-10- Phoneme senone sequence means the system taught by Yu interprets inputs on the word level) from the transcription neural network  by a language model that interprets a string of senone from the predicted probabilities  outputs as a word or words. (Yu: Section 0019, lines 10-15- thus output state posterior probabilities and the fact that the probability is in a senone means the transcription processing was in the word level-See NB) 
 (NB- Section 0014- thus the input signal is processed at the phonemes  level. Thus the “language model converts the series of phonemes into a sequence of words” thus from phonemes  level to word level of transcription) 
(The determined loss is also addressed by the secondary reference Graves-2014-see page 4)
Yu does not disclose obtaining predicted character probabilities because Yu mentions in Section 0044, lines 3-6, that the output sequence of the final layer is in senones, phonemes, words, etc., and teaches only predicted posterior probabilities but does not clearly teach obtaining predicted character probabilities.
Graves discloses text prediction using character-level language modelling with recurrent neural networks. (Page 6, lines 20-24 under Section 3, Text Prediction- character-level language modelling with neural networks, where it is clearly disclosed that predicting one character at a time is more interesting from the perspective of sequence generation because it allows the network to invent novel words, i.e., strings of characters).
Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the teaching of predicting one character at a time in training a neural network. The motivation is that it allows the system to predict at the finest granularity found in the data and also allows the network to invent novel words and strings.

Claim 17, Yu in view of Graves discloses that the non-transitory computer readable medium or media further comprising one or more sequences of instructions which, when executed by one or more processors, (Yu: Section 0055, lines 4-7- thus the processor that causes the system to execute the software in the memory)  causes steps to be performed comprising:
using the predicted transcription for the input audio and a corresponding ground truth transcription for the input audio (Yu: Section 0028, lines 9 – thus true frame level or utterance level) to determine a loss in predicting the corresponding ground truth transcription for the input audio; (Yu: Section 0028, lines 8-10- thus the cross entropy between the true and the predicted probability distributions over class labels reads on the loss)
and updating one or more parameters of the transcription neural network (Yu: Section 0047, lines 6-10 thus node parameter are updated in training the neural network) using the loss in predicting the corresponding ground truth transcription for the input audio. (Yu: Section 0026, lines 1-3- thus gradient is used to update the rule of the weights when replacing the model (loss) with running a Gibbs (gradient)). 

Claim 18, Yu in view of Graves discloses wherein the step of inputting the set of spectrogram frames into a transcription neural network comprises: inputting the set of spectrogram frames into the transcription neural network in which at least one layer of the transcription neural network operates on a context of spectrogram frames from the set of spectrogram frames. (Yu: Section 0002, lines 14-18- thus “each layer of hidden units learns to represents features that captures higher order correlations in original input data” this means the hidden layers operates based on captured features (context) of the original input utterance) 
Claim 19, Yu in view of Graves discloses wherein the step of decoding a predicted transcription of the input audio using the predicted character probabilities outputs from the transcription neural network constrained by a language model that interprets a string of characters from the predicted character probabilities outputs as a word or words (Yu: Section 0019, lines 10-15- thus output state posterior probabilities and the fact that the probability is in a senone means the transcription processing was in the character level not word level-See NB) 
 (NB- Section 0014- thus the input signal is processed at the phonemes  level. Thus the “language model converts the series of phonemes into a sequence of words” thus from phonemes level to word level of transcription)  comprises:
 given the predicted character probabilities outputs from the transcription neural network, performing a search to find a sequence of characters that is most probable according to both the predicted character probabilities outputs  (Yu: Section 0051, lines 6-12- thus “bi-gram language model (LM) features in which each output unit sequence w is consisted of output units thus the model outputs sequence characters such as senones or phonemes)  and an N-gram language model output (Yu: See Fig. 7  shows  Layer N which represents the N-gram language model output) that interprets a string of characters from the predicted character probabilities outputs as a word or words. (Yu: Section 0035, lines 8-10- Phoneme senone sequence means the system taught by Yu interprets inputs on the character level) 
Claims 2, 3, 6, 9, 12 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Yu (20120072215) in view of Alex Graves (2014), referred to as Graves hereinafter in this correspondence, and further in view of Zhang 2007.

Claim 2, Yu in view of Graves does not disclose wherein the computer implemented method further comprising jittering at least some of the utterances of the set of one or more utterances prior to inputting into the transcription neural network.
Zhang discloses wherein the computer implemented method further comprising jittering at least some of the utterances of the set of one or more utterances prior to inputting into the transcription neural network. (Zhang: Section 1, lines 40-46- thus mapping due to jittered input data and using jitter input data to form ensembles).
Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the teaching of shifting the signal before inputting into the voice recognition system. The motivation is that it allows the input signals to be tested before processing and it narrows the prediction error.
Claim 3, Yu in view of Graves discloses obtaining output results from the transcription neural network for the set of spectrograms and averaging the output results for the set of spectrograms to obtain an output for the utterance. (Yu: Section 0046- thus the output layer outputs the average results of the neural network based on the input utterance)
However, Yu in view of Graves does not disclose wherein the step of jittering at least some of the set of utterances prior to inputting into the transcription neural network comprises:
generating a jitter set of utterances for the utterance by translating an audio file of the utterance by one or more time values and converting the jitter set of utterances and the utterance into a set of spectrograms;
Zhang (Section 1, page 5330, lines 34-36) discloses wherein the step of jittering at least some of the set of utterances prior to inputting into the transcription neural network comprises:
generating a jitter set of utterances for the utterance by translating an audio file of the utterance by one or more time values (Zhang, in Section 2, page 5331, lines 19-22, discloses adding noise to the training sample before inputting it into the neural network) and converting the jitter set of utterances and the utterance into a set of spectrograms; (Zhang: Fig. 3 shows a spectrogram of the noise added to the utterances)
 
Therefore it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the teaching of adding noise to the input before inputting it into the neural network. The motivation is that it enables the system to perform better.
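For illustration only, the claim 3 language (translating an audio file by one or more time values, converting the jitter set and the original utterance into spectrograms, and averaging the network outputs) can be sketched as below. The zero-padding of shifted audio, the tiny unwindowed DFT used as a spectrogram, and all function names are assumptions of this sketch and are not drawn from Zhang, Yu, or Graves.

```python
import cmath

def translate(samples, shift):
    """Shift audio by `shift` samples (positive = delay), zero-padding to keep length."""
    n = len(samples)
    shift = max(-n, min(n, shift))
    if shift >= 0:
        return [0.0] * shift + samples[:n - shift]
    return samples[-shift:] + [0.0] * (-shift)

def spectrogram(samples, frame_len=4, hop=2):
    """Magnitude spectrum of each frame via a direct DFT (no window, for brevity)."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        frames.append([
            abs(sum(x * cmath.exp(-2j * cmath.pi * k * i / frame_len)
                    for i, x in enumerate(frame)))
            for k in range(frame_len // 2 + 1)])
    return frames

def jittered_spectrograms(samples, shifts):
    """Spectrograms for the original utterance plus each time-shifted copy."""
    versions = [samples] + [translate(samples, s) for s in shifts]
    return [spectrogram(v) for v in versions]

def average_outputs(outputs):
    """Element-wise mean of the per-spectrogram output vectors."""
    return [sum(column) / len(outputs) for column in zip(*outputs)]

# Hypothetical 8-sample utterance, jittered one sample earlier and later.
utterance = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
specs = jittered_spectrograms(utterance, shifts=[-1, 1])
```

Each spectrogram in `specs` would be passed through the network separately, and `average_outputs` would then combine the resulting output vectors into a single output for the utterance.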

Claim 6, Yu in view of Graves discloses adjusting a signal-to-noise ratio of the noise track relative to an audio file; (Yu: Section 0013, lines 6-10- thus the signal to noise ratio can be adjusted)
Yu in view of Graves does not disclose wherein generating one or more utterances for a set of training data for use in training the transcription neural network comprises: adding one or more noise clips selected from a set of approved noise clips to form a noise track;

Zhang discloses wherein generating one or more utterances for a set of training data for use in training the transcription neural network comprises: adding one or more noise clips selected from a set of approved noise clips to form a noise track; (Zhang: Section 4, lines 8-10- thus the increase in noise level means adding more noise to the training data)
adding the adjusted noise track to the audio file to form a synthesized noise audio file; and adding the synthesized noise audio file to the set of training data. (Zhang: Section 4, lines 15-20- Noise 2 to Noise 4 read on the synthesized noise tracks that are added to the set of training data as shown in Table 1)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the teaching of Zhang of adding noise to the input before inputting it into the neural network. The motivation is that it enables the system to perform better.
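For illustration only, the sequence of steps recited in Claim 6 (forming a noise track from noise clips, adjusting its signal-to-noise ratio relative to an audio file, adding it to the audio file, and adding the result to the training data) can be sketched as follows. This is a minimal sketch under stated assumptions and does not reproduce the method of any cited reference: the function names, the sample values, and the 20 dB target are hypothetical, and the signal-to-noise ratio is assumed to be defined as the ratio of root-mean-square amplitudes expressed in decibels.

```python
import math

def rms(x):
    """Root-mean-square amplitude of a list of samples."""
    return math.sqrt(sum(v * v for v in x) / len(x))

def build_noise_track(clips, length):
    """Concatenate (non-empty) noise clips, cycling through them,
    until the track covers `length` samples. Hypothetical helper."""
    track = []
    i = 0
    while len(track) < length:
        track.extend(clips[i % len(clips)])
        i += 1
    return track[:length]

def scale_to_snr(speech, noise, snr_db):
    """Scale the noise so that 20*log10(rms(speech)/rms(noise)) == snr_db."""
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20))
    gain = target_noise_rms / rms(noise)
    return [gain * v for v in noise]

def mix(speech, noise):
    """Add the adjusted noise track to the audio, sample by sample."""
    return [s + n for s, n in zip(speech, noise)]

# Toy data: all values are illustrative only.
speech = [1.0, -1.0, 1.0, -1.0]
approved_clips = [[0.5, -0.5], [0.25, 0.25, -0.25]]
noise_track = build_noise_track(approved_clips, len(speech))
adjusted = scale_to_snr(speech, noise_track, snr_db=20.0)
training_data = [speech]
training_data.append(mix(speech, adjusted))  # synthesized noisy utterance
```

After this runs, `training_data` contains both the clean utterance and its synthesized noisy counterpart at the assumed 20 dB signal-to-noise ratio.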

Claim 9, Yu in view of Graves discloses the computer-implemented method further comprising using data parallelism by performing steps comprising:
having each processing unit process many utterances in parallel by concatenating many utterances into a single matrix; (Yu: Section 0037, lines 3-5- thus the synthesized speech data can be combined, analyzed, and therefore processed)

Zhang discloses sorting utterances by length and combining similarly-sized utterances into minibatches (Zhang: Section 4, lines 20-24- thus Zhang sorts the noise (utterance) based on the levels, such as from the lowest noise level to the highest level) and padding utterances with silence so that all utterances in a minibatch have the same length. (Zhang: Section 4, lines 26-31- thus noises (utterances) with the same level are grouped together as shown in Table 5)
(NB: The noises are understood to be recorded as part of the utterance.)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the teaching of Zhang of adding noise to the input before inputting it into the neural network. The motivation is that sorting the utterances makes it possible to identify the appropriate utterances to apply, which will improve performance.
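For illustration only, the batching steps recited in Claim 9 (sorting utterances by length, combining similarly-sized utterances into minibatches, and padding with silence so that all utterances in a minibatch have the same length) can be sketched as follows. This is a minimal sketch and is not drawn from Yu, Graves, or Zhang: the function name, the batch size, and the representation of silence as zero-valued samples are assumptions made for the example.

```python
def make_minibatches(utterances, batch_size, silence=0.0):
    """Sort utterances by length, group neighbours into minibatches, and pad
    each utterance with `silence` so every row in a batch has equal length.
    Each returned batch is a rectangular list of rows, i.e. a single matrix."""
    ordered = sorted(utterances, key=len)
    batches = []
    for i in range(0, len(ordered), batch_size):
        group = ordered[i:i + batch_size]
        longest = max(len(u) for u in group)
        batches.append([list(u) + [silence] * (longest - len(u)) for u in group])
    return batches

# Toy utterances of different lengths; the values are illustrative only.
utterances = [
    [1.0, 1.0, 1.0, 1.0, 1.0],
    [2.0],
    [3.0, 3.0, 3.0],
    [4.0, 4.0],
]
batches = make_minibatches(utterances, batch_size=2)
```

Because the utterances are sorted before grouping, each minibatch holds utterances of similar length, so comparatively little silence has to be added to make the rows equal.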

Claim 12, Yu in view of Graves and further in view of Zhang discloses wherein the step of generating a set of spectrogram frames from the input audio comprises: generating a set of spectrogram frames from a normalized version of the input audio or from a normalized (Yu: Section 0030, lines 4-7- thus the frame block of the input sample reads on the spectrogram) and jitter version of the input audio. (Zhang: Section 2, lines 6-9- thus Zhang discloses that training with jitter helps prevent additional constraints on the system by providing jitter with the input utterance)

Claim 20, Yu in view of Graves and further in view of Zhang discloses wherein the transcription neural network comprises:
a first set of three layers that are non-recurrent; a fourth layer that is a bi-directional recurrent network, (Yu: Fig. 7 shows Layer N and therefore the network has more than 3 layers) which includes two sets of hidden units comprising a set with forward recurrence and a set with backward recurrence; (Zhang: Section 3.2, lines 30-33- thus using the standard feedforward neural network) and a fifth layer that is a non-recurrent layer, (Yu: Section 0037, lines 20-21) which takes forward (Zhang: Section 1, Page 5330, lines 2- forecasting using a feedforward neural network) and backward units from the fourth layer as inputs and outputs the predicted character probabilities. (Yu: Section 0003, lines 3-5- thus back-propagation reads on the backward units)



Cited Art
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Huang et al. (2006/0031069) discloses a speech recognition engine that synthesizes speech by using a grapheme-to-phoneme module. The system also includes an N-gram grapheme model that references the N-
Peck (2005/0114118) discloses a jitter buffer that compensates for packets having varying amounts of network latency. Peck further discloses that this approach may temporarily clip the stream used by the VAD.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 


Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, KING D POON can be reached on 571-272-7440.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.