Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Compact Prosecution
The examiner suggests amending independent claims 1, 9 and 16 to include the limitations “a diversity of streams in terms of temporal resolution for a multistream Convolution Neural Network architecture to achieve expected result, wherein a word error rate (WER) between original multistream CNN models with the 6-9-12”. These amendments would overcome the current rejection.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1-2 and 4-17 are rejected under 35 U.S.C. 103 as being unpatentable over Van (US 2018/0075343) in view of Jang (US 20200175313), referred to as Van in view of Jang. 
Claim 1, Van discloses a computer-implemented method for processing speech (Section 0004, lines 5-7: sequence of audio data), comprising:
receiving a sequence of vectors computed from an audio signal; (Section 0075: the input of the convolutional neural network includes one or more global features, which are vectors (see the one-hot encoded vector at lines 8-9 of Section 0075); Section 0101 addresses the same issue)
computing a first stream vector by processing the sequence of feature vectors in a first stream, (Section 0076, lines 1-4: “the audio sequences (feature vector stream) are conditioned …by conditioning the activation function of some or all of the convolutional layers (first computations) in the convolutional subnetwork”; thus each convolutional layer receives and processes a sequence of audio data which, based on Section 0075 above, is in vector form)
wherein the first stream comprises a first convolutional neural network layer having a first dilation rate; (Section 0078, lines 2-4: dilated causal convolutional layer 204 has dilation one, so the dilation rate is 1 for layer 204)
computing a second stream vector by processing the sequence of vectors in a second stream, (Section 0076, lines 1-4: “the audio sequences (feature vector stream) are conditioned …by conditioning the activation function of some or all of the convolutional layers (second computations) in the convolutional subnetwork”; thus each convolutional layer receives and processes a sequence of audio data which, based on Section 0075 above, is in vector form)
wherein the second stream comprises a second convolutional neural network layer having a second dilation rate, (Section 0078, lines 2-4: dilated causal convolutional layer 206 has dilation two, so the dilation rate is 2 for layer 206)

[Image: media_image1.png]

Figure 1: One of ordinary skill in the art will see that there are a first and a second dilation rate.
(The secondary reference, Jang (US 20200175313), also shows in Fig. 7 a system in which there is a dilation rate ‘2’ r=r/2 and another dilation rate ‘3’ r-r/3; see also the screenshot below.)

[Image: media_image2.png]

Figure 2: Elements 713 and 715 show that a first dilation rate is different from a second dilation rate.
wherein the second dilation rate is different from the first dilation rate; (Based on Section 0078 and Fig. 2, the dilation rates for convolutional layers 204 (dilation rate of 1) and 206 (dilation rate of 2) are clearly different; see Section 0081, lines 1-6) and
computing a vector of speech unit scores by processing the first stream vector and the second stream vector. (Section 0123, lines 7-10: the neural network generates a sequence of words, and the score distribution includes a respective score for each word in a vocabulary of words)
Van clearly discloses using vectors but is silent as to whether the vectors are feature vectors.
Jang, in Fig. 4, uses feature maps that show particular numbers (vectors) per map, which reads on feature vectors. (Section 0083: feature maps 430 are spaced apart by 1)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the teaching of using feature maps (vectors) in the neural network. The motivation is that feature maps give an internal representation of specific input speech data, which will make the neural network more effective.
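
For readers less familiar with the terminology, the operations recited in claim 1 can be pictured with the following minimal sketch. It is illustrative only and is not taken from the application, Van, or Jang: the function names, kernel taps, input frames, and output weights are all hypothetical, and a plain causal dilated convolution with scalar taps stands in for a trained convolutional layer. Two streams process the same sequence of vectors with different dilation rates, and the per-stream vectors are then combined into a vector of scores.

```python
# Illustrative sketch only: all names, weights, and sizes are hypothetical.

def dilated_conv1d(frames, kernel, dilation):
    """Causal 1-D dilated convolution over a sequence of vectors.

    frames: list of T vectors (each a list of D floats).
    kernel: list of K scalar taps, applied to every dimension alike.
    The output at time t combines frames t, t - dilation, t - 2*dilation, ...
    Positions before the start of the sequence are treated as zeros.
    """
    dim = len(frames[0])
    out = []
    for t in range(len(frames)):
        acc = [0.0] * dim
        for k, w in enumerate(kernel):
            src = t - k * dilation
            if src >= 0:
                for d in range(dim):
                    acc[d] += w * frames[src][d]
        out.append(acc)
    return out


def multistream_scores(frames, kernel, dilations, out_weights):
    """Run one stream per dilation rate, concatenate, and score each step."""
    streams = [dilated_conv1d(frames, kernel, r) for r in dilations]
    scores = []
    for t in range(len(frames)):
        # Concatenate the per-stream vectors for this time step.
        merged = [x for stream in streams for x in stream[t]]
        # One linear score per (hypothetical) speech unit.
        scores.append(
            [sum(w * x for w, x in zip(row, merged)) for row in out_weights]
        )
    return scores


frames = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [1.0, 3.0]]
kernel = [0.5, 0.5]
first = dilated_conv1d(frames, kernel, dilation=1)   # first stream
second = dilated_conv1d(frames, kernel, dilation=2)  # second stream
out_weights = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
]
scores = multistream_scores(frames, kernel, [1, 2], out_weights)
```

With dilation 1 the last output mixes the last two frames, while with dilation 2 it mixes the last frame with the frame two steps earlier, so the two streams see the same input at different temporal resolutions.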

Claim 2, Van in view of Jang discloses processing the vector of speech unit scores to determine one or more words spoken in the audio signal. (Van: Section 0123, lines 4-7: the score distribution includes a respective score for each word in a vocabulary of words within the speech of the speaker)
Claim 4, Van in view of Jang discloses wherein the sequence of feature vectors is computed by processing a second sequence of feature vectors with one or more neural network layers. (Van: Section 0123, lines 5-7: “score distribution contains score for each word in a vocabulary of words”, which means each word and its synonyms have specific properties in a particular phenomenon)
Claim 5, Van in view of Jang discloses: computing a third stream vector by processing the sequence of feature vectors in a third stream, (Van: Section 0078, lines 5-6: “dilated causal convolutional layer 208 with dilation four” reads on the third stream) wherein the third stream comprises a third convolutional neural network layer having a third dilation rate, (Van: Section 0078, lines 5-7: “dilated causal convolutional layer 208 with dilation four”) wherein the third dilation rate is different from the first dilation rate and is different from the second dilation rate; (Van: based on Section 0078, dilation four is clearly different from dilation one and dilation two) and wherein computing the vector of speech unit scores comprises processing the third stream vector. (Van: based on Section 0123, lines 5-7, the synonyms include layer 208, which reads on the third stream of vectors)
Claim 6, Van in view of Jang discloses wherein the first stream comprises a sequence of three or more convolutional neural network layers each having the first dilation rate. (Van: Section 0080: various sequences are operated on by the layers in the block, which means that a block, a layer, or a particular dilation rate involves a variety of sequences of audio data)
Claim 7, Van in view of Jang discloses wherein the first convolutional neural network layer comprises a time-delay neural network layer. (Van: Section 0119, lines 3-4: “the system 500 waits until a specified number of inputs have been seen before beginning processing”; the system waiting means the system accounts for delay)
Claim 8, Van in view of Jang discloses wherein the first convolutional neural network layer comprises a factorized time-delay neural network layer. (Van: Section 0119, lines 3-4: “the system 500 waits until a specified number of inputs have been seen before beginning processing”; the system waiting means the system accounts for delay)

Claim 9, Van discloses a system, (Fig. 1) comprising:
at least one server computer comprising at least one processor and at least one memory, (Section 0203: computer readable media storing program instructions; data processing apparatus in Section 0205) the at least one server computer (Section 0207: data server) configured to:
receiving a sequence of vectors computed from an audio signal; (Section 0075: the input of the convolutional neural network includes one or more global features, which are vectors (see the one-hot encoded vector at lines 8-9 of Section 0075); Section 0101 addresses the same issue)
computing a first stream vector by processing the sequence of vectors in a first stream, (Section 0076, lines 1-4: “the audio sequences (vector stream) are conditioned …by conditioning the activation function of some or all of the convolutional layers (first computations) in the convolutional subnetwork”; thus each convolutional layer receives and processes a sequence of audio data which, based on Section 0075 above, is in vector form)
wherein the first stream comprises a first convolutional neural network layer having a first dilation rate; (Section 0078, lines 2-4: dilated causal convolutional layer 204 has dilation one, so the dilation rate is 1 for layer 204)
computing a second stream vector by processing the sequence of vectors in a second stream, (Section 0076, lines 1-4: “the audio sequences (feature vector stream) are conditioned …by conditioning the activation function of some or all of the convolutional layers (second computations) in the convolutional subnetwork”; thus each convolutional layer receives and processes a sequence of audio data which, based on Section 0075 above, is in vector form)
wherein the second stream comprises a second convolutional neural network layer having a second dilation rate, (Section 0078, lines 2-4: dilated causal convolutional layer 206 has dilation two, so the dilation rate is 2 for layer 206)

[Image: media_image1.png]
 
Figure 3: One of ordinary skill in the art will see that there are a first and a second dilation rate.

(The secondary reference, Jang (US 20200175313), also shows in Fig. 7 a system in which there is a dilation rate ‘2’ r=r/2 and another dilation rate ‘3’ r-r/3; see also the screenshot below.)

[Image: media_image2.png]

Figure 4: Elements 713 and 715 show that a first dilation rate is different from a second dilation rate.

wherein the second dilation rate is different from the first dilation rate; (Based on Section 0078 and Fig. 2, the dilation rates for convolutional layers 204 (dilation rate of 1) and 206 (dilation rate of 2) are clearly different; see Section 0081, lines 1-6) and
computing a vector of speech unit scores by processing the first stream vector and the second stream vector. (Section 0123, lines 7-10: the neural network generates a sequence of words, and the score distribution includes a respective score for each word in a vocabulary of words)
Van clearly discloses using vectors but is silent as to whether the vectors are feature vectors.
Jang, in Fig. 4, uses feature maps that show particular numbers (vectors) per map, which reads on feature vectors. (Section 0083: feature maps 430 are spaced apart by 1)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the teaching of using feature maps (vectors) in the neural network. The motivation is that feature maps give an internal representation of specific input speech data, which will make the neural network more effective.

Claim 10, Van in view of Jang discloses wherein the at least one server computer is configured to process the vector of speech unit scores to determine one or more words spoken in the audio signal. (Van: Section 0123, lines 4-7: the score distribution includes a respective score for each word in a vocabulary of words within the speech of the speaker)

Claim 11, Van in view of Jang discloses wherein the at least one server computer is configured to compute the vector of speech unit scores by concatenating two or more stream vectors, the two or more stream vectors comprising the first stream vector and the second stream vector. (Van: Section 0123, lines 7-10: the neural network generates a sequence of words, and the score distribution includes a respective score for each word in a vocabulary of words)

Claim 12, Van in view of Jang discloses wherein the at least one server computer is configured to compute the vector of speech unit scores using batch normalization. (Van: Section 0033, lines 1-3: the system implements a sub-batch normalization layer)
Claim 13, Van in view of Jang discloses wherein the second dilation rate is a multiple of the first dilation rate. (Van: Section 0078, lines 5-7: “dilated causal convolutional layer 208 with dilation four”)
Claim 14, Van in view of Jang discloses wherein the first dilation rate is a multiple of a subsampling rate corresponding to the sequence of feature vectors. (Van: based on Section 0078, dilation four is clearly different from dilation one and dilation two)
Claim 15, Van in view of Jang discloses wherein the first stream comprises a sequence of three or more convolutional neural network layers each having the first dilation rate. (Van: based on Section 0078, dilation four is clearly different from dilation one and dilation two)
Claim 16, Van discloses one or more non-transitory, computer-readable media comprising computer-executable instructions that, when executed, (Section 0203: computer readable media storing program instructions; data processing apparatus in Section 0205) cause at least one processor to perform actions comprising:
receiving a sequence of vectors computed from an audio signal; (Section 0075: the input of the convolutional neural network includes one or more global features, which are vectors (see the one-hot encoded vector at lines 8-9 of Section 0075); Section 0101 addresses the same issue)
computing a first stream vector by processing the sequence of feature vectors in a first stream, (Section 0076, lines 1-4: “the audio sequences (feature vector stream) are conditioned …by conditioning the activation function of some or all of the convolutional layers (first computations) in the convolutional subnetwork”; thus each convolutional layer receives and processes a sequence of audio data which, based on Section 0075 above, is in vector form)
wherein the first stream comprises a first convolutional neural network layer having a first dilation rate; (Section 0078, lines 2-4: dilated causal convolutional layer 204 has dilation one, so the dilation rate is 1 for layer 204)
computing a second stream vector by processing the sequence of vectors in a second stream, (Section 0076, lines 1-4: “the audio sequences (feature vector stream) are conditioned …by conditioning the activation function of some or all of the convolutional layers (second computations) in the convolutional subnetwork”; thus each convolutional layer receives and processes a sequence of audio data which, based on Section 0075 above, is in vector form)
wherein the second stream comprises a second convolutional neural network layer having a second dilation rate, (Section 0078, lines 2-4: dilated causal convolutional layer 206 has dilation two, so the dilation rate is 2 for layer 206)

[Image: media_image1.png]

Figure 5: One of ordinary skill in the art will see that there are a first and a second dilation rate.
(The secondary reference, Jang (US 20200175313), also shows in Fig. 7 a system in which there is a dilation rate ‘2’ r=r/2 and another dilation rate ‘3’ r-r/3; see also the screenshot below.)

[Image: media_image2.png]

Figure 6: Elements 713 and 715 show that a first dilation rate is different from a second dilation rate.

wherein the second dilation rate is different from the first dilation rate; (Based on Section 0078 and Fig. 2, the dilation rates for convolutional layers 204 (dilation rate of 1) and 206 (dilation rate of 2) are clearly different; see Section 0081, lines 1-6)
and computing a vector of speech unit scores by processing the first stream vector and the second stream vector. (Section 0123, lines 7-10: the neural network generates a sequence of words, and the score distribution includes a respective score for each word in a vocabulary of words)
Van clearly discloses using vectors but is silent as to whether the vectors are feature vectors.
Jang, in Fig. 4, uses feature maps that show particular numbers (vectors) per map, which reads on feature vectors. (Section 0083: feature maps 430 are spaced apart by 1)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the teaching of using feature maps (vectors) in the neural network. The motivation is that feature maps give an internal representation of specific input speech data, which will make the neural network more effective.


Claim 17, Van in view of Jang discloses wherein the actions comprise processing the vector of speech unit scores to determine one or more words spoken in the audio signal. (Van: Section 0123, lines 5-7: “score distribution contains score for each word in a vocabulary of words”, which means each word and its synonyms have specific properties in a particular phenomenon)

Claims 3, 18, 19 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Van (US 2018/0075343) in view of Jang (US 20200175313) as applied to claims 1-2 and 4-17 above, and further in view of Wu (US 20200051583).
Claim 3, Van in view of Jang discloses wherein the sequence of feature vectors (Van: Section 0123, lines 5-7: “score distribution contains score for each word in a vocabulary of words”, which means each word and its synonyms have specific properties or features expressed as vectors).
However, Van in view of Jang does not disclose wherein the properties comprise a sequence of vectors of Mel-frequency cepstral coefficients.
Wu discloses a convolutional neural network that processes an input sequence, wherein the sequence of vectors includes Mel-frequency cepstral coefficients. (Section 0071, lines 1-3: thus the input representation for each time step to generate a Mel-frequency for each time step)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the teaching of using Mel-frequency in processing audio data. The motivation is that using the Mel frequency makes it easier to distinguish between similar low-frequency sounds than between similar high-frequency sounds.
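
The low-frequency versus high-frequency point in the motivation above can be checked numerically. The sketch below is illustrative only and is not taken from Wu: it assumes the commonly used Hz-to-mel conversion formula mel = 2595 * log10(1 + f / 700), and the function name and test frequencies are hypothetical. It compares how far apart two tones 100 Hz apart land on the mel scale at low and at high frequencies.

```python
import math


def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels (commonly used formula, assumed here)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


# The same 100 Hz gap, measured once at low and once at high frequencies.
low_gap = hz_to_mel(300.0) - hz_to_mel(200.0)
high_gap = hz_to_mel(5100.0) - hz_to_mel(5000.0)

print(f"mel gap between 200 Hz and 300 Hz:   {low_gap:.1f}")
print(f"mel gap between 5000 Hz and 5100 Hz: {high_gap:.1f}")
```

Under this formula the low-frequency gap is several times larger than the high-frequency gap, which is the sense in which a mel-based representation separates similar low-frequency sounds more readily than similar high-frequency sounds.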
Claim 18, Van in view of Jang discloses a first stream (Van: Section 0076, lines 1-4); however, Van in view of Jang does not disclose a self-attention layer.
Wu discloses a self-attention layer. (Section 0023, lines 1-3: the encoder neural network includes an attention network)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the teaching of using an attention network in processing audio data. The motivation is that using the attention network allows the inputs to interact with each other and find out which of them they should pay more attention to.
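
The statement that attention lets the inputs "interact with each other" can be pictured with the following minimal sketch of scaled dot-product self-attention. It is illustrative only and is not Wu's network: the learned query, key, and value projections of a real attention layer are omitted as a simplifying assumption, and the function names and input vectors are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention without learned projections.

    Each input vector is compared with every input vector (itself included);
    the resulting weights say how much attention it pays to each of them,
    and its output is the weighted average of all input vectors.
    """
    dim = len(vectors[0])
    scale = math.sqrt(dim)
    outputs, all_weights = [], []
    for query in vectors:
        logits = [
            sum(q * k for q, k in zip(query, key)) / scale for key in vectors
        ]
        weights = softmax(logits)
        mixed = [
            sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(vectors)
```

In this example the first input assigns more weight to the second input, which points in nearly the same direction, than to the third, which is orthogonal to it.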

Claim 19, Van in view of Jang does not disclose the computer-readable media of claim 18, wherein the self-attention layer comprises a factorized feed forward layer.
Wu discloses wherein the self-attention layer comprises a factorized feed forward layer. (Section 0023, lines 1-3: the encoder neural network includes an attention network)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the teaching of using an attention network in processing audio data. The motivation is that using the attention network allows the inputs to interact with each other and find out which of them they should pay more attention to.
Claim 20, Van in view of Jang and further in view of Wu discloses wherein the self-attention layer comprises a skip connection. (Wu: Section 0049: the convolutional subnetwork includes skip connections)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the teaching of using skip connections in processing audio data. The motivation is that using the skip connection allows the inputs to interact with each other and find out which of them they should pay more attention to.
Cited Art
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Sundaram (US 20200065675) relates to training a convolutional neural network-based classifier on training data using a backpropagation-based gradient update technique that progressively matches outputs of the convolutional neural network-based classifier with corresponding ground truth labels. The convolutional neural network-based classifier comprises groups of residual blocks; each group of residual blocks is parameterized by a number of convolution filters in the residual blocks, a convolution window size of the residual blocks, and an atrous convolution rate of the residual blocks. The size of the convolution window varies between groups of residual blocks, and the atrous convolution rate varies between groups of residual blocks.
Kalchbrenner (US 20180329897) discloses a system configured to receive an input sequence of source embeddings representing a source sequence of words in a source natural language and to generate an output sequence of target embeddings representing a target sequence of words that is a translation of the source sequence into a target natural language, the system comprising: a dilated convolutional neural network configured to process the input sequence of source embeddings to generate an encoded representation of the source sequence, and a masked dilated convolutional neural network configured to process the encoded representation of the source sequence to generate the output sequence of target embeddings.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Akwasi M Sarpong whose telephone number is (571)270-3438. The examiner can normally be reached Mon-Fri. 8:00am-4:00pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, KING D POON can be reached on 571-272-7440. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/AKWASI M SARPONG/
Primary Examiner, Art Unit 2675
08/10/2022