DETAILED ACTION
Introduction
This office action is in response to Applicant’s submission filed on 1/4/2021. Claims 1-22 are pending in this application. As such, claims 1-22 have been examined.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement
The information disclosure statements (IDS) submitted on 9/16/2022 and 9/20/2022 were filed. The submissions are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claims 1-7, 9-13, and 15-22 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Sypniewski et al. (US 20200035222 A1) (hereinafter “Sypniewski”).

Regarding Claim 1, Sypniewski teaches an automatic speech recognition system, comprising: an encoder comprising a plurality of encoder layers sequentially executed by one or more graphic processing units (GPUs), wherein at least one encoder layer comprises a plurality of encoder sublayers fused into one or more encoder kernels, wherein the encoder receives one or more audio sequences and generates an encoder output (Sypniewski Paragraph 66 - CNN stack 202 may include any number of convolutional layers, each including various size convolutional kernels. The relevant hyperparameters of CNN stack 202 include the dimension and number of CNN stack, the dimension and number of convolutional kernels at each layer, the stride of the convolutional kernels, and the number and function of any pooling stack. Convolutional kernels may be square, such as of size 5×5, or rectangular, such as of size 3×9, for example. Rectangular convolutional kernels that are ‘narrow’ along the time-axis may be more sensitive to features that are spread out over a wide range of frequencies but local to a short time period. Similarly, rectangular convolutional kernels that are ‘wider’ along the time-axis may detect acoustic features that are confined to a relatively narrow audio band but are of longer duration in time. Convolution kernels may also be referred to as windows, filters, or feature detectors. CNN in an encoder-decoder network.);
a first pair of ping-pong buffers, wherein the one or more encoder kernels respectively read from one of the first pair of ping-pong buffers and write into the other of the first pair of ping-pong buffers (Sypniewski Paragraph 156 - As portions of the training data are downloaded from the training data store 1610, the training augmentation system 1642 buffers it in the memory of the server 1640. Training data augmentation system 1642 monitors the streaming download to determine if sufficient data has been downloaded to begin training. Training data augmentation system 1642 determines when the amount of data downloaded exceeds a threshold to determine when to begin training. Training may begin before the entire training dataset is downloaded, by training using the buffered portions. Once sufficient training data is buffered on the server 1640, the training data augmentation system 1642 applies the requested augmentations to the buffered data. It sends the augmented training data as a stream to the training process 1644. The training data augmentation system 1642 continues to stream additional training data from the training data store 1610. As this data is buffered on server 1640, training data augmentation system 1642 applies the requested augmentations to the data and streams it to the training process 1644.);
and a decoder that receives a decoder input based on the encoder output and generates a decoder output, wherein the decoder comprises a plurality of decoder layers sequentially executed by one or more GPUs, wherein at least one decoder layer comprises a plurality of decoder sublayers fused into one or more decoder kernels (Sypniewski Paragraph 66 - CNN stack 202 may include any number of convolutional layers, each including various size convolutional kernels. The relevant hyperparameters of CNN stack 202 include the dimension and number of CNN stack, the dimension and number of convolutional kernels at each layer, the stride of the convolutional kernels, and the number and function of any pooling stack. Convolutional kernels may be square, such as of size 5×5, or rectangular, such as of size 3×9, for example. Rectangular convolutional kernels that are ‘narrow’ along the time-axis may be more sensitive to features that are spread out over a wide range of frequencies but local to a short time period. Similarly, rectangular convolutional kernels that are ‘wider’ along the time-axis may detect acoustic features that are confined to a relatively narrow audio band but are of longer duration in time. Convolution kernels may also be referred to as windows, filters, or feature detectors. CNN in an encoder-decoder network.).
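For illustration of the ping-pong buffer limitation recited in claim 1, the following is a minimal sketch and is not drawn from Sypniewski or from the application. It assumes each layer is a function that reads from one buffer and writes into the other, with the roles swapping after every layer. The names run_layers, scale_by_two, and add_one are hypothetical.

```python
def run_layers(layers, x):
    """Run layers in sequence, alternating between two buffers."""
    # Hypothetical pair of ping-pong buffers: one holds the current input,
    # the other receives the current layer's output.
    buffers = [list(x), [0.0] * len(x)]
    src = 0
    for layer in layers:
        dst = 1 - src
        layer(buffers[src], buffers[dst])  # read from one, write into the other
        src = dst                          # swap roles for the next layer
    return buffers[src]


def scale_by_two(inp, out):
    # Stand-in for a layer: doubles every element.
    for i, v in enumerate(inp):
        out[i] = 2.0 * v


def add_one(inp, out):
    # Stand-in for a layer: adds one to every element.
    for i, v in enumerate(inp):
        out[i] = v + 1.0


result = run_layers([scale_by_two, add_one, scale_by_two], [1.0, 2.0])
```

In this sketch no layer allocates new storage; the output of each layer becomes the input of the next simply by swapping which buffer is read.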

Regarding Claim 2, Sypniewski teaches all of the limitations of claim 1. Sypniewski also teaches that the plurality of encoder layers comprise a first encoder layer, an intermediate encoder layer, and a last encoder layer, wherein the first encoder layer receives the one or more audio sequences and generates a first encoder layer output, the intermediate encoder layer receives the first encoder layer output from the first encoder layer and generates an intermediate encoder layer output, and the last encoder layer receives the intermediate encoder layer output and generates the encoder output (Sypniewski Paragraph 72 - In some embodiments, the input to CNN stack 202 is a segment of frames of audio features produced by front-end module 201. For each output frame, a context of frames before and/or after the output frame may be included in the segment. For example, for each frame of audio, CNN stack 202 may operate on a ‘window’ of the 5 previous frames and the following 5 frames, for a total of 11 frames. In this example, if there are 40 audio features per frame, CNN stack 202 would then operate on an input having dimensions of 11×40.);
and wherein the decoder receives the decoder input at a current time step of a plurality of time steps, and the decoder input is based on the encoder output and a decoder output generated by the decoder at a time step prior to the current time step (Sypniewski Paragraph 81 - LSTM and GRU type RNNs include at least one back loop where the output activation of a neural network enters as an input to the neural network at the next time step. In other words, the output activation of at least one neural network node is an input to at least one neural network node of the same or a prior layer in a successive time step. More specifically, the LSTM or GRU compute a hidden state, comprising a vector, through a series of mathematical operations, which is produced as an output of the neural network at each time step. The hidden state is passed as an input to the next time step of the LSTM or GRU. In an embodiment, an LSTM has three inputs at a particular time step, the hidden step passed from the previous time step, the output tensor value of the previous time step, and the input frame or tensor representation of the frame of the current time step.).

Regarding Claim 3, Sypniewski teaches all of the limitations of claim 2. Sypniewski also teaches that the plurality of encoder layers further comprise one or more intermediate encoder layers, and the one or more intermediate encoder layers respectively receives an output generated from a previous intermediate encoder layer (Sypniewski Paragraph 55 - Neural networks comprise a plurality of neural network nodes organized in one or more layers. Each node has one or more inputs, an activation function, and an output. The inputs and output may generally be real number values. The inputs to the node are combined through a linear combination with weights and the activation function is applied to the result to produce the output. The output may be transmitted as an input to one or more other nodes in subsequent layers. The weights in the linear combination may be referred to as the weights of the node, and each node may have different weights. Neural network nodes may be organized in one or more layers. An input layer may comprise input nodes whose values may correspond to inputs to the neural network, without use of an activation function. An output layer may comprise one or more output nodes corresponding to output from the neural network. Neural network layers other than the input layer and output layer may be hidden layers, and the nodes in those layers may be referred to as hidden nodes.).

Regarding Claim 4, Sypniewski teaches all of the limitations of claim 1. Sypniewski also teaches a decoder memory cache, wherein the one or more decoder kernels parallelly communicate with the decoder memory cache (Sypniewski Paragraph 157 - The buffered, un-augmented training dataset downloaded from the training data store 1610 to server 1640 may be stored temporarily or permanently on server 1640 to provide caching. When training process 1646 requests to train on the same training data, the training data augmentation system 1642 may check the cache to see if the training dataset is already buffered in local memory of the server 1640. If the training dataset is already present, the training data augmentation system may use the cached version of the training dataset, instead of fetching the training dataset from the training data store 1610. If the training dataset is not in the cache, then the training data augmentation system 1642 may initiate a fetch of the training dataset from the training data store 1610.).

Regarding Claim 5, Sypniewski teaches all of the limitations of claim 1. Sypniewski also teaches that the plurality of encoder sublayers comprise a first encoder fully connected (FC) sublayer, a second encoder FC sublayer, and a third encoder FC sublayer, wherein the first encoder FC sublayer, the second encoder FC sublayer, and the third encoder FC sublayer are fused into an encoder FC kernel (Sypniewski Paragraph 74 - A fully-connected neural network is a neural network in which all nodes in a layer of the neural network are connected to all nodes of the subsequent layer of the neural network. A fully-connected layer 203 comprises one or more fully-connected neural networks placed end-to-end. The term fully-connected comes from the fact that each layer is fully-connected to the subsequent layer. A fully-connected neural network is one kind of densely connected neural network, where a densely connected neural network is one where most of the nodes in each layer of the neural network have edge connections to most of the nodes in the subsequent layer.);
and wherein the plurality of decoder sublayers comprise a first decoder FC sublayer, a second decoder FC sublayer, and a third decoder FC sublayer, wherein the first decoder FC sublayer, the second decoder FC sublayer, and the third decoder FC sublayer are fused into a decoder FC kernel (Sypniewski Paragraph 74 - A fully-connected neural network is a neural network in which all nodes in a layer of the neural network are connected to all nodes of the subsequent layer of the neural network. A fully-connected layer 203 comprises one or more fully-connected neural networks placed end-to-end. The term fully-connected comes from the fact that each layer is fully-connected to the subsequent layer. A fully-connected neural network is one kind of densely connected neural network, where a densely connected neural network is one where most of the nodes in each layer of the neural network have edge connections to most of the nodes in the subsequent layer.).

Regarding Claim 6, Sypniewski teaches all of the limitations of claim 5. Sypniewski also teaches that the plurality of encoder sublayers comprise an encoder input embedding sublayer and an encoder positional embedding sublayer, wherein the encoder input embedding sublayer and the encoder positional embedding sublayer are fused into an encoder embedding kernel (Sypniewski Paragraph 86 - a second fully-connected stack 205 receives the output features from RNN stack 204 and produces a word embedding. Similar to first fully-connected stack 203, second fully-connected stack 205 serves several functions. In an embodiment, second fully-connected stack 205 reduces the dimensionality of the output of RNN stack 204 to something more concise. In an embodiment, second fully-connected stack 205 produces a word embedding of significantly reduced dimension compared to the output of RNN stack 204. This word embedding contains information related to the word predicted for a given time frame, and also information regarding words around the predicted word.);
and wherein the encoder embedding kernel obtains an input embedding by mapping the one or more audio sequences into an embedding vector based on a word embedding table, obtains a positional embedding corresponding to a position within the one or more audio sequences, and generates an encoder embedding vector by summing the input embedding and the positional embedding (Sypniewski Paragraph 87 - This word embedding, or word vector, representation is then passed to output stack 206. Output stack 206 has an output node for each word of a vocabulary and a blank or null output. For each frame of input audio data, output stack 206 produces a probability distribution over its output nodes for a word transcription or a null output. For each spoken word in the input audio, one frame of the output sequence will be desired to have a high probability prediction for a word of the vocabulary. All other frames of audio data that correspond to the word will be desired to contain the null or blank output. The alignment of a word prediction with the audio of the word is dependent on the hyperparameters of the various stacks and the data used for training.).
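For illustration of the embedding limitation recited in claim 6, the following is a minimal sketch and is not drawn from Sypniewski or from the application. It assumes the input has already been reduced to integer indices into an embedding table and that positional embeddings are stored in a second table indexed by position. The names embed, embedding_table, and positional_table are hypothetical.

```python
def embed(token_ids, embedding_table, positional_table):
    """Look up an input embedding per position and add the positional embedding."""
    out = []
    for pos, tok in enumerate(token_ids):
        input_embedding = embedding_table[tok]       # table lookup
        positional_embedding = positional_table[pos]  # depends only on position
        # Element-wise sum of the two embeddings gives the embedding vector.
        out.append([a + b for a, b in zip(input_embedding, positional_embedding)])
    return out


# Hypothetical two-entry tables with two-dimensional embeddings.
embedding_table = {0: [1.0, 0.0], 1: [0.0, 1.0]}
positional_table = [[0.5, 0.5], [0.25, 0.25]]

vectors = embed([1, 0], embedding_table, positional_table)
```

Performing the lookup and the sum in one pass is one way the two sublayers could be combined into a single step.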

Regarding Claim 7, Sypniewski teaches all of the limitations of claim 6. Sypniewski also teaches that the encoder FC kernel loads a pre-combined weight matrix based on a first query matrix, a first key matrix, and a first value matrix, wherein the first query matrix is generated by packing a plurality of queries, the first key matrix is generated by packing a plurality of keys, the first value matrix is generated by packing a plurality of values, the plurality of queries, keys, and values are related to the plurality of encoder layers (Sypniewski Paragraph 110 - In an embodiment, training is performed on a batch of utterances at a time. In some embodiments, the utterances in a training batch must be of the same length. Having samples of the same length may simplify tensor operations performed in the forward propagation and backward propagation stages, which may be implemented in part through matrix multiplications with matrices of fixed dimension. For the matrix operations to be performed, it may be necessary that each of the training samples have the same length. The batch of training samples may be created by splitting an audio file into utterances, such as 7-10 second long portions which may correspond to a word, phrase, or series of words and/or phrases. In an audio file, naturally some utterances may be longer or shorter than others. In an embodiment where training samples must be the same length, techniques may be used to adjust the length of some of the samples.);
and wherein the plurality of sublayers further comprise an encoder matrix multiplication sublayer and an encoder concatenating sublayer, wherein the encoder matrix multiplication sublayer and the encoder concatenating sublayer are fused into an encoder multiplication kernel, and the encoder multiplication kernel generates an encoder multiplication output for a plurality of attention heads (Sypniewski Paragraph 110 - In an embodiment, training is performed on a batch of utterances at a time. In some embodiments, the utterances in a training batch must be of the same length. Having samples of the same length may simplify tensor operations performed in the forward propagation and backward propagation stages, which may be implemented in part through matrix multiplications with matrices of fixed dimension. For the matrix operations to be performed, it may be necessary that each of the training samples have the same length. The batch of training samples may be created by splitting an audio file into utterances, such as 7-10 second long portions which may correspond to a word, phrase, or series of words and/or phrases. In an audio file, naturally some utterances may be longer or shorter than others. In an embodiment where training samples must be the same length, techniques may be used to adjust the length of some of the samples.).
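For illustration of the pre-combined weight matrix recited in claim 7, the following is a minimal sketch and is not drawn from Sypniewski or from the application. It assumes one possible reading of the limitation, in which the query, key, and value weight matrices are concatenated column-wise ahead of time so that a single matrix multiplication yields all three projections. The names matmul, combine_qkv, and split_qkv are hypothetical.

```python
def matmul(a, b):
    """Plain row-by-column matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def combine_qkv(w_q, w_k, w_v):
    # Column-wise concatenation: each row becomes [w_q row | w_k row | w_v row].
    return [q + k + v for q, k, v in zip(w_q, w_k, w_v)]


def split_qkv(qkv, d):
    # Slice the combined projection back into its three parts of width d.
    q = [row[:d] for row in qkv]
    k = [row[d:2 * d] for row in qkv]
    v = [row[2 * d:] for row in qkv]
    return q, k, v


# Hypothetical 2x2 weights and a single input row.
w_q = [[1.0, 0.0], [0.0, 1.0]]
w_k = [[2.0, 0.0], [0.0, 2.0]]
w_v = [[0.0, 1.0], [1.0, 0.0]]
x = [[1.0, 2.0]]

w_qkv = combine_qkv(w_q, w_k, w_v)            # built once, ahead of time
q, k, v = split_qkv(matmul(x, w_qkv), 2)      # one multiplication instead of three
```

Under this assumption, the combined product equals the three separate products, so the result is unchanged while the number of multiplications drops from three to one.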

Regarding Claim 9, Sypniewski teaches all of the limitations of claim 7. Sypniewski also teaches that the plurality of encoder sublayers further comprise an encoder layer norm sublayer and an encoder additional FC sublayer, the encoder additional FC sublayer comprises a bias, wherein the encoder layer norm sublayer and the bias are fused into an encoder normalization kernel, and wherein the encoder layer norm sublayer receives a first sublayer input and generated a normalized first sublayer input, the encoder additional FC sublayer adds the normalized first sublayer input and the encoder embedding vector generated by the encoder embedding kernel (Sypniewski Paragraph 133 - A one-hot encoding is created for the phonetic representation of the word and the frequency of the word in the custom domain, optionally with normalization such as log normalization, is concatenated to the one-hot encoding. The resulting vector is input into the weights predictor 1320. The output vector provides the predicted weights. The predicted weights are used to replace the weights of the corresponding layer of the neural network in order to customize the neural network for the custom domain. If a word was unseen in the general training set, then a new node is added to the output layer and the weights of the node are initialized to be the predicted weights. In some embodiments, customized weights are predicted for all words in the vocabulary and not just words that occur with high frequency. Optionally, the neural network may be further trained on training examples that come from the custom domain.).

Regarding Claim 10, Sypniewski teaches all the limitations of claim 1. Sypniewski also teaches that the plurality of encoder sublayers comprises a fourth encoder fully connected (FC) sublayer, an encoder activation sublayer, and a fifth encoder FC sublayer, the fourth encoder FC sublayer comprises a first bias, the fifth encoder FC sublayer comprises a second bias, the encoder activation sublayer and the first bias are fused into an encoder activation kernel, and the second bias and a subsequent sublayer are fused into a single encoder kernel, wherein the subsequent sublayer subsequently follows the fifth encoder FC sublayer (Sypniewski Paragraph 121 - In an embodiment, the activation outputs of the nodes of the selector neural network layer portion 1110 are stored in a tensor. The activation outputs are output from the activation function of each node. Each element of the tensor may correspond to one node output. In selector neural network layer portion 1110 there are three nodes, which means that there are three output values stored in the tensor. The tensor of activation outputs is compared with all of the selectors 1115 in the expert knowledge store 1130. In an embodiment, the comparison is performed by using a distance metric. In an embodiment, the distance metric is the cosine similarity between the tensor of activation outputs and a selector 1115. In an embodiment, the distance metric is the dot product between the tensor of activation outputs and a selector 1115. The closest selector 1115 according to the distance metric is chosen as the correct row of the expert knowledge store. Backpropagation is performed based on the difference between those two values, the ground-truth output and the actual output of the neural network. The backpropagation is performed through the expert neural network layer inserted into gap 1120 just as if the expert neural network layer was a permanent part of neural network 1100 and adjusts the weights of each of the nodes of the expert neural network layer through training.).

Regarding Claim 11, Sypniewski teaches all of the limitations of claim 6. Sypniewski also teaches that the one or more decoder kernels comprise a decoder embedding kernel, the decoder embedding kernel receives a beam search output from a beam search kernel and generates a decoder embedding vector based on the beam search output (Sypniewski Paragraph 101 - At each iteration of the iterative beam search, the mappings or alignments of audio phonemes to text phonemes from the prior iteration are used as a starting point and then changed to create multiple new mappings or alignments, which are known as candidates. The candidates are scored and the n best-scoring candidates are selected for expansion at the next level of the iterative beam search, where n is the branching factor of the iterative beam search. By expanding only n candidates at each level, the algorithm avoids having to expand an exponential number of candidate nodes, which could be the case if a traditional breadth-first search is used.).
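For illustration of the iterative beam search described in the cited Paragraph 101, the following is a minimal sketch and is not Sypniewski's implementation. It assumes candidates are strings, that each candidate expands into a fixed set of successors, and that a higher score is better. The names beam_search, expand, and score are hypothetical.

```python
def beam_search(start, expand, score, n, steps):
    """Keep only the n best-scoring candidates at each level of the search."""
    beam = [start]
    for _ in range(steps):
        # Create new candidates from every candidate kept at the prior level.
        candidates = [c for item in beam for c in expand(item)]
        # Score the candidates and keep the n best (n is the branching factor).
        beam = sorted(candidates, key=score, reverse=True)[:n]
    return beam


def expand(seq):
    # Hypothetical expansion: append one of two symbols.
    return [seq + "a", seq + "b"]


def score(seq):
    # Hypothetical scoring function: prefer sequences with more "a" symbols.
    return seq.count("a")


best = beam_search("", expand, score, n=2, steps=3)
```

With n = 2, each level expands at most four candidates, whereas an unpruned breadth-first expansion would double the candidate count at every level.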

Regarding Claim 12, Sypniewski teaches all of the limitations of claim 11. Sypniewski also teaches that the decoder FC kernel loads a pre-combined weight matrix based on a second query matrix, a second key matrix, and a second value matrix, wherein the second query matrix is generated by packing a plurality of queries, the second key matrix is generated by packing a plurality of keys, the second key matrix is generated by packing a plurality of values, and the plurality of queries, keys, and values are related to the plurality of decoder layers (Sypniewski Paragraph 120 -  Example neural network 1100 is a fully-connected neural network with multiple layers of hidden states. Neural network layer portion 1110 is a selector and neural network layer portion 1120 is a gap with no hidden nodes and that is filled by swapping expert neural network layer portions in and out. After an audio file is input to the neural network system, whether for training or inference, forward propagation occurs as normal. When the gap 1120 is reached, forward propagation cannot continue until an expert layer is inserted. In order to select the expert layer, forward propagation occurs through selector neural network layer portion 1110 as normal. The activation outputs of the nodes of the selector neural network layer portion 1110 are used as a query to find the expert neural network layer to insert into gap 1120. Expert knowledge store 1130 stores selectors 1115 that each serve as an index for one expert neural network layer portion 1125 that corresponds to the selector. Each expert neural network layer may comprise the weights for the inbound edges to the nodes of the expert neural network layer and the activation function of the nodes.).

Regarding Claim 13, Sypniewski teaches all of the limitations of claim 12. Sypniewski also teaches that the plurality of decoder sublayers comprise a decoder matrix multiplication sublayer and a decoder concatenating sublayer, wherein the decoder matrix multiplication sublayer and the decoder concatenating sublayer are fused into a decoder multiplication kernel, and the decoder multiplication kernel generates a decoder multiplication output by concatenating a plurality of attention heads (Sypniewski Paragraph 84 - Bidirectional RNNs may therefore make current-frame predictions based on both preceding frames and following frames. In a unidirectional RNN, the tensors corresponding to frames are processed sequentially by the RNN in a single direction such as front to back or back to front. In a bidirectional RNN, the tensors corresponding to frames may be processed in both directions, front to back and back to front, with the information produced from the forward and backward runs combined at the end of processing, such as by concatenation, addition, or other operations.).

Regarding Claim 15, Sypniewski teaches all of the limitations of claim 11. Sypniewski also teaches that the decoder FC kernel loads a pre-combined weight matrix based on a second query matrix, a first key matrix, and a first value matrix, wherein the second query matrix is generated by packing a plurality of queries, the first key matrix is generated by packing a plurality of keys, the first value matrix is generated by packing a plurality of values, the plurality of keys and values are related to the plurality of encoder layers, and the plurality of queries are related to the plurality of decoder layers (Sypniewski Paragraph 120 -  Example neural network 1100 is a fully-connected neural network with multiple layers of hidden states. Neural network layer portion 1110 is a selector and neural network layer portion 1120 is a gap with no hidden nodes and that is filled by swapping expert neural network layer portions in and out. After an audio file is input to the neural network system, whether for training or inference, forward propagation occurs as normal. When the gap 1120 is reached, forward propagation cannot continue until an expert layer is inserted. In order to select the expert layer, forward propagation occurs through selector neural network layer portion 1110 as normal. The activation outputs of the nodes of the selector neural network layer portion 1110 are used as a query to find the expert neural network layer to insert into gap 1120. Expert knowledge store 1130 stores selectors 1115 that each serve as an index for one expert neural network layer portion 1125 that corresponds to the selector. Each expert neural network layer may comprise the weights for the inbound edges to the nodes of the expert neural network layer and the activation function of the nodes.).

Regarding Claim 16, Sypniewski teaches all of the limitations of claim 13. Sypniewski also teaches that the plurality of decoder sublayers further comprise a decoder layer norm sublayer and a decoder additional FC sublayer, wherein the decoder additional FC sublayer comprises a bias, wherein the decoder layer norm sublayer and the bias are fused into an decoder normalization kernel, wherein the decoder layer norm sublayer receives a first sublayer input and generated a normalized first sublayer input, the decoder additional FC sublayer adds the normalized first sublayer input and the decoder embedding vector generated by the decoder embedding kernel (Sypniewski Paragraph 133 - A one-hot encoding is created for the phonetic representation of the word and the frequency of the word in the custom domain, optionally with normalization such as log normalization, is concatenated to the one-hot encoding. The resulting vector is input into the weights predictor 1320. The output vector provides the predicted weights. The predicted weights are used to replace the weights of the corresponding layer of the neural network in order to customize the neural network for the custom domain. If a word was unseen in the general training set, then a new node is added to the output layer and the weights of the node are initialized to be the predicted weights. In some embodiments, customized weights are predicted for all words in the vocabulary and not just words that occur with high frequency. Optionally, the neural network may be further trained on training examples that come from the custom domain.);
and the plurality of decoder sublayers comprise a fourth decoder FC sublayer, a decoder activation sublayer, and a fifth decoder FC sublayer, the fourth decoder FC sublayer comprises a first bias and the fifth decoder FC sublayer comprise a second bias, the first bias and the decoder activation sublayer are fused into a decoder activation kernel, and the second bias and a subsequent sublayer are fused into a single decoder kernel, wherein the subsequent sublayer subsequently follows the fifth decoder FC sublayer (Sypniewski Paragraph 121 - In an embodiment, the activation outputs of the nodes of the selector neural network layer portion 1110 are stored in a tensor. The activation outputs are output from the activation function of each node. Each element of the tensor may correspond to one node output. In selector neural network layer portion 1110 there are three nodes, which means that there are three output values stored in the tensor. The tensor of activation outputs is compared with all of the selectors 1115 in the expert knowledge store 1130. In an embodiment, the comparison is performed by using a distance metric. In an embodiment, the distance metric is the cosine similarity between the tensor of activation outputs and a selector 1115. In an embodiment, the distance metric is the dot product between the tensor of activation outputs and a selector 1115. The closest selector 1115 according to the distance metric is chosen as the correct row of the expert knowledge store. Backpropagation is performed based on the difference between those two values, the ground-truth output and the actual output of the neural network. The backpropagation is performed through the expert neural network layer inserted into gap 1120 just as if the expert neural network layer was a permanent part of neural network 1100 and adjusts the weights of each of the nodes of the expert neural network layer through training.).

Regarding Claim 17, Sypniewski teaches an automatic speech recognition method, comprising: receiving, by an encoder, one or more audio sequences and generating an encoder output, wherein the encoder comprises a plurality of encoder layers sequentially executed by one or more graphic processing units (GPUs), wherein at least one encoder layer comprises a plurality of encoder sublayers fused into one or more encoder kernels, wherein the one or more encoder kernels respectively read from one of a first pair of ping-pong buffers and write into the other of the first pair of ping-pong buffers (Sypniewski Paragraph 66 - CNN stack 202 may include any number of convolutional layers, each including various size convolutional kernels. The relevant hyperparameters of CNN stack 202 include the dimension and number of CNN stack, the dimension and number of convolutional kernels at each layer, the stride of the convolutional kernels, and the number and function of any pooling stack. Convolutional kernels may be square, such as of size 5×5, or rectangular, such as of size 3×9, for example. Rectangular convolutional kernels that are ‘narrow’ along the time-axis may be more sensitive to features that are spread out over a wide range of frequencies but local to a short time period. Similarly, rectangular convolutional kernels that are ‘wider’ along the time-axis may detect acoustic features that are confined to a relatively narrow audio band but are of longer duration in time. Convolution kernels may also be referred to as windows, filters, or feature detectors. CNN in an encoder-decoder network.);
and receiving, by a decoder, a decoder input based on the encoder output and generating a decoder output, wherein the decoder comprises a plurality of decoder layers sequentially executed by one or more GPUs, and wherein at least one decoder layer comprises a plurality of decoder sublayers fused into one or more decoder kernels (Sypniewski Paragraph 66 - CNN stack 202 may include any number of convolutional layers, each including various size convolutional kernels. The relevant hyperparameters of CNN stack 202 include the dimension and number of CNN stack, the dimension and number of convolutional kernels at each layer, the stride of the convolutional kernels, and the number and function of any pooling stack. Convolutional kernels may be square, such as of size 5×5, or rectangular, such as of size 3×9, for example. Rectangular convolutional kernels that are ‘narrow’ along the time-axis may be more sensitive to features that are spread out over a wide range of frequencies but local to a short time period. Similarly, rectangular convolutional kernels that are ‘wider’ along the time-axis may detect acoustic features that are confined to a relatively narrow audio band but are of longer duration in time. Convolution kernels may also be referred to as windows, filters, or feature detectors. CNN in an encoder-decoder network.).

Regarding Claim 18, Sypniewski teaches all of the limitations of claim 17. Sypniewski also teaches that the plurality of encoder layers comprise a first encoder layer, an intermediate encoder layer, and a last encoder layer, wherein the first encoder layer receives the one or more audio sequences and generates a first encoder layer output, the intermediate encoder layer receives the first encoder layer output from the first encoder layer and generates an intermediate encoder layer output, the last encoder kernel receives the intermediate encoder layer output and generates the encoder output (Sypniewski Paragraph 72 - In some embodiments, the input to CNN stack 202 is a segment of frames of audio features produced by front-end module 201. For each output frame, a context of frames before and/or after the output frame may be included in the segment. For example, for each frame of audio, CNN stack 202 may operate on a ‘window’ of the 5 previous frames and the following 5 frames, for a total of 11 frames. In this example, if there are 40 audio features per frame, CNN stack 202 would then operate on an input having dimensions of 11×40.);
and wherein the decoder receives the decoder input at a current time step of a plurality of time steps, and the decoder input is based on the encoder output and a decoder output generated by the decoder at a time step prior to the current time step (Sypniewski Paragraph 81 - LSTM and GRU type RNNs include at least one back loop where the output activation of a neural network enters as an input to the neural network at the next time step. In other words, the output activation of at least one neural network node is an input to at least one neural network node of the same or a prior layer in a successive time step. More specifically, the LSTM or GRU compute a hidden state, comprising a vector, through a series of mathematical operations, which is produced as an output of the neural network at each time step. The hidden state is passed as an input to the next time step of the LSTM or GRU. In an embodiment, an LSTM has three inputs at a particular time step, the hidden step passed from the previous time step, the output tensor value of the previous time step, and the input frame or tensor representation of the frame of the current time step.).
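
For illustration only, the framing arithmetic in the cited Paragraph 72 (5 previous frames, the output frame, and 5 following frames, each with 40 audio features, giving an 11×40 input) might be sketched as follows. This is a minimal sketch and is not code from Sypniewski; the function name `context_window` is hypothetical, and repeating the edge frame at the start and end of the audio is an assumption, since the paragraph does not say how boundaries are handled.

```python
def context_window(frames, index, left=5, right=5):
    """Return the frames from index-left through index+right.

    Assumption: positions that fall outside the audio are filled by
    repeating the first or last frame.
    """
    last = len(frames) - 1
    return [
        frames[min(max(i, 0), last)]
        for i in range(index - left, index + right + 1)
    ]


# 20 frames of 40 features each; every feature holds the frame number.
frames = [[float(t)] * 40 for t in range(20)]
segment = context_window(frames, 10)
rows, cols = len(segment), len(segment[0])  # 11 frames by 40 features
```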

Regarding Claim 19, Sypniewski teaches all of the limitations of claim 18. Sypniewski also teaches that the plurality of encoder layers further comprise one or more intermediate encoder layers, and the one or more intermediate encoder layers respectively receives an output generated from a previous intermediate encoder layer (Sypniewski Paragraph 55 - Neural networks comprise a plurality of neural network nodes organized in one or more layers. Each node has one or more inputs, an activation function, and an output. The inputs and output may generally be real number values. The inputs to the node are combined through a linear combination with weights and the activation function is applied to the result to produce the output. The output may be transmitted as an input to one or more other nodes in subsequent layers. The weights in the linear combination may be referred to as the weights of the node, and each node may have different weights. Neural network nodes may be organized in one or more layers. An input layer may comprise input nodes whose values may correspond to inputs to the neural network, without use of an activation function. An output layer may comprise one or more output nodes corresponding to output from the neural network. Neural network layers other than the input layer and output layer may be hidden layers, and the nodes in those layers may be referred to as hidden nodes.);
and wherein the one or more decoder kernels parallelly communicate with a decoder memory cache (Sypniewski Paragraph 157 - The buffered, un-augmented training dataset downloaded from the training data store 1610 to server 1640 may be stored temporarily or permanently on server 1640 to provide caching. When training process 1646 requests to train on the same training data, the training data augmentation system 1642 may check the cache to see if the training dataset is already buffered in local memory of the server 1640. If the training dataset is already present, the training data augmentation system may use the cached version of the training dataset, instead of fetching the training dataset from the training data store 1610. If the training dataset is not in the cache, then the training data augmentation system 1642 may initiate a fetch of the training dataset from the training data store 1610.).
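
For illustration only, the check-then-fetch caching behavior described in the cited Paragraph 157 might be sketched as follows. This is a minimal sketch and is not code from Sypniewski; the class name `TrainingDataCache`, the `fetch` callback standing in for the training data store, and the dataset identifier are all hypothetical.

```python
class TrainingDataCache:
    """Keeps fetched datasets in local memory and reuses them on later requests."""

    def __init__(self, fetch):
        self._fetch = fetch  # hypothetical stand-in for a fetch from the data store
        self._store = {}

    def get(self, dataset_id):
        # Fetch from the data store only when the dataset is not already cached.
        if dataset_id not in self._store:
            self._store[dataset_id] = self._fetch(dataset_id)
        return self._store[dataset_id]


fetch_log = []


def fetch_from_store(dataset_id):
    fetch_log.append(dataset_id)
    return ["sample for " + dataset_id]


cache = TrainingDataCache(fetch_from_store)
first = cache.get("dataset-a")
second = cache.get("dataset-a")  # served from the cache, no second fetch
```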

Regarding Claim 20, Sypniewski teaches all of the limitations of claim 17. Sypniewski also teaches receiving, by a beam search kernel, the decoder output from the decoder (Sypniewski Paragraph 101 - At each iteration of the iterative beam search, the mappings or alignments of audio phonemes to text phonemes from the prior iteration are used as a starting point and then changed to create multiple new mappings or alignments, which are known as candidates. The candidates are scored and the n best-scoring candidates are selected for expansion at the next level of the iterative beam search, where n is the branching factor of the iterative beam search. By expanding only n candidates at each level, the algorithm avoids having to expand an exponential number of candidate nodes, which could be the case if a traditional breadth-first search is used.);
performing, by the beam search kernel, a beam search operation to generate a plurality of candidate symbols, wherein a number of the plurality of the candidate symbols is a pre-determined beam width (Sypniewski Paragraph 101 - At each iteration of the iterative beam search, the mappings or alignments of audio phonemes to text phonemes from the prior iteration are used as a starting point and then changed to create multiple new mappings or alignments, which are known as candidates. The candidates are scored and the n best-scoring candidates are selected for expansion at the next level of the iterative beam search, where n is the branching factor of the iterative beam search. By expanding only n candidates at each level, the algorithm avoids having to expand an exponential number of candidate nodes, which could be the case if a traditional breadth-first search is used.);
and sending, by the beam search kernel, the plurality of candidate symbols to a decoder embedding kernel of the decoder (Sypniewski Paragraph 101 - At each iteration of the iterative beam search, the mappings or alignments of audio phonemes to text phonemes from the prior iteration are used as a starting point and then changed to create multiple new mappings or alignments, which are known as candidates. The candidates are scored and the n best-scoring candidates are selected for expansion at the next level of the iterative beam search, where n is the branching factor of the iterative beam search. By expanding only n candidates at each level, the algorithm avoids having to expand an exponential number of candidate nodes, which could be the case if a traditional breadth-first search is used.).
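
For illustration only, the pruning step described in the cited Paragraph 101 (scoring candidates and expanding only the n best at each level) might be sketched as follows. This is a minimal sketch and is not code from Sypniewski: the paragraph describes a search over phoneme alignments, whereas this example searches over symbol sequences scored by summed log-probabilities, which is an assumption made to keep the example short. The function name `beam_search` and the sample probabilities are hypothetical.

```python
import math


def beam_search(step_log_probs, beam_width):
    """Keep only the beam_width best-scoring candidates at each level.

    step_log_probs: one dict per level, mapping symbol -> log-probability.
    Returns a list of (symbol_sequence, score) pairs, best first.
    """
    beams = [((), 0.0)]
    for log_probs in step_log_probs:
        # Expand every surviving candidate by every symbol at this level.
        candidates = [
            (prefix + (symbol,), score + lp)
            for prefix, score in beams
            for symbol, lp in log_probs.items()
        ]
        # Prune to the n best candidates, where n is the beam width.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams


steps = [
    {"a": math.log(0.6), "b": math.log(0.4)},
    {"a": math.log(0.1), "b": math.log(0.9)},
]
best = beam_search(steps, beam_width=2)
```

Because only `beam_width` candidates survive each level, the number of expanded candidates grows linearly with the number of levels rather than exponentially.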

Regarding Claim 21, Sypniewski teaches a non-transitory computer readable storage medium, comprising instructions stored therein, wherein, upon execution of the instructions by one or more processors, the instructions cause the one or more processors to perform acts comprising (Sypniewski Paragraph 53 - The apparatuses and methods described in this application may be partially or fully implemented by one or more computer programs executed by one or more processors. The computer programs include processor-executable instructions that are stored on at least one non-transitory tangible computer readable medium. The computer programs may also include and/or rely on stored data.):
receiving, by an encoder, one or more audio sequences and generating an encoder output, wherein the encoder comprises a plurality of encoder layers sequentially executed by the one or more processors, wherein at least one encoder layer comprises a plurality of encoder sublayers fused into one or more encoder kernels, wherein the one or more encoder kernels respectively read from one of a first pair of ping-pong buffers and write into the other of the first pair of ping-pong buffers (Sypniewski Paragraph 66 - CNN stack 202 may include any number of convolutional layers, each including various size convolutional kernels. The relevant hyperparameters of CNN stack 202 include the dimension and number of CNN stack, the dimension and number of convolutional kernels at each layer, the stride of the convolutional kernels, and the number and function of any pooling stack. Convolutional kernels may be square, such as of size 5×5, or rectangular, such as of size 3×9, for example. Rectangular convolutional kernels that are ‘narrow’ along the time-axis may be more sensitive to features that are spread out over a wide range of frequencies but local to a short time period. Similarly, rectangular convolutional kernels that are ‘wider’ along the time-axis may detect acoustic features that are confined to a relatively narrow audio band but are of longer duration in time. Convolution kernels may also be referred to as windows, filters, or feature detectors. CNN in an encoder-decoder network.);
and receiving, by a decoder, a decoder input based on the encoder output and generating a decoder output, wherein the decoder comprises a plurality of decoder layers sequentially executed by the one or more processors, and wherein at least one decoder layer comprises a plurality of decoder sublayers fused into one or more decoder kernels (Sypniewski Paragraph 66 - CNN stack 202 may include any number of convolutional layers, each including various size convolutional kernels. The relevant hyperparameters of CNN stack 202 include the dimension and number of CNN stack, the dimension and number of convolutional kernels at each layer, the stride of the convolutional kernels, and the number and function of any pooling stack. Convolutional kernels may be square, such as of size 5×5, or rectangular, such as of size 3×9, for example. Rectangular convolutional kernels that are ‘narrow’ along the time-axis may be more sensitive to features that are spread out over a wide range of frequencies but local to a short time period. Similarly, rectangular convolutional kernels that are ‘wider’ along the time-axis may detect acoustic features that are confined to a relatively narrow audio band but are of longer duration in time. Convolution kernels may also be referred to as windows, filters, or feature detectors. CNN in an encoder-decoder network.).

Regarding Claim 22, Sypniewski teaches all of the limitations of claim 21. Sypniewski also teaches receiving, by a beam search kernel, the decoder output from the decoder (Sypniewski Paragraph 101 - At each iteration of the iterative beam search, the mappings or alignments of audio phonemes to text phonemes from the prior iteration are used as a starting point and then changed to create multiple new mappings or alignments, which are known as candidates. The candidates are scored and the n best-scoring candidates are selected for expansion at the next level of the iterative beam search, where n is the branching factor of the iterative beam search. By expanding only n candidates at each level, the algorithm avoids having to expand an exponential number of candidate nodes, which could be the case if a traditional breadth-first search is used.);
performing, by the beam search kernel, a beam search operation to generate a plurality of candidate symbols, wherein a number of the plurality of the candidate symbols is a pre-determined beam width (Sypniewski Paragraph 101 - At each iteration of the iterative beam search, the mappings or alignments of audio phonemes to text phonemes from the prior iteration are used as a starting point and then changed to create multiple new mappings or alignments, which are known as candidates. The candidates are scored and the n best-scoring candidates are selected for expansion at the next level of the iterative beam search, where n is the branching factor of the iterative beam search. By expanding only n candidates at each level, the algorithm avoids having to expand an exponential number of candidate nodes, which could be the case if a traditional breadth-first search is used.);
and sending, by the beam search kernel, the plurality of candidate symbols to a decoder embedding kernel (Sypniewski Paragraph 101 - At each iteration of the iterative beam search, the mappings or alignments of audio phonemes to text phonemes from the prior iteration are used as a starting point and then changed to create multiple new mappings or alignments, which are known as candidates. The candidates are scored and the n best-scoring candidates are selected for expansion at the next level of the iterative beam search, where n is the branching factor of the iterative beam search. By expanding only n candidates at each level, the algorithm avoids having to expand an exponential number of candidate nodes, which could be the case if a traditional breadth-first search is used.).

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claim(s) 8 and 14 is/are rejected under 35 U.S.C. 103 as being unpatentable over Sypniewski in view of John (US 20180253818 A1) (Further referred to as “John”).

Regarding Claim 8, Sypniewski teaches all of the limitations of claim 1. John further teaches that the plurality of sublayers comprise a scale sublayer, a masking sublayer, and a SoftMax sublayer, wherein the scale sublayer, the masking sublayer, and the SoftMax sublayer are fused into a single encoder kernel, and the masking sublayer performs a masking operation based on a pre-generated mask that is determined based on a length of the audio sequence (John Paragraph 44 - The weight updates of backpropagation can be done via stochastic gradient descent in light of learning rates, cost functions, and stochastic terms. The choice of the cost function depends on factors such as the learning type (supervised, unsupervised, reinforcement, etc.) and the activation function. For example, when performing supervised learning on a multiclass classification problem, common choices for the activation function and cost function are the SoftMax function and cross entropy function, respectively. These can be used to output object bounding boxes in the form of a binary mask.).
Sypniewski and John are both considered to be analogous to the claimed invention because both are directed to applying neural networks in order to better process information like audio signals. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the neural network speech recognition system of Sypniewski with the backpropagation method of John because doing so would allow for better precision in the output (John Paragraph 44 - They are also used for multi-scale regression to increase localization precision. DNN-based regression can learn features that capture geometric information in addition to being a good classifier such that they remove the limitation of designing a model which will capture parts and their relations explicitly, thereby helping to learn a wide variety of objects. The model consists of multiple layers, each of which has a rectified linear unit for non-linear transformation with some layers being convolutional, while others being fully connected.).

Regarding Claim 14, Sypniewski teaches all of the limitations of claim 1. John further teaches that the plurality of decoder sublayers comprise a scale sublayer, a masking sublayer, and a SoftMax sublayer, wherein the scale sublayer, the masking sublayer, and the SoftMax sublayer are fused into a single decoder kernel, and the masking sublayer performs a masking operation based on an attention mask that applies only on a decoder layer input that the at least one decoder layer has received (John Paragraph 44 - The weight updates of backpropagation can be done via stochastic gradient descent in light of learning rates, cost functions, and stochastic terms. The choice of the cost function depends on factors such as the learning type (supervised, unsupervised, reinforcement, etc.) and the activation function. For example, when performing supervised learning on a multiclass classification problem, common choices for the activation function and cost function are the SoftMax function and cross entropy function, respectively. These can be used to output object bounding boxes in the form of a binary mask.).
Sypniewski and John are both considered to be analogous to the claimed invention because both are directed to applying neural networks in order to better process information like audio signals. Therefore, it would have been obvious to someone of ordinary skill in the art before the effective filing date of the claimed invention to have modified the neural network speech recognition system of Sypniewski with the backpropagation method of John because doing so would allow for better precision in the output (John Paragraph 44 - They are also used for multi-scale regression to increase localization precision. DNN-based regression can learn features that capture geometric information in addition to being a good classifier such that they remove the limitation of designing a model which will capture parts and their relations explicitly, thereby helping to learn a wide variety of objects. The model consists of multiple layers, each of which has a rectified linear unit for non-linear transformation with some layers being convolutional, while others being fully connected.).

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: Lane et al. (US 20170236518 A1), J. Kim and W. Sung, "Multi-user real-time speech recognition with a GPU," 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012, pp. 1617-1620, doi: 10.1109/ICASSP.2012.6288204. (Year: 2012), and I. Kim et al., "Development of highly accurate real-time large scale speech recognition system," 2015 IEEE International Conference on Consumer Electronics (ICCE), 2015, pp. 493-496, doi: 10.1109/ICCE.2015.7066496. (Year: 2015).
Lane et al. (US 20170236518 A1) teaches “a GPU-accelerated speech recognition engine optimized for faster than real time speech recognition on a scalable server-client heterogeneous CPU-GPU architecture, which is specifically optimized to simultaneously decode multiple users in real-time” (Lane – Abstract).
J. Kim and W. Sung, "Multi-user real-time speech recognition with a GPU," 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012, pp. 1617-1620, doi: 10.1109/ICASSP.2012.6288204. (Year: 2012) teaches “a multi-user large vocabulary speech recognition system employing a fully composed one-level weighted finite state transducer (WFST) based network on a Graphics Processing Unit (GPU)” (Kim and Sung – Abstract).
I. Kim et al., "Development of highly accurate real-time large scale speech recognition system," 2015 IEEE International Conference on Consumer Electronics (ICCE), 2015, pp. 493-496, doi: 10.1109/ICCE.2015.7066496. (Year: 2015) teaches “the development of the framework and the algorithm for large scale automatic speech recognition systems” (Kim – Abstract).
Please see additional references in form PTO-892 for more details.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to UTHEJ KUNAMNENI whose telephone number is (571)272-5428. The examiner can normally be reached M-F 8:00-5:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached at (571) 272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/UTHEJ KUNAMNENI/ Examiner, Art Unit 2656
/EDGAR X GUERRA-ERAZO/ Primary Examiner, Art Unit 2656