Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

DETAILED ACTION
This office action is in response to application 16/516,849, which was filed 07/19/19. Claims 1-20 are pending in the application and have been considered.

Claim Objections
Claim 4 is objected to because of the following informalities:  the examiner assumes “The method in claim 4” should be “The method in claim 1”.  Appropriate correction is required.
Claim 10 is objected to because of the following informalities:  the examiner assumes “between initial layer” should be “between an initial layer”.  Appropriate correction is required.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains.  Patentability shall not be negated by the manner in which the invention was made.


Claims 1-5, 8-11, and 13-19 are rejected under 35 U.S.C. 103 as being unpatentable over Rao et al. (“Exploring Architectures, Data, and Units for Streaming End-to-End Speech Recognition with RNN-Transducer”, in Proc. ASRU 2017. 2 January 2018) in view of Catanzaro et al. (2017/0148433).

Consider claim 1, Rao discloses a method for facilitating the application of a unidirectional model to a stream of content by applying contexts at processing blocks in the unidirectional model (the RNN-T model consists of an encoder network, which maps input acoustic frames into a higher-level representation, and a prediction and joint network which together correspond to the decoder network. The decoder is conditioned on the history of previous predictions, i.e. contexts relative to the historical predictions, and is therefore considered a unidirectional model, Fig 1, page 2), the method comprising: identifying a stream of content (the RNN-T is used for streaming recognition, page 2, section 1. Introduction); identifying, based on a predetermined segment length, one or more frame segments of content in the stream and prior to receiving all frame segments in the stream (during each step of inference, the RNN-T model is fed the next acoustic frame, page 3, section 2: RNN-Transducer); accessing a unidirectional machine learning model that includes a plurality of processing layers, including an initial input layer and one or more hidden layers and a final output layer, each processing layer in the plurality of processing layers including a plurality of processing blocks that are sequentially positioned within each layer to provide output to a subsequent processing block and to receive input from a preceding processing block according to processing rules of the unidirectional machine learning model, where each processing block in the input layer receives an initial input as a frame segment from the stream (The encoder network is pre-trained as a hierarchical-CTC network simultaneously predicting phonemes, graphemes and wordpieces at 5, 10 and 12 LSTM layers respectively. … The decoder network is trained as a LSTM language model predicting wordpieces optimized with a cross-entropy loss. Finally, 
Rao does not specifically mention applying a future context, at least one of the plurality of processing blocks being temporally offset from the particular processing block by at least one frame segment length and that is temporally aligned with a corresponding frame segment that occurs prior to a frame segment that is temporally aligned with the particular processing block, the application of the future context causing output from the unidirectional machine learning model to be temporally offset from input of the stream into the unidirectional machine learning model. 
Catanzaro discloses applying a future context (FIG. 7, which depicts a row convolution architecture with future context size of 2, [0110]), at least one of the plurality of processing blocks being temporally offset from the particular processing block by at least one frame segment length and that is temporally aligned with a corresponding frame segment that occurs prior to a frame segment that is 
information is only needed to make an accurate prediction at the current time-step.  Suppose at time-step t, future contexts are used. Since the convolution-like operation in Eq. 11 is row oriented for both the parameter matrix W and the feature matrix, this layer is called row convolution, [0110-0111]. In other words, the current output is temporally offset from the future context frames being used).
It would have been obvious to one of ordinary skill in the art to modify the invention of Rao by applying a future context, at least one of the plurality of processing blocks being temporally offset from the particular processing block by at least one frame segment length and that is temporally aligned with a corresponding frame segment that occurs prior to a frame segment that is temporally aligned with the particular processing block, the application of the future context causing output from the unidirectional machine learning model to be temporally offset from input of the stream into the unidirectional machine learning model in order to reduce latency, as suggested by Catanzaro ([0109]).
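The row convolution relied upon above can be illustrated with a minimal sketch. It assumes the commonly described form of the operation, in which output feature i at time-step t is a weighted sum of feature i over the current frame and the next tau frames. The function names, the zero padding at the end of the stream, and the example weights are illustrative assumptions and are not taken from Catanzaro or from the application.

```python
def row_convolution(h, w):
    """Offline row convolution (sketch).

    h: list of T frames, each a list of d features (e.g. recurrent-layer outputs).
    w: list of d rows, each holding tau + 1 weights: one for the current frame
       and one for each of the tau future frames.
    Frames past the end of the input are treated as zeros (an assumption).
    """
    T, d, taps = len(h), len(w), len(w[0])
    out = []
    for t in range(T):
        row = []
        for i in range(d):
            acc = 0.0
            for j in range(taps):
                if t + j < T:  # skip future frames that do not exist
                    acc += w[i][j] * h[t + j][i]
            row.append(acc)
        out.append(row)
    return out


def streaming_row_convolution(frames, w):
    """Streaming variant: the output for time-step t is produced only after
    frame t + tau has arrived, so output lags input by tau frames."""
    tau, d = len(w[0]) - 1, len(w)
    buffer = []
    for frame in frames:
        buffer.append(frame)
        if len(buffer) == tau + 1:
            yield [sum(w[i][j] * buffer[j][i] for j in range(tau + 1))
                   for i in range(d)]
            buffer.pop(0)
    # End of stream: flush the remaining frames with implicit zero padding.
    while buffer:
        yield [sum(w[i][j] * buffer[j][i] for j in range(len(buffer)))
               for i in range(d)]
        buffer.pop(0)


# Hypothetical example: one feature per frame, future context size of 2.
frames = [[1.0], [2.0], [3.0], [4.0]]
weights = [[1.0, 0.5, 0.25]]
offline = row_convolution(frames, weights)
streamed = list(streaming_row_convolution(iter(frames), weights))
```

In the streaming variant the first output is not available until the third frame has been received, which is the temporal offset between input and output discussed in the rejection of claim 1.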


Consider claim 13, Rao discloses: facilitating the application of a unidirectional model to a stream of content by applying contexts at processing blocks in the unidirectional model (the RNN-T model consists of an encoder network, which maps input acoustic frames into a higher-level representation, and a prediction and joint network which together correspond to the decoder network. The decoder is conditioned on the history of previous predictions, i.e. contexts relative to the historical predictions, and is therefore considered a unidirectional model, Fig 1, page 2); identifying a stream of content (the RNN-T is used for streaming recognition, page 2, section 1. Introduction); identifying, based 
Rao does not specifically mention a computer system, the system comprising: one or more processor(s); and one or more storage device having stored computer-executable instructions which are executable by the one or more processor(s) for causing the computer system to implement a method for processing the data stream, the method comprising applying a future context, at least one of the plurality of processing blocks being temporally offset from the particular processing block by at least one frame segment length and that is temporally aligned with a corresponding frame segment that occurs prior to a frame segment that is temporally aligned with the particular processing block, the application of the future context causing output from the unidirectional machine learning model to be temporally offset from input of the stream into the unidirectional machine learning model. 
Catanzaro discloses a computer system, the system comprising: one or more processor(s); and one or more storage device having stored computer-executable instructions which are executable by the one or more processor(s) for causing the computer system to implement a method for processing the data stream (computing system 1800, Fig 18), the method comprising applying a future context (FIG. 7, which depicts a row convolution architecture with future context size of 2, [0110]), at least one of the plurality of processing blocks being temporally offset from the particular processing block by at least one frame segment length and that is temporally aligned with a corresponding frame segment that occurs prior to a frame segment that is temporally aligned with the particular processing block, the application of the future context causing output from the unidirectional machine learning model to be temporally offset from input of the stream into the unidirectional machine learning model (the row convolution layer 710 is placed above all recurrent layers (e.g., 720).  The intuition behind this layer is that a small portion of future information is only needed to make an accurate prediction at the current time-step.  Suppose at time-step t, future contexts are used. Since the convolution-like operation in Eq. 11 is row 
It would have been obvious to one of ordinary skill in the art to modify the invention of Rao by including a computer system, the system comprising: one or more processor(s); and one or more storage device having stored computer-executable instructions which are executable by the one or more processor(s) for causing the computer system to implement a method for processing the data stream, the method comprising applying a future context, at least one of the plurality of processing blocks being temporally offset from the particular processing block by at least one frame segment length and that is temporally aligned with a corresponding frame segment that occurs prior to a frame segment that is temporally aligned with the particular processing block, the application of the future context causing output from the unidirectional machine learning model to be temporally offset from input of the stream into the unidirectional machine learning model for reasons similar to those for claim 1.

Consider claim 2, Rao discloses the segment length is a duration of time (T represents the number of frames in an utterance, page 1, section 1. Introduction). 

Consider claim 3, Rao discloses the stream of content input is a spoken utterance (an utterance, page 1, section 1. Introduction). 

Consider claim 4, Rao discloses the output is a senone classification (output label probabilities, graphemes and sub-words are output lexical units of the RNN-T models, which are senones because they are tied states within context-dependent phones that encode phone sequence information.  Page 3, Section 3. Units, Architectures and Training). 
Consider claim 5, Rao does not but Catanzaro discloses the stream of content is an audio file (streaming input comprising an ordered sequence of packets of fixed or variable length [0205], computer storage and usage using a file system [0256]). 
It would have been obvious to one of ordinary skill in the art to modify the invention of Rao such that the stream of content is an audio file for reasons similar to those for claim 1.

Consider claim 8, Rao discloses the machine learning processing blocks utilize a recurrent neural network (RNN) (RNN-Transducer, page 2, section 2). 

Consider claim 9, Rao discloses the processing blocks are long-short term memory (LSTM) blocks (deep LSTM networks, page 3, section 3. Units, Architectures, and Training). 

Consider claim 10, Rao discloses the lower layer is a hidden layer between initial layer and the particular layer (The encoder network is pre-trained as a hierarchical-CTC network simultaneously predicting phonemes, graphemes and wordpieces at 5, 10 and 12 LSTM layers respectively, Fig 2, pages 3-4, Section 3. Architecture). 

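Consider claim 11, Rao does not, but Catanzaro discloses the plurality of processing blocks includes at least three processing blocks, including one processing block that is temporally offset from the particular processing block by one frame segment length and one processing block that is offset from the particular processing block by two frame segment lengths (future context size of 2, [0110], Fig 7). 
It would have been obvious to one of ordinary skill in the art to modify the invention of Rao such that the plurality of processing blocks includes at least three processing blocks, including one processing block that is temporally offset from the particular processing block by one frame segment length and one processing block that is offset from the particular processing block by two frame segment lengths for reasons similar to those for claim 1.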


Consider claim 14, Rao discloses the segment length is a duration of time (T represents the number of frames in an utterance, page 1, section 1. Introduction) and the stream of content input is a spoken utterance (an utterance, page 1, section 1. Introduction). 

Consider claim 15, Rao discloses the output is a senone classification (output label probabilities, graphemes and sub-words are output lexical units of the RNN-T models, which are senones because they are tied states within context-dependent phones that encode phone sequence information.  Page 3, Section 3. Units, Architectures and Training).

Consider claim 16, Rao does not but Catanzaro discloses the stream of content is an audio file (streaming input comprising an ordered sequence of packets of fixed or variable length [0205], computer storage and usage using a file system [0256]). 
It would have been obvious to one of ordinary skill in the art to modify the invention of Rao such that the stream of content is an audio file for reasons similar to those for claim 1.

Consider claim 17, Rao discloses the machine learning processing blocks utilize a recurrent neural network (RNN) (RNN-Transducer, page 2, section 2) and the processing blocks are long-short term memory (LSTM) blocks (deep LSTM networks, page 3, section 3. Units, Architectures, and Training).

Consider claim 18, Rao discloses the lower layer is a hidden layer between initial layer and the particular layer (The encoder network is pre-trained as a hierarchical-CTC network simultaneously predicting phonemes, graphemes and wordpieces at 5, 10 and 12 LSTM layers respectively, Fig 2, pages 3-4, Section 3. Architecture). 

Consider claim 19, Rao does not, but Catanzaro discloses the plurality of processing blocks includes at least three processing blocks, including one processing block that is temporally offset from the particular processing block by one frame segment length and one processing block that is offset from the particular processing block by two frame segment lengths (future context size of 2, [0110], Fig 7). 
It would have been obvious to one of ordinary skill in the art to modify the invention of Rao such that the plurality of processing blocks includes at least three processing blocks, including one processing block that is temporally offset from the particular processing block by one frame segment length and one processing block that is offset from the particular processing block by two frame segment lengths for reasons similar to those for claim 1.

Claims 6 and 7 are rejected under 35 U.S.C. 103 as being unpatentable over Rao et al. (“Exploring Architectures, Data, and Units for Streaming End-to-End Speech Recognition with RNN-Transducer”, in Proc. ASRU 2017) in view of Catanzaro et al. (2017/0148433), in further view of Such et al. (2018/0137349).

Consider claim 6, Rao and Catanzaro do not, but Such discloses the stream of content is textual data (stream of text, [0079]).


Consider claim 7, Rao and Catanzaro do not, but Such discloses the stream of content is handwritten data (stream of handwritten text, [0079]).
It would have been obvious to one of ordinary skill in the art to modify the invention of Rao and Catanzaro such that the stream of content is handwritten data for reasons similar to those for claim 6.

Claims 12 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Rao et al. (“Exploring Architectures, Data, and Units for Streaming End-to-End Speech Recognition with RNN-Transducer”, in Proc. ASRU 2017) in view of Catanzaro et al. (2017/0148433), in further view of Guevara et al. (2020/0020322).

Consider claim 12, Rao does not, but Catanzaro discloses the plurality of processing blocks includes at least four processing blocks, including at least one processing block that is temporally offset from the particular processing block by at least two segment lengths (future context size of 2, [0110], Fig 7).
It would have been obvious to one of ordinary skill in the art to modify the invention of Rao such that the plurality of processing blocks includes at least four processing blocks, including at least one processing block that is temporally offset from the particular processing block by at least two segment lengths for reasons similar to those for claim 1.
Rao and Catanzaro do not specifically mention a processing block that is offset from the particular processing block by at least three frame segment lengths.
Guevara discloses a processing block that is offset from the particular processing block by at least three frame segment lengths ([0029], Fig 3A).
It would have been obvious to one of ordinary skill in the art to modify the invention of Rao and Catanzaro to include a processing block that is offset from the particular processing block by at least three frame segment lengths in order to address the complexities of the interconnected components necessary for hotword detection, as suggested by Guevara ([0003]).

Consider claim 20, Rao does not, but Catanzaro discloses the plurality of processing blocks includes at least four processing blocks, including at least one processing block that is temporally offset from the particular processing block by at least two segment lengths (future context size of 2, [0110], Fig 7).
It would have been obvious to one of ordinary skill in the art to modify the invention of Rao such that the plurality of processing blocks includes at least four processing blocks, including at least one processing block that is temporally offset from the particular processing block by at least two segment lengths for reasons similar to those for claim 1.
Rao and Catanzaro do not specifically mention a processing block that is offset from the particular processing block by at least three frame segment lengths.
Guevara discloses a processing block that is offset from the particular processing block by at least three frame segment lengths ([0029], Fig 3A).
It would have been obvious to one of ordinary skill in the art to modify the invention of Rao and Catanzaro to include a processing block that is offset from the particular processing block by at least three frame segment lengths for reasons similar to those for claim 12.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
2018/0165572 Yoo discloses speech recognition by extracting target data corresponding to a current window and padding data subsequent to the target data from sequence data; acquiring a state parameter corresponding to a previous window; and calculating a recognition result for the current window based on the state parameter, the extracted target data, and the extracted padding data using a recurrent model.
Sak et al. (“Recurrent Neural Aligner: An Encoder-Decoder Neural Network Model for Sequence to Sequence Mapping”. Interspeech 2017, August 20-24) discloses an encoder-decoder recurrent neural network model called Recurrent Neural Aligner (RNA) that can be used for sequence to sequence mapping tasks.
He et al. (“Streaming End-to-End Speech Recognition for Mobile Devices”. ICASSP 2019, May 12-May 17, 2019) discloses building an E2E speech recognizer using a recurrent neural network transducer.
Moritz et al. (“Triggered Attention for End-to-End Speech Recognition”. ICASSP 2019, May 12-May 17, 2019) discloses end-to-end automatic speech recognition (ASR) that combines the alignment capabilities of the connectionist temporal classification (CTC) approach and the modeling strength of the attention mechanism.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Jesse Pullias whose telephone number is 571/270-5135. The examiner can normally be reached on M-F 8:00 AM - 4:30 PM. The examiner’s fax number is 571/270-6135.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Dan Washburn, can be reached at 571/272-5551.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).


/Jesse S Pullias/
Primary Examiner, Art Unit 2657                                                  02/22/21