Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

DETAILED ACTION
This Office action is in response to the correspondence of 05/25/21 regarding application 16/516,849, in which claims 1-2, 4, 8, 10-14, and 17-20 were amended. Claims 1-20 are pending in the application and have been considered.

Response to Arguments
Amended claims 4 and 10 overcome the objections for minor informalities, and so those objections are withdrawn.
Applicant’s arguments on pages 7-12 regarding the 35 U.S.C. 103 rejections based on Rao, Catanzaro, Such, and Guevara have been considered but are moot in view of the new grounds of rejection, necessitated by Applicant’s amendments.

Specification
In paragraph [0017] on page 7, should “temporally proceeds” be “temporally precedes”?


Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory 
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains.  Patentability shall not be negated by the manner in which the invention was made.


Claims 1-5, 8-11, and 13-19 are rejected under 35 U.S.C. 103 as being unpatentable over Rao et al. (“Exploring Architectures, Data, and Units for Streaming End-to-End Speech Recognition with RNN-Transducer”, in Proc. ASRU 2017. 2 January 2018) in view of Li et al. (“Improving Layer Trajectory LSTM with Future Context Frames”. ICASSP, 12-17 May, 2019, Brighton, UK, pages 6550-6554).

Consider claim 1, Rao discloses a method for facilitating application of a unidirectional machine learning model to a stream of content (the RNN-T model consists of an encoder network, which maps input acoustic frames into a higher-level representation, and a prediction and joint network which together correspond to the decoder network. The decoder is conditioned on the history of previous predictions, i.e. contexts relative to the historical predictions, and is therefore considered a unidirectional model, Fig 1, page 2), the method comprising: identifying a stream of content (the RNN-T is used for streaming recognition, page 2, Section 1. Introduction); identifying, based on a predetermined segment length, one or more frame segments of content in the stream and prior to receiving all frame segments in the stream (during each step of inference, the RNN-T model is fed the next acoustic frame, page 3, section 2: RNN-Transducer); accessing the unidirectional machine learning model that includes a plurality of processing layers, including an initial input layer, one or more hidden 
Rao does not specifically mention: for at least one processing block in each hidden layer: (i) applying output of a previous processing block within a same hidden layer as input to the at least one processing block and (ii) applying an embedding vector as additional input to the at least one processing block, wherein the embedding vector is generated based on output from a plurality of processing blocks in a lower processing layer that is hierarchically lower than the corresponding hidden layer.
Li discloses for at least one processing block in each hidden layer: (i) applying output of a previous processing block within a same hidden layer as input to the at least one processing block and (ii) applying an embedding vector as additional input to the at least one processing block, wherein the embedding vector is generated based on output from a plurality of processing blocks in a lower processing layer that is hierarchically lower than the corresponding hidden layer (lookahead embedding through linear transform, section 3.1, Fig 3, pages 6551-6552).
It would have been obvious to one of ordinary skill in the art to modify the invention of Rao such that for at least one processing block in each hidden layer: (i) applying output of a previous processing block within a same hidden layer as input to the at least one processing block and (ii) applying an 


Consider claim 13, Rao discloses: facilitating application of a unidirectional model to a stream of content by (the RNN-T model consists of an encoder network, which maps input acoustic frames into a higher-level representation, and a prediction and joint network which together correspond to the decoder network. The decoder is conditioned on the history of previous predictions, i.e. contexts relative to the historical predictions, and is therefore considered a unidirectional model, Fig 1, page 2); identifying a data stream (the RNN-T is used for streaming recognition, page 2, section 1. Introduction); identifying, based on a predetermined segment length, one or more frame segments of content in the stream and prior to receiving an entirety of the data stream (during each step of inference, the RNN-T model is fed the next acoustic frame, page 3, section 2: RNN-Transducer); accessing a unidirectional machine learning model that includes a plurality of processing layers, including an initial input layer and one or more hidden layers and a final output layer, each processing layer in the one or more hidden layers including a plurality of processing blocks that are sequentially positioned within each layer to provide output to a subsequent processing block and to receive input from a preceding processing block according to processing rules of the unidirectional machine learning model, where each processing block in the initial input layer receives an initial input as a frame segment from the stream (The encoder network is pre-trained as a hierarchical-CTC network simultaneously predicting phonemes, graphemes and wordpieces at 5, 10 and 12 LSTM layers respectively. … The decoder network is trained as a LSTM language model predicting wordpieces optimized with a cross-entropy loss. Finally, the RNN-T network 
Rao does not specifically mention a computer system, the system comprising: One or more processor(s); and One or more storage device having stored computer-executable instructions which are executable by the one or more processor(s) for configuring the computer system to: for at least one processing block in each hidden layer: (i) applying output of a previous processing block within a same hidden layer as input to the at least one processing block and (ii) applying an embedding vector as additional input to the at least one processing block, wherein the embedding vector is generated based on output from a plurality of processing blocks in a lower processing layer that is hierarchically lower than the corresponding hidden layer.
Li discloses a computer system, the system comprising: One or more processor(s); and One or more storage device having stored computer-executable instructions (implicit in running Microsoft Cortana, page 6552, section 4. Experiments) which are executable by the one or more processor(s) for configuring the computer system to: for at least one processing block in each hidden layer: (i) applying output of a previous processing block within a same hidden layer as input to the at least one processing block and (ii) applying an embedding vector as additional input to the at least one processing block, wherein the embedding vector is generated based on output from a plurality of processing blocks in a lower processing layer that is hierarchically lower than the corresponding hidden layer (lookahead embedding through linear transform, section 3.1, Fig 3, pages 6551-6552).
It would have been obvious to one of ordinary skill in the art to modify the invention of Rao by including a computer system, the system comprising: One or more processor(s); and One or more storage device having stored computer-executable instructions which are executable by the one or more processor(s) for configuring the computer system to: for at least one processing block in each 


Consider claim 2, Rao discloses a frame segment length is a duration of time (T represents the number of frames in an utterance, page 1, section 1. Introduction). 
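For illustration only, and not as a characterization of Rao or of the claimed invention, the notion of identifying frame segments of a predetermined duration before the entire stream has been received can be sketched as follows. The function name frame_segments, its parameters, and the choice to drop a trailing partial segment are all hypothetical assumptions made for this sketch.

```python
from typing import Iterable, Iterator, List


def frame_segments(
    samples: Iterable[float], sample_rate_hz: int, segment_ms: int
) -> Iterator[List[float]]:
    """Yield fixed-duration frame segments as soon as each one is complete.

    Hypothetical helper: the segment length is a duration of time
    (segment_ms) converted to a sample count. Segments are emitted
    incrementally, so the caller never needs the whole stream up front.
    """
    segment_len = sample_rate_hz * segment_ms // 1000
    if segment_len <= 0:
        raise ValueError("segment duration is shorter than one sample")

    buffer: List[float] = []
    for sample in samples:
        buffer.append(sample)
        if len(buffer) == segment_len:
            yield buffer
            buffer = []
    # Assumption for this sketch: a trailing partial segment is dropped.


# Usage: a 1 kHz stream cut into 3 ms segments (3 samples each).
segments = list(frame_segments(range(10), sample_rate_hz=1000, segment_ms=3))
```

Because the helper is a generator, each segment becomes available to a downstream model as soon as its last sample arrives.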

Consider claim 3, Rao discloses the stream of content input is a spoken utterance (an utterance, page 1, section 1. Introduction). 

Consider claim 4, Rao discloses the output is a senone classification (output label probabilities, graphemes and sub-words are output lexical units of the RNN-T models, which are senones because they are tied states within context-dependent phones that encode phone sequence information.  Page 3, Section 3. Units, Architectures and Training). 
Consider claim 5, Rao does not, but Li implies, or at least suggests, that the stream of content is an audio file (test sets in a Microsoft system suggest the use of files, Section 4. Experiments, page 6552).
It would have been obvious to one of ordinary skill in the art to modify the invention of Rao such that the stream of content is an audio file for reasons similar to those for claim 1.



Consider claim 9, Rao discloses the processing blocks are long-short term memory (LSTM) blocks (deep LSTM networks, page 3, section 3. Units, Architectures, and Training). 

Consider claim 10, Rao discloses the lower processing layer is a hidden layer between the initial input layer and the corresponding hidden layer (The encoder network is pre-trained as a hierarchical-CTC network simultaneously predicting phonemes, graphemes and wordpieces at 5, 10 and 12 LSTM layers respectively, Fig 2, pages 3-4, Section 3. Architecture). 

Consider claim 11, Rao does not, but Li discloses the plurality of processing blocks includes at least three processing blocks, including one processing block that is temporally offset from the at least one processing block by one frame segment length and one processing block that is offset from the at least one processing block by two frame segment lengths (page 6552, Fig 3). 
It would have been obvious to one of ordinary skill in the art to modify the invention of Rao such that the plurality of processing blocks includes at least three processing blocks, including one processing block that is temporally offset from the at least one processing block by one frame segment length and one processing block that is offset from the at least one processing block by two frame segment lengths for reasons similar to those for claim 1.
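For illustration only, the limitation discussed for claims 1 and 11 (a block receiving both the previous block's output from the same layer and an embedding built from several lower-layer blocks at offsets of zero, one, and two frame segments) can be sketched as follows. This is a simplified assumption, not the method of Li or of the application: scalar weights stand in for the linear transforms, a tanh recurrence stands in for an LSTM block, and all names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def lookahead_embedding(
    lower: Sequence[Vector], t: int, weights: Sequence[float]
) -> Vector:
    """Combine lower-layer outputs at frames t, t+1, ..., t+len(weights)-1.

    Assumption: each offset gets one scalar weight, and offsets that run
    past the end of the available frames are clamped to the last frame.
    """
    dim = len(lower[0])
    embedding = [0.0] * dim
    for offset, weight in enumerate(weights):
        idx = min(t + offset, len(lower) - 1)
        for d in range(dim):
            embedding[d] += weight * lower[idx][d]
    return embedding


def run_hidden_layer(
    lower: Sequence[Vector], weights: Sequence[float], recur: float = 0.5
) -> List[Vector]:
    """Process frames left to right (unidirectional).

    Each block takes (i) the previous block's output in the same layer
    and (ii) the embedding computed from the lower layer.
    """
    prev = [0.0] * len(lower[0])
    outputs: List[Vector] = []
    for t in range(len(lower)):
        emb = lookahead_embedding(lower, t, weights)
        prev = [math.tanh(recur * p + e) for p, e in zip(prev, emb)]
        outputs.append(prev)
    return outputs


# Usage: three lower-layer blocks (offsets 0, 1, 2) feed each hidden block.
lower_outputs = [[1.0], [2.0], [3.0]]
hidden = run_hidden_layer(lower_outputs, weights=[0.5, 0.25, 0.25])
```

The recurrence only ever looks backward within the layer; the limited future context enters solely through the embedding taken from the lower layer.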

Consider claim 14, Rao discloses a frame segment length is a duration of time (T represents the number of frames in an utterance, page 1, section 1. Introduction) and the stream of content input is a spoken utterance (an utterance, page 1, section 1. Introduction). 

Consider claim 15, Rao discloses the output is a senone classification (output label probabilities, graphemes and sub-words are output lexical units of the RNN-T models, which are senones because they are tied states within context-dependent phones that encode phone sequence information.  Page 3, Section 3. Units, Architectures and Training).

Consider claim 16, Rao does not, but Li implies, or at least suggests, that the stream of content is an audio file (test sets in a Microsoft system suggest the use of files, Section 4. Experiments, page 6552).
It would have been obvious to one of ordinary skill in the art to modify the invention of Rao such that the stream of content is an audio file for reasons similar to those for claim 1.

Consider claim 17, Rao discloses the plurality of processing blocks utilize a recurrent neural network (RNN) (RNN-Transducer, page 2, section 2) and the processing blocks are long-short term memory (LSTM) blocks (deep LSTM networks, page 3, section 3. Units, Architectures, and Training).

Consider claim 18, Rao discloses the lower processing layer is a hidden layer between the initial input layer of the unidirectional machine learning model and the corresponding hidden layer (The encoder network is pre-trained as a hierarchical-CTC network simultaneously predicting phonemes, graphemes and wordpieces at 5, 10 and 12 LSTM layers respectively, Fig 2, pages 3-4, Section 3. Architecture). 

Consider claim 19, Rao does not, but Li discloses the plurality of processing blocks includes at least three processing blocks, including one processing block that is temporally offset from the at least 
It would have been obvious to one of ordinary skill in the art to modify the invention of Rao such that the plurality of processing blocks includes at least three processing blocks, including one processing block that is temporally offset from the at least one processing block by one frame segment length and one processing block that is offset from the at least one processing block by two frame segment lengths for reasons similar to those for claim 1.

Claims 6 and 7 are rejected under 35 U.S.C. 103 as being unpatentable over Rao et al. (“Exploring Architectures, Data, and Units for Streaming End-to-End Speech Recognition with RNN-Transducer”, in Proc. ASRU 2017) in view of Li et al. (“Improving Layer Trajectory LSTM with Future Context Frames”. ICASSP, 12-17 May, 2019, Brighton, UK, pages 6550-6554), in further view of Such et al. (2018/0137349).

Consider claim 6, Rao and Li do not, but Such discloses the stream of content is textual data (stream of text, [0079]).
It would have been obvious to one of ordinary skill in the art to modify the invention of Rao and Li such that the stream of content is textual data in order to improve handwriting recognition, as suggested by Such ([0003]-[0004]).

Consider claim 7, Rao and Li do not, but Such discloses the stream of content is handwritten data (stream of handwritten text, [0079]).
It would have been obvious to one of ordinary skill in the art to modify the invention of Rao and Li such that the stream of content is handwritten data for reasons similar to those for claim 6.

Claims 12 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Rao et al. (“Exploring Architectures, Data, and Units for Streaming End-to-End Speech Recognition with RNN-Transducer”, in Proc. ASRU 2017) in view of Li et al. (“Improving Layer Trajectory LSTM with Future Context Frames”. ICASSP, 12-17 May, 2019, Brighton, UK, pages 6550-6554), in further view of Guevara et al. (2020/0020322).

Consider claim 12, Rao does not, but Li discloses the plurality of processing blocks includes at least four processing blocks, including at least one processing block that is temporally offset from the at least one processing block by at least two segment lengths (page 6552, Fig 3).
It would have been obvious to one of ordinary skill in the art to modify the invention of Rao such that the plurality of processing blocks includes at least four processing blocks, including at least one processing block that is temporally offset from the at least one processing block by at least two segment lengths for reasons similar to those for claim 1.
Rao and Li do not specifically mention a processing block that is offset from the at least one processing block by at least three frame segment lengths.
Guevara discloses a processing block that is offset from the at least one processing block by at least three frame segment lengths ([0029], Fig 3A).
It would have been obvious to one of ordinary skill in the art to modify the invention of Rao and Li to include a processing block that is offset from the at least one processing block by at least three frame segment lengths in order to address the complexities of the interconnected components necessary for hotword detection, as suggested by Guevara ([0003]).


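Consider claim 20, Rao does not, but Li discloses the plurality of processing blocks includes at least four processing blocks, including at least one processing block that is temporally offset from the at least one processing block by at least two segment lengths (page 6552, Fig 3).
It would have been obvious to one of ordinary skill in the art to modify the invention of Rao such that the plurality of processing blocks includes at least four processing blocks, including at least one processing block that is temporally offset from the at least one processing block by at least two segment lengths for reasons similar to those for claim 1.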
Rao and Li do not specifically mention a processing block that is offset from the at least one processing block by at least three frame segment lengths.
Guevara discloses a processing block that is offset from the at least one processing block by at least three frame segment lengths ([0029], Fig 3A).
It would have been obvious to one of ordinary skill in the art to modify the invention of Rao and Li to include a processing block that is offset from the particular processing block by at least three frame segment lengths for reasons similar to those for claim 12.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Jesse Pullias whose telephone number is 571/270-5135. The examiner can normally be reached on M-F 8:00 AM - 4:30 PM. The examiner’s fax number is 571/270-6135.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Dan Washburn, can be reached at 571/272-5551.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).


/Jesse S Pullias/
Primary Examiner, Art Unit 2657                                                  06/03/21