DETAILED ACTION

Notice of AIA Status
The present application, filed on or after October 21, 2020, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement
The information disclosure statements (IDS) submitted on 05/25/2021 and 12/14/2021 have been considered by the examiner.
Examiner’s note: None of the non-patent literature documents was cited in the required format. Applicant is advised to follow the required format, which is as follows: name of the author (in CAPITAL LETTERS), title of the article (when appropriate), title of the item (book, magazine, journal, serial, symposium, catalog, etc.), date, page(s), volume-issue number(s), publisher, and city and/or country where published.

Abstract
The abstract of the disclosure is objected to because of the following informalities:
The abstract should be in narrative form and generally limited to a single paragraph, preferably within the range of 50 to 150 words (MPEP § 608.01(b)). The abstract of the disclosure consists of 162 words, which exceeds this range. Appropriate correction is required.


Drawings
The drawings are objected to as failing to comply with 37 CFR 1.84(p)(5) because they include reference characters not mentioned in the description: the reference characters shown in Fig. 9 do not match those used in paragraphs [00124] through [00132]. For example, line 1 of paragraph [00124] recites “Figure 9 is a block diagram illustrating a simplified example implementation of a computing system 100”, whereas Fig. 9 labels the computing system “1100”. As another example, line 1 of paragraph [00129] recites “The computing system 100 may include one or more network interfaces 122”, whereas in Fig. 9 the network interface is labeled “1122”. Appropriate correction is required.





Specification
In paragraph [0007], line 5, and paragraph [0035], line 7, the term “transformer-based ARS system” should be changed to “transformer-based ASR system” to correct the typographical error and keep the terminology consistent. Appropriate correction is required.


Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:

A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale or otherwise available to the public before the effective filing date of the claimed invention.

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1-20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Steedman et al. (US 20210141799 A1), hereinafter referenced as Steedman.

Regarding Claim 1, Steedman teaches a computer implemented method for automatic speech recognition using a transformer-based ASR system (Para.[0059],[0132]), comprising:
obtaining a first speech sequence that comprises a first set of speech frame feature vectors that each represent a respective speech frame that corresponds to a respective time step (Para.[0036]-[0039]); 
processing the first speech sequence, using a time reduction operation of an encoder NN (Para.[0245]),
into a second speech sequence that comprises a second set of speech frame feature vectors that each concatenate information from a respective plurality of the speech frame feature vectors included in the first set (Para.[0134]),
wherein the second speech sequence includes fewer speech frame feature vectors than the first speech sequence (Para.[0135]);
transforming the second speech sequence, using a self-attention operation of the encoder NN, into a third speech sequence that comprises a third set of speech frame feature vectors (Para.[0178]);
processing the third speech sequence, using a probability operation of the encoder NN, to predict a sequence of first labels corresponding to the third set of speech frame feature vectors (Para.[0177],[0178]); and
processing the third speech sequence using a decoder NN to predict a sequence of second labels corresponding to the third set of speech frame feature vectors (Para.[0177],[0178]).
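
For illustration only, the time reduction limitation recited above (each vector of the second set concatenates information from a plurality of vectors of the first set, so the second sequence is shorter) can be sketched as follows. This sketch is not taken from Steedman or from the application; the function name `time_reduce`, the grouping factor, and the handling of leftover frames are all assumptions.

```python
from typing import List

Frame = List[float]


def time_reduce(frames: List[Frame], factor: int = 2) -> List[Frame]:
    """Concatenate each group of `factor` consecutive frames into one vector.

    Assumption of this sketch: trailing frames that do not fill a complete
    group are dropped (padding would be an equally plausible choice).
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    usable = len(frames) - len(frames) % factor
    reduced: List[Frame] = []
    for start in range(0, usable, factor):
        merged: Frame = []
        for frame in frames[start:start + factor]:
            merged.extend(frame)  # concatenation, not averaging
        reduced.append(merged)
    return reduced


# Hypothetical first speech sequence: 5 time steps, 2 features per frame.
first_sequence = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]
second_sequence = time_reduce(first_sequence, factor=2)
# second_sequence has 2 frames, each of length 4
```

With a factor of 2, the sequence length is roughly halved while the per-frame dimension doubles, which is the relationship the two "wherein" clauses describe.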

Regarding Claim 2, Steedman teaches the method of claim 1. Steedman further discloses, during a training stage of the encoder NN and the decoder NN (Para.[0133]):
computing a loss function based on the predicted sequence of first labels and the predicted sequence of second labels (Para.[0194]); and
performing back propagation using gradient descent to update learnable parameters of the decoder NN and the encoder NN to reduce the loss function (Para.[0194], [0222]).
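
As a rough illustration of the training limitation above, the toy sketch below computes one loss from two sets of predictions and takes a single gradient descent step that reduces it. It is an assumption-laden simplification: the weighted sum of two cross-entropy terms, the `weight` and `lr` values, and the treatment of the logits themselves as the learnable parameters are hypothetical and are not drawn from Steedman or from the application.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: List[float], target: int) -> float:
    return -math.log(softmax(logits)[target])


def joint_loss(enc_logits: List[float], dec_logits: List[float],
               target: int, weight: float = 0.3) -> float:
    """Weighted sum of the encoder-side and decoder-side losses (assumed form)."""
    return (weight * cross_entropy(enc_logits, target)
            + (1.0 - weight) * cross_entropy(dec_logits, target))


def gradient_step(enc_logits: List[float], dec_logits: List[float], target: int,
                  weight: float = 0.3, lr: float = 0.5
                  ) -> Tuple[List[float], List[float]]:
    """One gradient descent update; d(cross-entropy)/d(logits) = softmax - one_hot."""
    updated = []
    for logits, scale in ((enc_logits, weight), (dec_logits, 1.0 - weight)):
        probs = softmax(logits)
        grad = [scale * (p - (1.0 if i == target else 0.0))
                for i, p in enumerate(probs)]
        updated.append([x - lr * g for x, g in zip(logits, grad)])
    return updated[0], updated[1]


enc, dec = [0.0, 0.0, 0.0], [0.5, -0.5, 0.0]
loss_before = joint_loss(enc, dec, target=1)
enc, dec = gradient_step(enc, dec, target=1)
loss_after = joint_loss(enc, dec, target=1)  # smaller than loss_before
```

In a real network the gradients would be propagated back through every layer rather than applied directly to the logits; the sketch only shows the "single loss from both predictions, then descend" structure.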

Regarding claim 12, claim 12 is similar in scope and content to claim 2, and therefore, is rejected under similar rationale.

Regarding Claim 3, Steedman teaches the method of claim 2. Steedman further discloses, during an inference stage, computing a sequence of labels for the third speech sequence based on the predicted sequence of first labels and the predicted sequence of second labels (Para.[0177],[0178]).
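
One conceivable way to compute a single label sequence from two predicted sequences at inference is sketched below. It assumes, purely for illustration, that both predictions are available as per-position probability distributions of equal length with strictly positive entries, and that they are combined by a weighted sum of log-probabilities. The name `combine_labels` and the `weight` parameter are hypothetical; neither Steedman nor the application is asserted to combine predictions this way.

```python
import math
from typing import List


def combine_labels(first_probs: List[List[float]],
                   second_probs: List[List[float]],
                   weight: float = 0.5) -> List[int]:
    """Pick, per position, the label with the best interpolated log-probability."""
    labels = []
    for p1, p2 in zip(first_probs, second_probs):
        scores = [weight * math.log(a) + (1.0 - weight) * math.log(b)
                  for a, b in zip(p1, p2)]
        labels.append(max(range(len(scores)), key=scores.__getitem__))
    return labels


# Hypothetical distributions over 3 labels at 2 positions.
first = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
second = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
labels = combine_labels(first, second)
# labels == [0, 2]
```

At the second position the two predictions disagree (label 1 versus label 2), and the combined score settles on label 2.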

Regarding claim 13, claim 13 is similar in scope and content to claim 3, and therefore, is rejected under similar rationale.

Regarding Claim 4, Steedman teaches the method of claim 1. Steedman further discloses wherein obtaining the first speech sequence comprises: obtaining an input speech sequence that comprises an input set of speech frame feature vectors that each represent a respective speech frame that corresponds to a respective time step (Para.[0036]-[0039]); and
processing the input speech sequence, using a subsampling operation of the encoder NN, into the first speech sequence, wherein the first speech sequence includes fewer speech frame feature vectors than the input speech sequence (Para.[0245]).
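
By way of contrast with the concatenating time reduction of claim 1, a subsampling operation can shorten a sequence simply by discarding frames. The sketch below keeps every `stride`-th frame; the function name `subsample` and the stride value are assumptions made for illustration (a strided convolution would be another common form of subsampling), and the sketch is not taken from either the reference or the application.

```python
from typing import List

Frame = List[float]


def subsample(frames: List[Frame], stride: int = 2) -> List[Frame]:
    """Keep every `stride`-th frame; frame dimension is unchanged."""
    if stride < 1:
        raise ValueError("stride must be >= 1")
    return frames[::stride]


# Hypothetical input speech sequence: 7 time steps, 2 features per frame.
input_sequence = [[float(t), float(t) + 0.5] for t in range(7)]
first_sequence = subsample(input_sequence, stride=3)
# keeps time steps 0, 3 and 6
```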

Regarding claim 14, claim 14 is similar in scope and content to claim 4, and therefore, is rejected under similar rationale.

Regarding Claim 5, Steedman teaches the method of claim 1. Steedman further discloses wherein the time reduction operation is performed using one or more linear NN layers of the encoder NN (Para.[0245]).

Regarding Claim 6, Steedman teaches the method of claim 1. Steedman further discloses wherein obtaining the first speech sequence comprises: obtaining an initial speech sequence that comprises an initial set of speech frame feature vectors that each represent a respective speech frame that corresponds to a respective time step (Para.[0036]-[0039]);
and processing the initial speech sequence, using a further self-attention operation of the encoder NN that precedes the time reduction operation, into the first speech sequence (Para.[0139]-[0142]).

Regarding claim 15, claim 15 is similar in scope and content to claim 6, and therefore, is rejected under similar rationale.

Regarding Claim 7, Steedman teaches the method of claim 6. Steedman further discloses wherein obtaining the initial speech sequence comprises: obtaining an input speech sequence that comprises an input set of speech frame feature vectors that each represent a respective speech frame that corresponds to a respective time step (Para.[0036]-[0039]);
and processing the input speech sequence, using a subsampling operation of the encoder NN, into the initial speech sequence, wherein the initial speech sequence includes fewer speech frame feature vectors than the input speech sequence (Para.[0245]).

Regarding claim 16, claim 16 is similar in scope and content to claim 7, and therefore, is rejected under similar rationale.

Regarding Claim 8, Steedman teaches the method of claim 6. Steedman further discloses wherein the self-attention operation and the further self-attention operation are each performed by respective sub-networks of self-attention layers (Para.[0139]-[0142]).

Regarding claim 17, claim 17 is similar in scope and content to claim 8, and therefore, is rejected under similar rationale.

Regarding Claim 9, Steedman teaches the method of claim 8. Steedman further discloses using a respective number of self-attention layers for each of the self-attention operation and the further self-attention operation based on obtained hyperparameters (Para.[0139]-[0142]).
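
To make the limitations of claims 8 and 9 concrete, the sketch below builds two sub-networks of self-attention layers whose depths are set by separate hyperparameters. It is a deliberately reduced illustration: each layer is plain scaled dot-product self-attention with identity query/key/value projections and no residual or feed-forward parts, and the names `self_attention_layer`, `attention_subnetwork`, `LAYERS_BEFORE_REDUCTION`, and `LAYERS_AFTER_REDUCTION` are hypothetical rather than taken from the reference or the application.

```python
import math
from typing import List

Frame = List[float]


def self_attention_layer(frames: List[Frame]) -> List[Frame]:
    """Scaled dot-product self-attention with identity projections (assumed)."""
    dim = len(frames[0])
    output: List[Frame] = []
    for query in frames:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in frames]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Each output frame is an attention-weighted average of all frames.
        output.append([sum(w * value[d] for w, value in zip(weights, frames))
                       for d in range(dim)])
    return output


def attention_subnetwork(frames: List[Frame], num_layers: int) -> List[Frame]:
    """Apply `num_layers` self-attention layers in sequence."""
    for _ in range(num_layers):
        frames = self_attention_layer(frames)
    return frames


# Hypothetical hyperparameters: depth of each sub-network.
LAYERS_BEFORE_REDUCTION = 1
LAYERS_AFTER_REDUCTION = 2

initial_sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
first_sequence = attention_subnetwork(initial_sequence, LAYERS_BEFORE_REDUCTION)
third_sequence = attention_subnetwork(first_sequence, LAYERS_AFTER_REDUCTION)
```

For brevity the time reduction step that would sit between the two sub-networks is omitted here, so the sequence length is unchanged throughout.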

Regarding claim 18, claim 18 is similar in scope and content to claim 9, and therefore, is rejected under similar rationale.
 
Regarding Claim 10, Steedman discloses an automatic speech recognition computer system comprising: 
a storage storing executable instructions (Para.[0055], [0061]); 
and a processing device (Para.[0056]) in communication with the storage, the processing device configured to execute the instructions (Para.[0057],[0061]) to cause the computing system to:
obtain a first speech sequence that comprises a first set of speech frame feature vectors that each represent a respective speech frame that corresponds to a respective time step (Para.[0036]-[0039]);
process the first speech sequence, using a time reduction operation (Para.[0245]),
into a second speech sequence that comprises a second set of speech frame feature vectors that each concatenate information from a respective plurality of the speech frame feature vectors included in the first set (Para.[0134]), 
wherein the second speech sequence includes fewer speech frame feature vectors than the first speech sequence (Para.[0135]); 
transform the second speech sequence, using a self-attention operation, into a third speech sequence that comprises a third set of speech frame feature vectors (Para.[0178]); 
process the third speech sequence, using a probability operation, to predict a sequence of first labels corresponding to the third set of speech frame feature vectors (Para.[0177],[0178]);  
and  process the third speech sequence using a further self-attention operation and a further probability operation to predict a sequence of second labels corresponding to the third set of speech frame feature vectors (Para.[0177],[0178]).
  
Regarding Claim 11, Steedman teaches the system of claim 10. Steedman further discloses
wherein the processing device is configured to execute the instructions to cause the computing system to implement an encoder neural network (NN) and a decoder NN (Para.[0133]),
wherein the time reduction operation, self-attention operation and probability operation are each performed using respective sub-networks of the encoder NN (Para.[0245], [0177], [0178]),
and the further self-attention operation and further probability operation are each performed using respective sub-networks of the decoder NN (Para.[0177], [0178]).

Regarding Claim 19, Steedman discloses a computer readable medium that stores computer instructions that (Para.[0055], [0061]),
when executed by a processing device of a computer system cause the computer system (Para.[0057],[0061]) 
to: obtain a first speech sequence that comprises a first set of speech frame feature vectors that each represent a respective speech frame that corresponds to a respective time step (Para.[0036]-[0039]);
process the first speech sequence, using a time reduction operation of an encoder NN (Para.[0245]), 
into a second speech sequence that comprises a second set of speech frame feature vectors that each concatenate information from a respective plurality of the speech frame feature vectors included in the first set (Para.[0134]),
wherein the second speech sequence includes fewer speech frame feature vectors than the first speech sequence (Para.[0135]);
transform the second speech sequence, using a self-attention operation of the encoder NN, into a third speech sequence that comprises a third set of speech frame feature vectors (Para.[0178]);
process the third speech sequence, using a probability operation of the encoder NN, to predict a sequence of first labels corresponding to the third set of speech frame feature vectors (Para.[0177],[0178]);  
and process the third speech sequence using a decoder NN to predict a sequence of second labels corresponding to the third set of speech frame feature vectors(Para.[0177],[0178]).    

Regarding Claim 20, Steedman discloses an automated speech recognition system comprising: an encoder neural network (Para. [0132]) 
configured to process a first speech sequence that comprises a first set of speech frame feature vectors that each represent a respective speech frame that corresponds to a respective time step (Para.[0036]-[0039]),  
the encoder neural network implementing: a time reduction operation transforming the first speech sequence into a second speech sequence that comprises a second set of speech frame feature vectors that each concatenate information from a respective plurality of the speech frame feature vectors included in the first set (Para.[0245], [0134]),
wherein the second speech sequence includes fewer speech frame feature vectors than the first speech sequence (Para.[0135]); 
a self-attention operation transforming the second speech sequence, using a self-attention mechanism, into a third speech sequence that comprises a third set of speech frame feature vectors (Para.[0178]); 
a probability operation predicting a sequence of first labels corresponding to the third set of speech frame feature vectors(Para.[0177],[0178]); 
and a decoder neural network processing the third speech sequence to predict a sequence of second labels corresponding to the third set of speech frame feature vectors (Para.[0177],[0178]).


Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure and is listed below.
Jaitly et al. (US 20190236451 A1): A speech recognition neural network system includes an encoder neural network and a decoder neural network. The encoder neural network generates an encoded sequence from an input acoustic sequence that represents an utterance. The input acoustic sequence includes a respective acoustic feature representation at each of a plurality of input time steps, the encoded sequence includes a respective encoded representation at each of a plurality of time reduced time steps, and the number of time reduced time steps is less than the number of input time steps. ... (Abstract).
Peyser et al. (US 20200349922 A1): A method for generating final transcriptions representing numerical sequences of utterances in a written domain includes receiving audio data for an utterance containing a numeric sequence, and decoding, using a sequence-to-sequence speech recognition model, the audio data for the utterance to generate, as output from the sequence-to-sequence speech recognition model, an intermediate transcription of the utterance. ... (Abstract).
Fan et al. (US 10923111 B1): A system configured to recognize text represented by speech may determine that a first portion of audio data corresponds to speech from a first speaker and that a second portion of audio data corresponds to speech from the first speaker and a second speaker. Features of the first portion are compared to features of the second portion to determine a similarity therebetween. ... (Abstract).
Zhao et al. (US 20200357388 A1): A method includes receiving audio data encoding an utterance, processing, using a speech recognition model, the audio data to generate speech recognition scores for speech elements, and determining context scores for the speech elements based on context data indicating a context for the utterance. ... (Abstract).

Any inquiry concerning this communication or earlier communications from the examiner should be directed to NADIRA SULTANA, whose telephone number is (571)-272-4048. The examiner can normally be reached Monday through Friday, 7:30 AM to 5:00 PM (EST). If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Richemond Dorvil, can be reached on (571)-272-7602. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-
/N.S./Examiner, Art Unit 2658              

/RICHEMOND DORVIL/Supervisory Patent Examiner, Art Unit 2658