Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
DETAILED ACTION
1.	This action is responsive to Application No. 16/989,012 and the remarks filed 3/14/2022.  All claims have been examined and are currently pending.
Information Disclosure Statement
2.	The information disclosure statement (IDS) submitted is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.
Response to Amendment
3.	Claims 1, 10, 16, and 20 have been amended.
Response to Arguments
4.	Applicant's arguments have been fully considered but are moot in view of the new grounds of rejection responsive to the amendments.
	Applicant argues that cited prior art Sypniewski does not teach different training phases.  Examiner respectfully disagrees, because the reference teaches training of different components.  To further advance prosecution, Examiner has also incorporated Yu to address Applicant’s intended claim direction, where the separate encoders are trained separately (see art rejections below).
	Regarding Applicant's arguments concerning a third training phase, Examiner respectfully disagrees, as all layers and stacks of the end-to-end system are trained (109) (see 

	Regarding claims 15, 17, and 19, Applicant argues against the use of prior art Chen on the ground that there is no motivation to combine.
	Examiner respectfully disagrees.  Similar to Sypniewski, Chen teaches spoken language understanding without speech recognition (title) for deriving semantics directly from the speech signal (abstract), where the use of the term intent corresponds to determining the semantics of the speech signal.  Chen also utilizes trained neural networks and thus the reference is of the same field of endeavor and could be incorporated with Sypniewski to improve upon the systems of Sypniewski.  

	The rejections of the additional claims are maintained based on arguments above and art rejections below.

Claim Rejections - 35 USC § 103
5.	In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  

6.	The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

7.	Claims 1-10, 13-14, 16, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Sypniewski et al (2020/0035222) in view of Yu et al (2015/0269933).

Regarding claim 1 Sypniewski teaches A method of spoken language understanding (abstract:  systems and methods for speech recognition and classification; 91 end-to-end speech classification), the method comprising: 
receiving audio data for a spoken language expression (7; 57 audio input); 
encoding the audio data using a multi-stage encoder comprising a basic encoder and a sequential encoder, wherein the basic encoder is trained to generate character features during a first training phase and the sequential encoder is trained to generate token features during a second training phase, wherein the first training phase is based on ground-truth character features, and wherein the second training phase is based on ground-truth token features (Application para 37 teaches where character corresponds to basic acoustic features and token corresponds to complex; abstract: multiple neural networks to form an end-to-end neural network; 56; 65 CNN can encode, CNN receive audio input…processes audio features to determine first set of features; phoneme representation; 
56; 80 RNN…sequential…receives acoustic features and outputs features related to words
 [0109] Turning to the method of training the neural networks, in some embodiments, all layers and stacks of an end-to-end speech recognition system 200, end-to-end speech classification system 700, or end-to-end phoneme recognition system 800 are… trained … based on training data that contains audio and an associated ground-truth output; 
56: the features produced by CNN stack 202 are entirely learned in the training process, and RNN stack 204 learns relationships between sounds and words through training as well); and 
decoding the token features to generate semantic information representing the spoken language expression (7 semantic; 91-94; 91 end-to-end speech classification, semantic topic; 94: output neural network stack…probability spoken words corresponds to associated classification).  
Sypniewski teaches end-to-end neural networks for speech recognition and classification that determine acoustic features, word features, and semantic information.  Training is performed using ground truth information.  Paragraph 56 also teaches the features produced by CNN stack 202 are entirely learned in the training process, and RNN stack 204 learns relationships between sounds and words through training as well.
The claim language of the independent claims has been amended to further focus on the different training phases.  However, the claim language can be read on separate portions or steps (phases) of an overall training process. 

To further advance prosecution, Yu teaches multiple neural networks that are trained separately (abstract: training a first neural network, training a second neural network; 22).  It would have been obvious to one of ordinary skill in the art before the effective filing date to allow each component of Sypniewski to be trained separately for more efficient processing by the components, and improved classification.


Regarding claim 2 Sypniewski teaches The method of claim 1, further comprising: 
generating a spectrogram based on the audio data (6; 60 spectrogram); and 
dividing the spectrogram into a plurality of frames, wherein the multi-stage encoder takes the frames as input (59 frames).  
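For illustration only, the framing step recited in claim 2 can be sketched as follows.  This is a minimal sketch and is not taken from the application or from Sypniewski; the function name divide_into_frames, the parameters frame_width and hop, and the toy spectrogram are all hypothetical, and the spectrogram is assumed to have been computed already.

```python
def divide_into_frames(spectrogram, frame_width, hop):
    """Group consecutive time steps of a spectrogram into fixed-width frames.

    spectrogram: list of time steps, each a list of frequency-bin magnitudes.
    Trailing time steps that do not fill a whole frame are dropped.
    """
    if frame_width <= 0 or hop <= 0:
        raise ValueError("frame_width and hop must be positive")
    frames = []
    start = 0
    while start + frame_width <= len(spectrogram):
        frames.append(spectrogram[start:start + frame_width])
        start += hop
    return frames


# Hypothetical toy spectrogram: 10 time steps, 2 frequency bins each.
toy_spectrogram = [[float(t), t + 0.5] for t in range(10)]
# Overlapping frames of 4 time steps, advancing 2 time steps per frame.
frames = divide_into_frames(toy_spectrogram, frame_width=4, hop=2)
```

Each resulting frame would then be supplied as one input to the encoder; the overlap between frames is an assumption of the sketch, not a requirement of the claim.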

Regarding claim 3 Sypniewski teaches The method of claim 2, further comprising: 
generating a sequence of character feature vectors using the basic encoder, wherein each of the sequence of character feature vectors corresponds to one of the frames (61-64: vectors, tensor; 65; 71-72); and 
generating a sequence of token feature vectors based on the sequence of character feature vectors using the sequential encoder (80-81 vector; 86-87).  

65; 71-72 acoustic feature representation).  

Regarding claim 5 Sypniewski teaches The method of claim 3, further comprising: generating a first token feature vector (80-81); and generating a second token feature vector based at least in part on the first token feature vector (80; 84: RNN processes tensors, output for each is dependent on previously processed frames).  

Regarding claim 6 Sypniewski teaches The method of claim 1, further comprising: 
identifying a decoding position (94: output; 118-119); 
computing a ratio between an output sequence of the multi-stage encoder and a subsequent input sequence (118-119); and 
computing a sum of encoder states based on the decoding position and the ratio, wherein the decoding is based on the sum of encoder states (118-119 attention mechanism for neural networks; memory; 121; as discussed in Application paras 72-74, the neural network needs to determine a position for the decoder and uses attention and sequence lengths, which is also discussed in the reference).  

Regarding claim 7 Sypniewski teaches The method of claim 1, further comprising: generating a response to the spoken language expression based on the semantic information (91-94 speech classification, semantic topic).  

Regarding claim 8 Sypniewski teaches The method of claim 1, wherein: the semantic information comprises contextual information (91 sports, politics, news).  

Regarding claim 9 Sypniewski teaches The method of claim 1, wherein: the semantic information includes attribute names, attribute values, or both (91 data that helps understand/classify input).  


Regarding claim 10 Sypniewski teaches A method of training a neural network for spoken language understanding, the method comprising: 
training a basic encoder to generate character features based on a spoken language expression in a first phase based on ground-truth character features (56; 65; 71-72); 
training a sequential encoder to generate token features based on the spoken language expression in a second phase based on ground-truth token features (56; 80); 
combining the basic encoder, the sequential encoder and a decoder in sequence to produce an end-to-end neural network for spoken language understanding (abstract; 54; 56; 91-94); and 
training the end-to-end network to generate semantic information for the spoken language expression (8 all layers and stacks of end-to-end system are trained; 13, 17 training; 56; 91 semantic; 109).
Sypniewski does not specifically teach, but Yu teaches, separate training for each encoder.
Recites limitations similar to claim 1 and is rejected for similar rationale and reasoning.


Regarding claim 13 Sypniewski teaches The method of claim 10, further comprising: 
appending a linear layer and a sequential decoder to layers of a basic model (56; 80-81 RNN…layers; 84 output); 
predicting token features for the spoken language expression using the linear layer and the sequential decoder (80: outputs features); 
comparing the predicted token features to ground-truth token features (56; 92; 109; 114 ground truth); and 
adjusting parameters of the layers of the basic model, the linear layer, and the sequential decoder based on the comparison (8; 13; 17; 56; 109; 114 - training/learning for sequential encoder).  

Regarding claim 14 Sypniewski teaches The method of claim 10, further comprising: 
predicting semantic information for the spoken language expression using the end-to-end network (7; 91); 
109; 114); and 
updating parameters of the basic encoder, the sequential encoder and the decoder based on the comparison
(training entire system 8; 56; 114 ground truth; 109 training, ground truth, and back propagation).  


Regarding claim 16 Sypniewski teaches An apparatus for spoken language understanding, comprising: 
a basic encoder configured to generate character features based on audio data for a spoken language expression, wherein the basic encoder is trained in a first training phase based on ground-truth character features; 
a sequential encoder configured to generate token features based on an output of the basic encoder, wherein the sequential encoder is trained during a second training phase based on ground-truth token features; and 
a decoder configured to generate semantic information for the spoken language expression based on an output of the sequential encoder, wherein the decoder is trained together with the basic encoder and the sequential encoder during a third training phase (54; 56; 109; 114 – where the entire system can be trained together).
Sypniewski does not specifically teach, but Yu teaches, separate training for each encoder.
Recites limitations similar to claim 1 and is rejected for similar rationale and reasoning.  


Regarding claim 20 Sypniewski and Yu teach The apparatus of claim 16, wherein: 
the apparatus is trained using an incremental training process including the first training phase for training the basic encoder, the second training phase for training the sequential encoder, and 
the third training phase for training the neural network as a whole (Sypniewski 8; 13; 17; 56; 109).
Rejected for similar rationale and reasoning as claim 1/16 (where Sypniewski teaches training the neural network as a whole)
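
For illustration only, the three-phase incremental training recited in claims 16 and 20 can be sketched with a toy model.  This is a minimal sketch under stated assumptions and does not reproduce the training of Sypniewski, Yu, or the application: each component is reduced to a single hypothetical scalar weight ("basic", "seq", "dec"), the ground-truth targets are invented, and gradients are estimated by central finite differences rather than back propagation.

```python
def mse(predict, params, inputs, targets):
    """Mean squared error of predict(params, x) against the targets."""
    return sum((predict(params, x) - t) ** 2
               for x, t in zip(inputs, targets)) / len(inputs)


def train(params, trainable, predict, inputs, targets, lr, steps, eps=1e-5):
    """Gradient descent that updates only the weights named in `trainable`.

    Weights not listed stay frozen.  Gradients use central differences.
    """
    params = dict(params)
    for _ in range(steps):
        grads = {}
        for name in trainable:
            up, down = dict(params), dict(params)
            up[name] += eps
            down[name] -= eps
            grads[name] = (mse(predict, up, inputs, targets)
                           - mse(predict, down, inputs, targets)) / (2 * eps)
        for name in trainable:
            params[name] -= lr * grads[name]
    return params


def basic_encoder(p, x):        # stands in for character features
    return p["basic"] * x


def sequential_encoder(p, x):   # stands in for token features
    return p["seq"] * basic_encoder(p, x)


def end_to_end(p, x):           # stands in for semantic output
    return p["dec"] * sequential_encoder(p, x)


# Hypothetical toy data: inputs with invented ground truth for each stage.
inputs = [0.5, 1.0, 1.5, 2.0]
char_truth = [2.0 * x for x in inputs]
token_truth = [6.0 * x for x in inputs]
semantic_truth = [3.0 * x for x in inputs]

init = {"basic": 0.1, "seq": 0.1, "dec": 0.1}
# Phase 1: only the basic encoder is trained, against character ground truth.
phase1 = train(init, ["basic"], basic_encoder, inputs, char_truth, 0.01, 500)
# Phase 2: only the sequential encoder is trained, against token ground truth.
phase2 = train(phase1, ["seq"], sequential_encoder, inputs, token_truth, 0.01, 500)
# Phase 3: all components are trained together, end to end.
phase3 = train(phase2, ["basic", "seq", "dec"], end_to_end, inputs,
               semantic_truth, 0.001, 2000)
```

The sketch only shows the distinction at issue: the first two phases each update one component against its own ground truth while the others are frozen, and the third phase updates the network as a whole.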


8.	Claims 11 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Sypniewski et al (2020/0035222) in view of Yu et al in further view of Zhao (2021/0312905).

Regarding claim 11 Sypniewski teaches The method of claim 10, further comprising: 
appending one or more linear layers [with a log-softmax output function] to the basic encoder (56 CNN learned in training process; 66 layers); 
6 CNN processes audio features); 
comparing the predicted character features to ground-truth character features (109; 114 ground truth); and 
adjusting parameters of the basic encoder based on the comparison (56; 109; 114 - training and implementing NN and continue to update and learn based on outputs),
but does not specifically teach, where Zhao teaches, a neural network with log-softmax (Zhao 31, 33, 41).
It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate Zhao, presenting a reasonable expectation of success in neural network training and performance.
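
For illustration only, the log-softmax output function referred to in claim 11 can be sketched as below.  This is the standard mathematical definition, log_softmax(x_i) = x_i - log(sum_j exp(x_j)), written in a numerically stable form; it is not taken from Zhao or from the application, and the example scores are hypothetical.

```python
import math


def log_softmax(scores):
    """Return log(softmax(scores)) for a list of real-valued scores.

    Subtracting the maximum before exponentiating avoids overflow.
    """
    peak = max(scores)
    log_total = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return [s - log_total for s in scores]


# Hypothetical output of a linear layer over three character classes.
example_scores = [1000.0, 1001.0, 1002.0]
log_probs = log_softmax(example_scores)
```

Exponentiating the outputs yields probabilities that sum to one, which is why such a function is commonly placed after a linear layer when predicted features are compared to ground truth.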

Regarding claim 18 Sypniewski and Zhao teach The apparatus of claim 16, wherein: the basic encoder is trained using a linear layer with a log-softmax output function.  
	Rejected for similar rationale and reasoning as claim 11.

9.	Claim 12 is rejected under 35 U.S.C. 103 as being unpatentable over Sypniewski et al (2020/0035222) in view of Yu in further view of Zhao (2021/0312905) in further view of Vasconcelos et al (2021/0012769).


removing the log-softmax output function prior to combining the basic encoder, the sequential encoder, and the decoder (Vasconcelos 107; 151 softmax layer may be removed).
It would have been obvious to one of ordinary skill in the art before the effective filing date to remove the softmax layer to allow for receipt of different outputs for neural network implementation.



10.	Claims 15, 17, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Sypniewski et al (2020/0035222) in view of Yu in further view of Chen (Chen, Yuan-Ping, Ryan Price, and Srinivas Bangalore.  “Spoken Language Understanding Without Speech Recognition.”  2018 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP).  IEEE, 2018.)

Regarding claim 15 Sypniewski does not specifically teach where Chen teaches The method of claim 10, further comprising: training the end-to-end network is based on a connectionist temporal classification (CTC) loss (Chen 3.1 CTC).
It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate CTC for improved network training.
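
For illustration only, a connectionist temporal classification (CTC) loss of the kind referred to in claim 15 can be sketched with the standard forward recursion.  This is a minimal sketch and is not the implementation of Chen or of the application: the function name ctc_loss and the toy probabilities are hypothetical, and the computation is done directly on probabilities, which would underflow for long sequences (practical implementations work in the log domain).

```python
import math


def ctc_loss(probs, target, blank=0):
    """Negative log-likelihood of `target` summed over all CTC alignments.

    probs: one probability distribution over labels per input frame.
    target: label indices without blanks.
    """
    # Extended sequence: blank, l1, blank, l2, ..., blank.
    ext = [blank]
    for label in target:
        ext += [label, blank]
    size = len(ext)

    # alpha[s]: probability of all partial alignments ending at ext[s].
    alpha = [0.0] * size
    alpha[0] = probs[0][ext[0]]
    if size > 1:
        alpha[1] = probs[0][ext[1]]

    for frame in probs[1:]:
        new_alpha = [0.0] * size
        for s in range(size):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # A blank may be skipped only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new_alpha[s] = total * frame[ext[s]]
        alpha = new_alpha

    likelihood = alpha[-1] + (alpha[-2] if size > 1 else 0.0)
    return math.inf if likelihood == 0.0 else -math.log(likelihood)


# Hypothetical toy input: 2 frames, labels {0: blank, 1: "a"}, uniform scores.
toy_probs = [[0.5, 0.5], [0.5, 0.5]]
loss_a = ctc_loss(toy_probs, [1])
```

In this toy case the alignments "a, blank", "blank, a" and "a, a" all collapse to the target "a", so the likelihood is 3 x 0.25 = 0.75 and the loss is -log(0.75).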


Regarding claim 17 Sypniewski does not specifically teach where Chen teaches The apparatus of claim 16, wherein: the basic encoder comprises one or more convolutional neural network (CNN) layers and one or more recurrent neural network (RNN) layers (Chen 3.1 acoustic model component is a grapheme based network with convolutional and recurrent layers trained with Connectionist Temporal Classification).  
	Sypniewski already teaches where the basic encoder can be a CNN.
	It would have been obvious to one of ordinary skill in the art before the effective filing date to incorporate Chen for improved network implementation (allowing the additional layers to be incorporated into the encoder for improved and more efficient feature processing).
	
Regarding claim 19 Sypniewski teaches The apparatus of claim 16, wherein: the sequential encoder comprises [one or more CNN layers,] one or more RNN layers, a linear layer, and a sequential decoder (80-86);  
but does not specifically teach, where Chen teaches, that the encoder comprises one or more CNN layers. 
Rejected for similar rationale and reasoning as claim 17.



Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP 
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHAUN A ROBERTS whose telephone number is (571)270-7541.  The examiner can normally be reached Monday-Friday 9-5 EST.

If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Flanders, can be reached at 571-272-7516.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  

/SHAUN ROBERTS/
Primary Examiner, Art Unit 2655