DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/process/file/efs/guidance/eTD-info-I.jsp.
Claims 1-22 are rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1-26 of U.S. Patent No. 10929754; and over claims 1-20 of U.S. Patent No. 10593352. Although the claims at issue are not identical, they are not patentably distinct from each other because they are obvious variants of the same invention.  Furthermore, the claims of the patents anticipate the claims of the application.

Claims of USPN 10929754 			Claims of application
1. A method of training a multidomain endpointer model, the method comprising: obtaining, by data processing hardware, training data comprising: a first training set of short-form speech utterances; and a second training set of long-form speech utterances; and for each short-form speech utterance in the first training set of short-form speech utterances: providing, by the data processing hardware, the corresponding short-form speech utterance as input to a shared neural network of the multidomain neural network, the shared neural network configured to learn shared hidden representations suitable for both voice activity detection (VAD) and end-of-query (EOQ) detection; generating, by the data processing hardware, using a VAD classifier output layer of the multidomain endpointer model, a short-form speech sequence of predicted VAD labels each comprising one of a predicted VAD speech label or a predicted VAD silence label; determining, by the data processing hardware, a short-form speech VAD loss associated with the corresponding short-form speech utterance by comparing the short-form speech sequence of predicted VAD labels to a corresponding short-form speech sequence of reference VAD labels for the corresponding short-form speech utterance using forced alignment; generating, by the data processing hardware, using an EOQ classifier output layer of the multidomain endpointer model, a sequence of predicted EOQ labels each comprising one of a predicted EOQ speech label, a predicted EOQ initial silence label, a predicted EOQ intermediate silence label, or a predicted EOQ final silence label; determining, by the data processing hardware, an EOQ loss associated with the corresponding short-form speech utterance by comparing the short-form speech sequence of predicted EOQ labels to a corresponding short-form speech sequence of reference EOQ labels for the corresponding short-form speech utterance using forced alignment; and training, by the data processing hardware, using cross-entropy criterion, the multidomain endpointer model based on the short-form speech VAD loss and the EOQ loss.
2. The method of claim 1, further comprising, for each long-form speech utterance in the second training set of long-form speech utterances: providing, by the data processing hardware, the corresponding long-form speech utterance as input to the shared neural network of the multidomain neural network; generating, by the data processing hardware, using the VAD classifier output layer of the multidomain endpointer model, a long-form speech sequence of predicted VAD labels each comprising one of the predicted VAD speech label or the predicted VAD silence label; determining, by the data processing hardware, a long-form speech VAD loss by comparing the long-form speech sequence of predicted VAD labels to a corresponding long-form speech sequence of reference VAD labels for the corresponding long-form speech utterance using force alignment; and training, by the data processing hardware, using the cross-entropy criterion, the multidomain endpointer model using the long-form speech VAD loss.
3. The method of claim 2, wherein: providing the corresponding short-form speech utterance as input to the shared neural network comprises providing short-form domain information as an additional input to the shared neural network, the short-form domain information indicating that the corresponding short-form speech utterance is associated with a short-form speech domain, and providing the corresponding long-form speech utterance as input to the shared neural network comprises providing long-form domain information as an additional input to the shared neural network, the long-form speech domain information indicating that the corresponding long-form speech utterance is associated with a long-form speech domain.
4. The method of claim 3, wherein: each short-form speech utterance in the first training set of short-form speech utterances comprises a corresponding sequence of short-form acoustic features representing the short-form speech utterance; each long-form speech utterance in the second training set of long-form speech utterances comprises a corresponding sequence of long-form acoustic features representing the long-form speech utterance; providing the corresponding short-form speech utterance and the short-form domain information as inputs to the shared neural network comprises: for each short-form acoustic feature of the sequence of short-form acoustic features, generating, using a domain encoder layer, a corresponding domain-aware hidden speech representation by converting a concatenation between the short-form acoustic feature and a short-form domain index, the short-form domain index representing the short-form domain information; and providing, as input to the shared neural network, the corresponding domain-aware hidden speech representations generated for the sequence of short-form acoustic features; and providing the corresponding long-form speech utterance and the long-form domain information as inputs to the shared neural network comprises: for each long-form acoustic feature of the sequence of long-form acoustic features, generating, using the domain encoder layer, a corresponding domain-aware hidden speech representation by converting a concatenation between the long-form acoustic feature and a long-form domain index, the long-form domain index representing the long-form domain information; and providing, as input to the shared neural network, the corresponding domain-aware hidden speech representations generated for the sequence of long-form acoustic features.
5. The method of claim 4, wherein the short-form and long-form domain indexes comprise categorical integers.
6. The method of claim 1, further comprising thresholding, by the data processing hardware, framewise posteriors of the predicted EOQ final silence labels generated for each of the short-form speech utterances to obtain a hard microphone closing decision.
7. The method of claim 1, further comprising, for each short-form speech utterance in the first training set of short-form speech utterances, predicting, by the data processing hardware, using the VAD classifier output layer, an EOQ decision upon detecting a duration of silence in the corresponding short-form speech utterance that satisfies a time threshold.
8. The method of claim 1, wherein the shared neural network comprises a unified convolutional, long short-term memory, deep neural network (CLDNN) having a unidirectional architecture.
9. The method of claim 8, wherein the shared neural network comprising the CLDNN comprises: a convolutional input layer; a first feedforward deep neural network (DNN) layer configured to receive, as input during each of a plurality of time steps, an output of the convolutional input layer; one or more long short-term memory (LSTM) layers, and a second feedforward DNN layer.
10. The method of claim 1, wherein: the VAD classifier output layer comprises a first softmax output layer configured to receive, as input during each of a plurality of time steps, an output of the second feedforward DNN layer of the CLDNN; and the EOQ classifier output layer comprises a second softmax output layer configured to receive, as input during each of the plurality of time steps, the output of the second feedforward DNN layer of the CLDNN.
11. The method of claim 1, wherein the first training set of short-form speech utterances each comprise a duration that is shorter than a duration of each of the long-form speech utterances of the second training set of long-form speech utterances.
12. The method of claim 1, wherein each short-form speech utterance in the first training set of short-form speech utterances is associated with one of a voice query or a voice command.
13. The method of claim 1, wherein each long-form speech utterance in the second training set of long-form speech utterances comprises a duration of at least ten seconds.
Claims 14-26 are similar to claims 1-13 above.
1. A computer-implemented method when executed on data processing hardware causes the data processing hardware to perform operations comprising: receiving, as input to a multidomain endpointer model, a sequence of audio features representing an utterance captured by a microphone of a user device, the multidomain endpointer model comprising a shared neural network trained on: a first training set of short-form speech utterances; and a second training set of long-form speech utterances; generating, as output from the multidomain endpointer model, a sequence of predicted end-of-query (EOQ) speech labels comprising a predicted EOQ speech label, a predicted EOQ initial silence label, a predicted EOQ intermediate silence label, and a predicted EOQ final silence label; and when the predicted EOQ final silence label is output from the multidomain endpointer model, obtaining a hard microphone closing decision that causes the user device to endpoint the utterance by deactivating the microphone.  

2. The computer-implemented method of claim 1, wherein: each short-form speech utterance in the first training set of short-form speech utterances comprises a corresponding short-form speech sequence of reference EOQ labels comprising a reference EOQ speech label, a reference EOQ initial silence label, a reference EOQ intermediate silence label, and a reference EOQ final silence label; and each long-form speech utterance in the second training set of long-form speech utterances comprises a corresponding long-form speech sequence of reference voice activity detection (VAD) comprising a reference VAD speech label and a reference VAD silence label.  

3. The computer-implemented method of claim 1, wherein the shared neural network of the multidomain endpointer model is trained to learn shared hidden representations suitable for both voice activity detection (VAD) and EOQ detection.  
4. The computer-implemented method of claim 1, wherein the operations further comprise: generating, as output from the multidomain endpointer model, a sequence of predicted VAD speech labels comprising a predicted VAD speech label and a predicted VAD silence label, wherein obtaining the hard microphone closing decision is further based on when the predicted VAD silence label is output from multidomain endpointer model.  
5. The computer-implemented method of claim 1, wherein the multidomain endpointer model comprises: a voice activity detection (VAD) classifier output layer configured to output a sequence of predicted VAD labels, the VAD classifier layer trained on: the first training set of short-form speech utterances; and the second set of long-form speech utterances; and an EOQ classifier output layer configured to output the sequence of predicted EOQ speech labels in parallel with the sequence of predicted VAD labels output by the VAD classifier output layer, the EOQ classifier output layer trained on the first training set of short-form speech utterances while excluding the second training set of long-form speech utterances.  

6. The computer-implemented method of claim 5, wherein the multidomain endpointer model comprises a unified convolutional, long short-term memory, deep neural network (CLDNN) having a unidirectional architecture.  

7. The computer-implemented method of claim 6, wherein the multidomain endpointer model comprising the CLDNN comprises: a convolutional input layer; a first feedforward deep neural network (DNN) layer configured to receive, as input during each of a plurality of time steps, an output of the convolutional input layer; one or more long short-term memory (LSTM) layers; and a second feedforward DNN layer.

8. The computer-implemented method of claim 7, wherein: the VAD classifier output layer comprises a first softmax output layer configured to receive, as input during each of a plurality of time steps, an output of the second feedforward DNN layer of the CLDNN; and the EOQ classifier output layer comprises a second softmax output layer configured to receive, as input during each of the plurality of time steps, the output of the second feedforward DNN layer of the CLDNN.  

9. The computer-implemented method of claim 1, wherein the first training set of short-form speech utterances each comprise a duration that is shorter than a duration of each of the long-form speech utterances of the second training set of long-form speech utterances.  

10. The computer-implemented method of claim 1, wherein each short-form speech utterance in the first training set of short-form speech utterances is associated with one of a voice query or a voice command.  

11. The computer-implemented method of claim 1, wherein each long-form speech utterance in the second training set of long-form speech utterances comprises a duration of at least ten seconds.

Claims 12-22 are similar to claims 1-11 above.
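
As a reading aid only, the endpointing step common to both claim sets (claim 6 of USPN 10929754, thresholding framewise posteriors of the predicted EOQ final silence label to obtain a hard microphone closing decision, and the final step of application claim 1) can be sketched as follows. This is a minimal sketch and not the implementation of either document: the function name first_mic_close_frame, the ordering of the four labels, the threshold of 0.8, and the posterior values are all hypothetical and chosen for illustration.

```python
from typing import Optional, Sequence

# Hypothetical ordering of the four EOQ classes named in the claims.
EOQ_LABELS = ("speech", "initial_silence", "intermediate_silence", "final_silence")
FINAL_SILENCE = EOQ_LABELS.index("final_silence")


def first_mic_close_frame(
    posteriors: Sequence[Sequence[float]], threshold: float
) -> Optional[int]:
    """Return the index of the first frame whose final-silence posterior
    meets the threshold, or None if no frame does."""
    for frame_index, frame in enumerate(posteriors):
        if frame[FINAL_SILENCE] >= threshold:
            return frame_index
    return None


# Made-up per-frame posteriors over the four labels (each row sums to 1).
example_posteriors = [
    [0.05, 0.90, 0.03, 0.02],  # mostly initial silence
    [0.92, 0.02, 0.04, 0.02],  # mostly speech
    [0.30, 0.00, 0.55, 0.15],  # mostly intermediate silence
    [0.05, 0.00, 0.10, 0.85],  # mostly final silence
]
close_frame = first_mic_close_frame(example_posteriors, threshold=0.8)
```

In this sketch the microphone would be closed at the first frame where the final-silence posterior reaches the assumed threshold; a higher threshold delays or suppresses the closing decision.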





Claims of USPN 10593352				Claims of application
1. A computer-implemented method comprising: receiving audio data that corresponds to an utterance spoken by a user; applying, to the audio data, an end of query model that (i) is configured to determine a confidence score that reflects a likelihood that the utterance is a complete utterance and (ii) was trained using audio data from complete utterances and from incomplete utterances; based on applying the end of query model that (i) is configured to determine the confidence score that reflects the likelihood that the utterance is a complete utterance and (ii) was trained using the audio data from the complete utterances and from the incomplete utterances, determining the confidence score that reflects a likelihood that the utterance is a complete utterance; comparing the confidence score that reflects the likelihood that the utterance is a complete utterance to a confidence score threshold; based on comparing the confidence score that reflects the likelihood that the utterance is a complete utterance to the confidence score threshold, determining whether the utterance is likely complete or likely incomplete; and based on determining whether the utterance is likely complete or likely incomplete, providing, for output, an instruction to (i) maintain a microphone that is receiving the utterance in an active state or (ii) deactivate the microphone that is receiving the utterance.
2. The method of claim 1, comprising: based on comparing the confidence score that reflects the likelihood that the utterance is a complete utterance to the confidence score threshold, determining that the confidence score satisfies the confidence score threshold, wherein determining whether the utterance is likely complete or likely incomplete comprises determining the utterance is likely complete based on determining that the confidence score satisfies the confidence score threshold, wherein providing, for output, an instruction to (i) maintain a microphone that is receiving the utterance in an active state or (ii) deactivate the microphone that is receiving the utterance comprises providing, for output, the instruction to deactivate the microphone that is receiving the utterance, generating a transcription of the audio data, and providing, for output, the transcription.
3. The method of claim 2, comprising: receiving, from a user, data confirming that the user finished speaking; and based on receiving the data confirming that the user finished speaking, updating the end of query model.
4. The method of claim 1, comprising: based on comparing the confidence score that reflects the likelihood that the utterance is a complete utterance to the confidence score threshold, determining that the confidence score does not satisfy the confidence score threshold, wherein determining whether the utterance is likely complete or likely incomplete comprises determining the utterance is likely incomplete based on determining that the confidence score does not satisfy the confidence score threshold, and wherein providing, for output, an instruction to (i) maintain a microphone that is receiving the utterance in an active state or (ii) deactivate the microphone that is receiving the utterance comprises providing, for output, the instruction to maintain the microphone in an active state.
5. The method of claim 1, comprising: receiving audio data of multiple complete utterances and multiple incomplete utterances; and training, using machine learning, the end of query model using the audio data of the multiple complete utterances and the multiple incomplete utterances.
6. The method of claim 1, wherein the end of query model is configured to determine the confidence score that reflects the likelihood that the utterance is a complete utterance based on acoustic speech characteristics of the utterance that include pitch, loudness, intonation, sharpness, articulation, roughness, instability, and speech rate.
7. The method of claim 1, comprising: determining that a speech decoder that is configured to generate a transcription of the audio data and that is configured to determine whether the utterance is likely complete or likely incomplete has not determined whether the utterance is likely complete or likely incomplete, wherein determining whether the utterance is likely complete or likely incomplete is based on only comparing the confidence score that reflects the likelihood that the utterance is a complete utterance to the confidence score threshold.
8. The method of claim 7, wherein the speech decoder uses a language model to determine whether the utterance is likely complete or likely incomplete.
9. The method of claim 1, comprising: determining that a speech decoder that is configured to generate a transcription of the audio data and that is configured to determine whether the utterance is likely complete or likely incomplete has determined whether the utterance is likely complete or likely incomplete, wherein determining whether the utterance is likely complete or likely incomplete is based on (i) comparing the confidence score that reflects the likelihood that the utterance is a complete utterance to the confidence score threshold and (ii) the speech decoder determining whether the utterance is likely complete or likely incomplete.
Claims 10-20 are similar to claims 1-9 above.
1. A computer-implemented method when executed on data processing hardware causes the data processing hardware to perform operations comprising: receiving, as input to a multidomain endpointer model, a sequence of audio features representing an utterance captured by a microphone of a user device, the multidomain endpointer model comprising a shared neural network trained on: a first training set of short-form speech utterances; and a second training set of long-form speech utterances; generating, as output from the multidomain endpointer model, a sequence of predicted end-of-query (EOQ) speech labels comprising a predicted EOQ speech label, a predicted EOQ initial silence label, a predicted EOQ intermediate silence label, and a predicted EOQ final silence label; and when the predicted EOQ final silence label is output from the multidomain endpointer model, obtaining a hard microphone closing decision that causes the user device to endpoint the utterance by deactivating the microphone.  

2. The computer-implemented method of claim 1, wherein: each short-form speech utterance in the first training set of short-form speech utterances comprises a corresponding short-form speech sequence of reference EOQ labels comprising a reference EOQ speech label, a reference EOQ initial silence label, a reference EOQ intermediate silence label, and a reference EOQ final silence label; and each long-form speech utterance in the second training set of long-form speech utterances comprises a corresponding long-form speech sequence of reference voice activity detection (VAD) comprising a reference VAD speech label and a reference VAD silence label.  

3. The computer-implemented method of claim 1, wherein the shared neural network of the multidomain endpointer model is trained to learn shared hidden representations suitable for both voice activity detection (VAD) and EOQ detection.  

4. The computer-implemented method of claim 1, wherein the operations further comprise: generating, as output from the multidomain endpointer model, a sequence of predicted VAD speech labels comprising a predicted VAD speech label and a predicted VAD silence label, wherein obtaining the hard microphone closing decision is further based on when the predicted VAD silence label is output from multidomain endpointer model.  

5. The computer-implemented method of claim 1, wherein the multidomain endpointer model comprises: a voice activity detection (VAD) classifier output layer configured to output a sequence of predicted VAD labels, the VAD classifier layer trained on: the first training set of short-form speech utterances; and the second set of long-form speech utterances; and an EOQ classifier output layer configured to output the sequence of predicted EOQ speech labels in parallel with the sequence of predicted VAD labels output by the VAD classifier output layer, the EOQ classifier output layer trained on the first training set of short-form speech utterances while excluding the second training set of long-form speech utterances.  

6. The computer-implemented method of claim 5, wherein the multidomain endpointer model comprises a unified convolutional, long short-term memory, deep neural network (CLDNN) having a unidirectional architecture.  

7. The computer-implemented method of claim 6, wherein the multidomain endpointer model comprising the CLDNN comprises: a convolutional input layer; a first feedforward deep neural network (DNN) layer configured to receive, as input during each of a plurality of time steps, an output of the convolutional input layer; one or more long short-term memory (LSTM) layers; and a second feedforward DNN layer.

8. The computer-implemented method of claim 7, wherein: the VAD classifier output layer comprises a first softmax output layer configured to receive, as input during each of a plurality of time steps, an output of the second feedforward DNN layer of the CLDNN; and the EOQ classifier output layer comprises a second softmax output layer configured to receive, as input during each of the plurality of time steps, the output of the second feedforward DNN layer of the CLDNN.  

9. The computer-implemented method of claim 1, wherein the first training set of short-form speech utterances each comprise a duration that is shorter than a duration of each of the long-form speech utterances of the second training set of long-form speech utterances.  

10. The computer-implemented method of claim 1, wherein each short-form speech utterance in the first training set of short-form speech utterances is associated with one of a voice query or a voice command.  

11. The computer-implemented method of claim 1, wherein each long-form speech utterance in the second training set of long-form speech utterances comprises a duration of at least ten seconds.

Claims 12-22 are similar to claims 1-11 above.
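
As a reading aid only, the decision logic recited in claims 1, 2, and 4 of USPN 10593352 (comparing a confidence score to a confidence score threshold and then instructing the microphone to stay active or deactivate) can be sketched as follows. This is a minimal sketch under stated assumptions: the names MicInstruction and microphone_instruction are hypothetical, and "satisfies the confidence score threshold" is read here as "greater than or equal to", which the claims do not specify.

```python
from enum import Enum


class MicInstruction(Enum):
    """Hypothetical names for the two instructions recited in the claim."""
    MAINTAIN_ACTIVE = "maintain"
    DEACTIVATE = "deactivate"


def microphone_instruction(confidence_score: float, threshold: float) -> MicInstruction:
    """Map an end-of-query confidence score to a microphone instruction."""
    # Assumption: "satisfies the threshold" is interpreted as score >= threshold.
    utterance_likely_complete = confidence_score >= threshold
    if utterance_likely_complete:
        return MicInstruction.DEACTIVATE
    return MicInstruction.MAINTAIN_ACTIVE
```

Under these assumptions, a score of 0.9 against a threshold of 0.7 yields the deactivate instruction, while a score of 0.4 keeps the microphone active.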



Allowable Subject Matter
Claims 1-22 are allowable and will be allowed when the double patenting issue is resolved.  The following is an examiner’s statement of reasons for allowance: Nagano et al. (USPG 2018/0130460), considered the closest prior art, disclose a computer-implemented method when executed on data processing hardware causes the data processing hardware to perform operations comprising: receiving, as input to a multidomain endpointer model, a sequence of audio features representing an utterance captured by a microphone of a user device, the multidomain endpointer model comprising a shared neural network (figure 4, steps 410C-D; training the EOQ classifier) trained on: a first training set of short-form speech utterances (figure 4, steps 410A-B, training dataset can be short and/or long speech utterances); and a second training set of long-form speech utterances (figure 4, steps 410A-B, training dataset can be short and/or long speech utterances). 
Jiang (USPG 2020/0327168) teaches a process in which segmented sentences are generated and inputted into a pre-trained segmented sentence recognition model to determine whether the input sentence is a complete sentence based on a recurrent neural network language model (see figures 2-3 and the abstract).
Vickers (USPG 2016/0093313) teaches a neural network-based voice activity detection process in which speech is processed and inputted into a neural network to determine a VAD estimate (paragraphs 5 and 61-62; also see figure 1); and generating, by the data processing hardware, using a VAD classifier output layer of the multidomain endpointer model, a short-form speech sequence of predicted VAD labels each comprising one of a predicted VAD speech label or a predicted VAD silence label (paragraphs 5 and 61-62; also see figure 1).
The prior art of record, individually or in combination, fails to explicitly disclose the combination of the following limitations: “generating, as output from the multidomain endpointer model, a sequence of predicted end-of-query (EOQ) speech labels comprising a predicted EOQ speech label, a predicted EOQ initial silence label, a predicted EOQ intermediate silence label, and a predicted EOQ final silence label; and when the predicted EOQ final silence label is output from the multidomain endpointer model, obtaining a hard microphone closing decision that causes the user device to endpoint the utterance by deactivating the microphone.”  Furthermore, it would not have been obvious to one of ordinary skill in the art to modify the prior art in order to arrive at the claimed invention.  Therefore, claims 1-22 are allowable once the double patenting rejection is overcome.
Any comments considered necessary by applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee.  Such submissions should be clearly labeled “Comments on Statement of Reasons for Allowance.”

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.  Finkelstein et al. (USPN 9355191) teaches a process using query completions to train a NN, and Sainath et al. (USPG 2017/0092297) teaches a process using a neural network to classify a signal.  These two references are considered pertinent to the claimed invention.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to HUYEN X VO whose telephone number is (571)272-7631. The examiner can normally be reached M-F, 8-4.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached at 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/HUYEN X VO/Primary Examiner, Art Unit 2656