DETAILED ACTION
This action is in response to the initial filing of Application no. 17/728,713 on 04/25/2022.
Claims 1 – 20 are still pending in this application, with claims 1 and 10 being independent.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/process/file/efs/guidance/eTD-info-I.jsp.

Claims 1 – 4, 7, 10 – 13 and 16 are rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1, 4 – 8, 10, 12 – 14 and 16 – 20 of U.S. Patent No. 11,043,209 in view of Ali et al. (“Word Error Rate Estimation for Speech Recognition: e-WER”). Although the claims at issue are not identical, they are not patentably distinct from each other.

The claim mapping is as follows.

Current Application

1. A method for training a neural network to transcribe a media file, the method comprising: segmenting the media file into a plurality of segments; inputting each segment, one segment at a time, of the plurality of segments into a first neural network trained to perform speech recognition; extracting outputs, one segment at a time, from one or more layers of the first neural network; and training a second neural network to generate a predicted-WER (word error rate) of a plurality of transcription engines for each segment based at least on outputs from the one or more layers of the first neural network.

2. The method of claim 1, wherein training the second neural network to generate a predicted-WER of the plurality of transcription engines further comprises: transcribing each segment using the plurality of transcription engines to generate a transcription of each segment; generating a WER of each transcription engine for each segment based at least on ground truth data and the transcription of each segment; and training the second neural network to learn relationships between the generated WER of each transcription engine and outputs from the one or more layers of the first neural network for each segment.

3. The method of claim 1, wherein the first neural network comprises a deep neural network.

4. The method of claim 3, wherein the deep neural network comprises a recurrent neural network, and the second neural network comprises a convolutional neural network.

5. The method of claim 4, wherein the convolution neural network comprises two hidden layers and a pooling layer in between the two hidden layers.

6. The method of claim 1, wherein extracting outputs from one or more layers of the first neural network comprises extracting outputs from a last hidden layer of the deep neural network.

7. The method of claim 1, wherein extracting outputs from one or more layers of the first neural network comprises extracting outputs from a first and last hidden layers of the deep neural network.

8. The method of claim 1, further comprising using an autoencoder neural network to reduce a number of input features from each segment such that a number of outputs from the first neural network are reduced.

9. The method of claim 8, wherein the autoencoder comprises approximately 256 channels.

10. A system for training a neural network to transcribe a media file, the system comprising: a memory; and one or more processors coupled to the memory, the one or more processor configured to: segment the media file into a plurality of segments; input each segment of the plurality of segments into a first neural network trained to perform speech recognition; extract outputs from one or more layers of the first neural network; and train a second neural network to generate a predicted-WER (word error rate) of a plurality of transcription engines for each segment based at least on outputs from the one or more layers of the first neural network.

11. The system of claim 10, wherein the one or more processors are configured to train the second neural network to generate a predicted-WER further comprises configuring the one or more processor to: transcribe each segment using the plurality of transcription engines to generate a transcription of each segment; generate a WER of each transcription engine for each segment based at least on ground truth data and the transcription of each segment; and train the second neural network to learn relationships between the generated WER of each transcription engine and outputs from the one or more layers of the first neural network for each segment.

12. The system of claim 10, wherein the first neural network comprises a deep neural network.

13. The system of claim 12, wherein the deep neural network comprises a recurrent neural network, and the second neural network comprises a convolutional neural network.

14. The system of claim 13, wherein the convolution neural network comprises two hidden layers and a pooling layer in between the two hidden layers.

15. The system of claim 10, wherein the one or more processors are configured to extract outputs from one or more layers of the first neural network further comprises configuring the one or more processors to extract outputs from a last hidden layer of the deep neural network.

16. The system of claim 10, wherein the one or more processors are configured to extract outputs from one or more layers of the first neural network further comprises configuring the one or more processors to extract outputs from a first and last hidden layers of the deep neural network.

17. The system of claim 10, wherein the one or more processors are further configured to use an autoencoder neural network to reduce a number of input features from each segment such that a number of outputs from the one or more layers of the first neural network are reduced.

18. The system of claim 17, wherein the autoencoder comprises approximately 256 channels.

19. The system of claim 10, wherein the media file is segmented into segments having a duration ranging between 2 to 10 seconds.

20. The system of claim 19, wherein each segment comprises a 5-second segment.

US 11,043,209

1. A method for transcribing a media file, the method comprising: segmenting the media file into a plurality of segments; extracting, using a first neural network, audio features of a first and second segment of the plurality of segments, wherein the first neural network is trained to perform speech recognition; and identifying, using a second neural network, a best-candidate engine for each of the first and second segments based at least on audio features of the first and second segments, wherein the best-candidate engine is a neural network having a highest predicted transcription accuracy among a collection of neural networks.

2. The method of claim 1, further comprising: requesting a first best-candidate engine for the first segment to transcribe the first segment; requesting a second best-candidate engine for the second segment to transcribe the second segment; receiving a first transcribed portion of the first segment from the first best-candidate engine in response to requesting the first best-candidate engine to transcribe the first segment; receiving a second transcribed portion of the second segment from the second best-candidate engine in response to requesting the second best-candidate engine to transcribe the second segment; and generating a merged transcription using the first and second transcribed portions.

3. The method of claim 1, wherein segmenting the media file comprises segmenting the media file at location of the media file where no speech is detected.

4. The method of claim 1, wherein extracting using the first neural network comprises using a deep neural network to extract audio features of the first and second segments.

5. The method of claim 4, wherein using the deep neural network to extract audio features comprises using outputs of one or more hidden layers of the deep neural network as inputs to the second neural network.

6. The method of claim 5, wherein the deep neural network comprises a recurrent neural network, and the second neural network comprises a convolutional neural network.

7. The method of claim 5, wherein using outputs of one or more hidden layers of the deep neural network as inputs comprises using outputs of a last hidden layer of the deep neural network as inputs to the second neural network.

8. The method of claim 1, wherein the second neural network is trained to predict a word error rate (WER) of a plurality of transcription engines based at least on audio features extracted from each segment.

9. The method of claim 8, wherein identifying the best-candidate engine for each of the first and second segments comprises identifying a transcription engine with a lowest WER for each segment.

10. A system for transcribing a media file, the system comprising: a memory; and one or more processors coupled to the memory, the one or more processor configured to: segment the media file into a plurality of segments; extract, using a first neural network, audio features of a first and second segment of the plurality of segments, wherein the first neural network is trained to perform speech recognition; and identify, using a second neural network, a best-candidate engine for each of the first and second segments based at least on audio features of the first and second segments, wherein the best-candidate engine is a neural network having a highest predicted transcription accuracy among a collection of neural networks.

11. The system of claim 10, wherein the one or more processors are further configured to: request a first best-candidate engine for the first segment to transcribe the first segment; request a second best-candidate engine for the second segment to transcribe the second segment; receive a first transcribed portion of the first segment from the first best-candidate engine in response to requesting the first best-candidate engine to transcribe the first segment; receive a second transcribed portion of the second segment from the second best-candidate engine in response to requesting the second best-candidate engine to transcribe the second segment; and generate a merged transcription using the first and second transcribed portions.

12. The system of claim 10, wherein the one or more processors are configured to extract audio features of the first and second segments using a deep neural network.

13. The system of claim 12, wherein the one or more processors are configured to extract audio features of the first and second segments using outputs of one or more hidden layers of the deep neural network as inputs to the second neural network.

14. The system of claim 13, wherein the deep neural network comprises a recurrent neural network, and the second neural network comprises a convolutional neural network.

15. The system of claim 13, wherein the one or more processors are further configured to: using an autoencoder neural network to reduce a number of outputs from the first neural network by reducing a number of inputs to the first neural network to reduce overfitting.

16. The system of claim 10, wherein the second neural network is trained to predict a word error rate (WER) of a plurality of transcription engines based at least on audio features extracted from each segment.

17. A method for transcribing an audio file, the method comprising: using an audio file as inputs to a deep neural network trained to perform speech recognition; and using outputs of one or more hidden layers of the deep neural network as inputs to a second neural network that is trained to identify a first transcription engine having a highest predicted transcription accuracy among a group of transcription engines for the audio file based at least on the outputs of the one or more hidden layers of the deep neural network.

18. The method of claim 17, wherein the second neural network is trained to predict a word error rate (WER) of the group of transcription engines based at least on outputs of the one or more hidden layers of the deep neural network and on characteristics of each respective engine of the group of transcription engines, and wherein an engine with a lowest WER is the engine with the highest predicted transcription accuracy.

19. The method of claim 17, wherein the deep and second neural networks comprise a recurrent neural network and a convolutional neural network, respectively.

20. The method of claim 17, wherein using outputs of one or more hidden layers comprises using outputs of a first and last layer of the hidden layers of the deep neural network.



As shown above, claims 1, 4 – 8, 10, 12 – 14 and 16 – 20 of Application no. 16/243,033 recite the limitations of claims 1 – 4, 7, 10 – 13 and 16 of the currently pending application, except for actually training the second neural network. However, Ali discloses the process of training a second neural network to predict WER based on outputs from a first neural network as discussed in the prior art rejection below. Therefore, it would have been obvious to one of ordinary skill in the art at the time of applicant’s filing to modify the recited limitations of claims 1, 4 – 8, 10, 12 – 14 and 16 – 20 of Application no. 16/243,033 with Ali so that the second neural network is trained for the purpose of predicting word error rates of a plurality of transcription engines to select the best engine for a given file. Therefore, claims 1 – 4, 7, 10 – 13 and 16 of the currently pending application and claims 1, 4 – 8, 10, 12 – 14 and 16 – 20 of Application no. 16/243,033 are obvious variants.
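
For illustration, the engine-selection step referenced above (choosing, for each segment, the transcription engine with the lowest predicted WER) can be sketched in Python as follows. This is a minimal sketch only: the engine names and predicted-WER values are hypothetical and are not taken from the reference claims or from Ali.

```python
def pick_best_engines(predicted_wer_per_segment):
    """For each segment, return the engine with the lowest predicted WER.

    predicted_wer_per_segment: list of dicts, one per segment, mapping a
    (hypothetical) engine name to its predicted word error rate.
    """
    return [min(scores, key=scores.get) for scores in predicted_wer_per_segment]


# Hypothetical predictions for two segments and two engines.
predicted = [
    {"engine_a": 0.12, "engine_b": 0.30},
    {"engine_a": 0.41, "engine_b": 0.18},
]
print(pick_best_engines(predicted))  # ['engine_a', 'engine_b']
```

The lowest predicted WER is treated here as the highest predicted transcription accuracy, consistent with the wording of reference claims 9 and 18.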

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim(s) 1 – 4, 10 – 13, 19 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Ali (“Word Error Rate Estimation for Speech Recognition: e-WER”) (“Ali”) in view of Sak et al. (US 2018/0053500) (“Sak”) and further in view of Nir (US 2018/0047387).
For claim 1, Ali discloses a method for training a neural network to transcribe a media file (Abstract), the method comprising: training a neural network to generate a predicted-WER (word error rate) of a plurality of transcription engines (LVCSR system and grapheme-sequence based system, 2 e-WER Framework, pg. 21) for media based on outputs from a transcription engine (character sequence extracted from grapheme recognition, 2.1 e-WER features, 2.2 Classification Back-End, 3 Experiments, pg. 21). Yet, Ali fails to teach the following: segmenting a media file into a plurality of segments; inputting each segment, one segment at a time, of the plurality of segments into a first neural network trained to perform speech recognition; extracting outputs, one segment at a time, from one or more layers of the first neural network; and training the neural network for each segment based at least on outputs from the one or more layers of the first neural network.
However, Sak discloses a multi-accent speech recognition system (Abstract), wherein acoustic sequences are input into a neural network (hierarchical recurrent neural network) trained to perform speech recognition ([0035 – 0039] [0045] [0046]); and outputs (graphemes) are extracted from one or more layers of the neural network ([0039] [0040]).
Additionally, Nir discloses a system and method for performing accurate speech translation (Abstract), wherein an audio file is divided into equal segments ([0014] [0090]). Furthermore, the segments are provided, one at a time, to an ASR module (Fig. 3, 30; [0094] [0095]), which generates output, one segment at a time (Fig. 3 and Fig. 4; [0094] [0095] [0103] [0104]).
Therefore, it would have been obvious to one of ordinary skill in the art at the time of applicant’s filing to modify Ali’s teachings with Sak’s teachings so that the outputs from the transcription engine which are used to train the neural network are extracted from one or more layers of a neural network which performs speech recognition, for the purpose of improving the efficiency of speech recognition using a model which eliminates the need for a separate acoustic model and language model.
Additionally, it would have been obvious to one of ordinary skill in the art at the time of applicant’s filing to improve the combined teachings of Ali and Sak in the same way that Nir’s teachings have been improved to achieve the following predictable results, for the purpose of improving the efficiency of speech recognition using a model which eliminates the need for a separate acoustic model and language model: an audio file is divided into a plurality of equal segments; the segments are provided, one at a time, to the neural network trained to perform speech recognition; and the outputs are extracted, one segment at a time.
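
For illustration, the combined arrangement discussed for claim 1 (segments fed one at a time to a first network, layer outputs extracted, and a second model trained to predict a WER per transcription engine) can be sketched as follows. This is a simplified sketch under stated assumptions, not the method of the application or of any cited reference: `first_network_layer_outputs` is a hypothetical stand-in that returns two summary statistics rather than real hidden-layer activations, and a single linear layer trained by stochastic gradient descent stands in for the second neural network.

```python
def first_network_layer_outputs(segment):
    """Hypothetical stand-in for layer outputs of a speech-recognition network.

    Returns two summary statistics of the samples instead of real activations.
    """
    mean = sum(segment) / len(segment)
    energy = sum(x * x for x in segment) / len(segment)
    return [mean, energy]


def train_wer_predictor(segments, target_wers, epochs=500, lr=0.3):
    """Fit one linear output per engine (SGD on squared error).

    target_wers[i][e] is the measured WER of engine e on segment i.
    """
    # Extract features one segment at a time.
    feats = [first_network_layer_outputs(s) for s in segments]
    n_engines = len(target_wers[0])
    n_feats = len(feats[0])
    weights = [[0.0] * n_feats for _ in range(n_engines)]
    biases = [0.0] * n_engines
    for _ in range(epochs):
        for x, y in zip(feats, target_wers):
            for e in range(n_engines):
                pred = biases[e] + sum(w * xi for w, xi in zip(weights[e], x))
                err = pred - y[e]
                biases[e] -= lr * err
                weights[e] = [w - lr * err * xi for w, xi in zip(weights[e], x)]
    return weights, biases


def predict_wer(model, segment):
    """Return the predicted WER of every engine for one segment."""
    weights, biases = model
    x = first_network_layer_outputs(segment)
    return [b + sum(w * xi for w, xi in zip(ws, x)) for ws, b in zip(weights, biases)]


# Hypothetical toy data: three short segments, measured WERs for two engines.
segments = [[0.0, 0.0], [1.0, 1.0], [1.0, -1.0]]
measured = [[0.10, 0.30], [0.25, 0.20], [0.40, 0.15]]
model = train_wer_predictor(segments, measured)
print(predict_wer(model, segments[1]))
```

In practice the second model would be a multi-layer network and the features would come from the first network's actual layers; the linear model is used here only to keep the sketch short.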

For claim 10, Ali discloses a system for training a neural network to transcribe a media file (Abstract), the system comprising: components to train a neural network to generate a predicted-WER (word error rate) of a plurality of transcription engines (LVCSR system and grapheme-sequence based system, 2 e-WER Framework, pg. 21) for media based on outputs from a transcription engine (character sequence extracted from grapheme recognition, 2.1 e-WER features, 2.2 Classification Back-End, 3 Experiments, pg. 21). Yet, Ali fails to teach the following: the system further comprises a memory and one or more processors coupled to the memory, the one or more processors configured to segment a media file into a plurality of segments; input each segment, one segment at a time, of the plurality of segments into a first neural network trained to perform speech recognition; extract outputs, one segment at a time, from one or more layers of the first neural network; and train the neural network for each segment based at least on outputs from the one or more layers of the first neural network.
However, Sak discloses a multi-accent speech recognition system (Abstract), wherein acoustic sequences are input into a neural network (hierarchical recurrent neural network) trained to perform speech recognition ([0035 – 0039] [0045] [0046]); outputs (graphemes) are extracted from one or more layers of the neural network ([0039] [0040]); and a system comprises a memory and one or more processors coupled to the memory, the one or more processors configured to perform functions related to speech recognition ([0070] [0071]).
Additionally, Nir discloses a system and method for performing accurate speech translation (Abstract), wherein an audio file is divided into equal segments ([0014] [0090]). Furthermore, the segments are provided, one at a time, to an ASR module (Fig. 3, 30; [0094] [0095]), which generates output, one segment at a time (Fig. 3 and Fig. 4; [0094] [0095] [0103] [0104]).
Therefore, it would have been obvious to one of ordinary skill in the art at the time of applicant’s filing to modify Ali’s teachings with Sak’s teachings so that the system comprises a memory and one or more processors coupled to the memory, the one or more processors being configured to perform functions related to speech recognition, including extracting outputs from one or more layers of a neural network which performs speech recognition to further train the neural network which generates a predicted-WER, for the purpose of improving the efficiency of speech recognition using a model which eliminates the need for a separate acoustic model and language model.
Additionally, it would have been obvious to one of ordinary skill in the art at the time of applicant’s filing to improve the combined teachings of Ali and Sak in the same way that Nir’s teachings have been improved to achieve the following predictable results, for the purpose of improving the efficiency of speech recognition using a model which eliminates the need for a separate acoustic model and language model: an audio file is divided into a plurality of equal segments; the segments are provided, one at a time, to the neural network trained to perform speech recognition; and the outputs are extracted, one segment at a time.

For claims 2 and 11, Ali, Sak and Nir further disclose wherein training the second neural network to generate a predicted-WER of the plurality of transcription engines further comprises: transcribing each segment using the plurality of transcription engines to generate a transcription of each segment (Ali, 2.1 e-WER features, pg. 22) (Sak, [0035 – 0040] [0045] [0046]) (Nir, [0090] [0094] [0095]); generating a WER of each transcription engine for each segment based at least on ground truth data and the transcription of each segment (Ali, 2.1 e-WER features, 2.2 Classification Back-end and 3. Experiments and discussions, pg. 22); and training the second neural network to learn relationships between the generated WER of each transcription engine and outputs from the one or more layers of the first neural network for each segment (Ali, 2.1 e-WER features, 2.2 Classification Back-end and 3. Experiments and discussions, pg. 22).
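
For illustration, generating a WER from ground truth data and a transcription is conventionally done by taking the word-level edit distance (substitutions, deletions and insertions) and dividing by the number of reference words. The following minimal Python sketch shows that standard computation; the example sentences are hypothetical, and the sketch is not asserted to be the exact procedure used in Ali.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of words in the reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


ground_truth = "the cat sat on the mat"
print(word_error_rate(ground_truth, "the cat sat on mat"))    # one deletion over six words
print(word_error_rate(ground_truth, "a cat sat on the hat"))  # two substitutions over six words
```

Running this once per transcription engine and per segment yields the per-engine, per-segment WER values that serve as training targets.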

For claims 3 and 12, Ali and Sak further disclose wherein the first neural network comprises a deep neural network (Ali, 2. E-WER Framework and 2.1 e-WER features, pg. 21 and 22) (Sak, the HRNN includes a large number of recurrent neural networks, [0037] [0038]).
For claims 4 and 13, Ali and Sak further disclose wherein the deep neural network comprises a recurrent neural network (Ali, 2. E-WER Framework and 2.1 e-WER features, pg. 21 and 22) (Sak, the HRNN includes a large number of recurrent neural networks, [0037] [0038]), and the second neural network comprises a convolutional neural network (Ali, CNN can be used to estimate word error rate, 4. Conclusions, pg. 23).
For claim 19, Nir further discloses wherein the media file is segmented into segments having a duration ranging between 2 to 10 seconds (Nir, [0090]).
For claim 20, Nir further discloses wherein each segment comprises a 5-second segment (Nir, [0090]).
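
For illustration, dividing audio into fixed-duration segments such as the 5-second segments recited in claim 20 can be sketched as below. The function name, the sample rate and the handling of a shorter final segment are assumptions made for this sketch and are not drawn from Nir.

```python
def segment_samples(samples, sample_rate, segment_seconds=5.0):
    """Split a list of audio samples into consecutive fixed-duration segments.

    The final segment is kept even if it is shorter than segment_seconds
    (an assumption of this sketch).
    """
    if sample_rate <= 0 or segment_seconds <= 0:
        raise ValueError("sample_rate and segment_seconds must be positive")
    size = max(1, int(sample_rate * segment_seconds))
    return [samples[i:i + size] for i in range(0, len(samples), size)]


# Hypothetical 12-second recording at 10 samples per second.
audio = [0.0] * 120
segments = segment_samples(audio, sample_rate=10)
print([len(s) for s in segments])  # [50, 50, 20]
```

Each returned segment could then be supplied, one at a time, to the speech-recognition network.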

Claims 5 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Ali (“Word Error Rate Estimation for Speech Recognition: e-WER”) (“Ali”) in view of Sak et al. (US 2018/0053500) (“Sak”), and further in view of Nir (US 2018/0047387) and further in view of Agranonik et al. (US 10,581,888) (“Agranonik”).
For claims 5 and 14, the combination of Ali, Sak and Nir fails to teach wherein the convolutional neural network comprises two hidden layers and a pooling layer in between the two hidden layers.
However, Agranonik discloses a classification system and method using convolution neural networks (Abstract; Fig. 3; column 2 lines 33 – 45), wherein the convolutional neural network comprises two hidden layers and a pooling layer in between the two hidden layers (Fig. 3; column 7 lines 12 – 28 and column 10 lines 27 – 66).
Therefore, it would have been obvious to one of ordinary skill in the art at the time of applicant’s invention to improve the invention disclosed by the combination of Ali, Sak and Nir in the same way that Agranonik’s invention has been improved so that the convolution neural network comprises two hidden layers and a pooling layer in between the two hidden layers for the purpose of estimating the quality of an automatically generated transcription without requiring a gold-standard, manually transcribed reference (Ali, Abstract and 1. Introduction, pg. 20).
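
For illustration, the recited layer ordering (a hidden convolutional layer, a pooling layer, then a second hidden convolutional layer) can be sketched in one dimension as follows. The kernels, the ReLU activation and the pooling width are hypothetical choices made for this sketch and are not taken from Agranonik or from the application.

```python
def conv1d(signal, kernel):
    """Valid 1-D cross-correlation followed by a ReLU activation."""
    k = len(kernel)
    return [
        max(0.0, sum(signal[i + j] * kernel[j] for j in range(k)))
        for i in range(len(signal) - k + 1)
    ]


def max_pool1d(signal, size=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(signal[i:i + size]) for i in range(0, len(signal) - size + 1, size)]


def tiny_cnn(signal, kernel_1, kernel_2):
    hidden_1 = conv1d(signal, kernel_1)   # first hidden layer
    pooled = max_pool1d(hidden_1)         # pooling layer in between
    hidden_2 = conv1d(pooled, kernel_2)   # second hidden layer
    return hidden_2


features = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
print(tiny_cnn(features, kernel_1=[0.5, 0.5], kernel_2=[-1.0, 1.0]))  # [2.0, 2.0]
```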

Claims 6, 7, 15 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Ali (“Word Error Rate Estimation for Speech Recognition: e-WER”) (“Ali”) in view of Sak et al. (US 2018/0053500) (“Sak”), and further in view of Nir (US 2018/0047387), and further in view of Tang et al. (US 2015/0161994) (“Tang”) and further in view of Zhang et al. (US 2019/0341058) (“Zhang”).
For claims 6 and 15, the combination of Ali, Sak and Nir fails to teach wherein extracting outputs from one or more layers of the first neural network comprises extracting outputs from a last hidden layer of the deep neural network.
However, Tang discloses a method and apparatus for speech recognition (Abstract), wherein bottleneck features are extracted from a hidden layer as speech features used for speech recognition ([0013] [0023 – 0025]).
Additionally, Zhang discloses a neural network for speaker recognition (Abstract), wherein bottleneck features are extracted from any hidden layer, including a last hidden layer of a neural network ([0076] [0077]).
Therefore, it would have been obvious to one of ordinary skill in the art at the time of applicant’s invention to improve the invention disclosed by the combination of Ali, Sak and Nir in the same way that Tang’s invention has been improved to achieve the predictable results of extracting outputs from a hidden layer (bottleneck layer) of the first neural network as speech features used to train the second neural network (Ali, numerical features, basic features about the speech signal, 2.1 e-WER features) for the purpose of estimating WER which does not require a gold-standard transcription of a test set (Ali, Abstract).
Additionally, it would have been obvious to one of ordinary skill in the art at the time of applicant’s invention to improve the invention disclosed by the combination of Ali, Sak, Nir and Tang in the same way that Zhang’s invention has been improved to achieve the predictable results of further extracting outputs (bottleneck features) from any hidden layer, including a last hidden layer, of the first neural network as speech features used to train the second neural network (Ali, numerical features, basic features about the speech signal, 2.1 e-WER features) for the purpose of estimating WER which does not require a gold-standard transcription of a test set (Ali, Abstract).

For claims 7 and 16, the combination of Ali, Sak and Nir fails to teach wherein extracting outputs from one or more layers of the first neural network comprises extracting outputs from a first and last hidden layer of the deep neural network.
However, Tang discloses a method and apparatus for speech recognition (Abstract), wherein bottleneck features are extracted from a hidden layer as speech features used for speech recognition ([0013] [0023 – 0025]).
Additionally, Zhang discloses a neural network for speaker recognition (Abstract), wherein bottleneck features are extracted from any hidden layer, including a first and last hidden layer of a neural network ([0076] [0077]).
Therefore, it would have been obvious to one of ordinary skill in the art at the time of applicant’s invention to improve the invention disclosed by the combination of Ali, Sak and Nir in the same way that Tang’s invention has been improved to achieve the predictable results of extracting outputs from a hidden layer (bottleneck layer) of the first neural network as speech features used to train the second neural network (Ali, numerical features, basic features about the speech signal, 2.1 e-WER features) for the purpose of estimating WER which does not require a gold-standard transcription of a test set (Ali, Abstract).
Additionally, it would have been obvious to one of ordinary skill in the art at the time of applicant’s invention to improve the invention disclosed by the combination of Ali, Sak, Nir and Tang in the same way that Zhang’s invention has been improved to achieve the predictable results of further extracting outputs (bottleneck features) from any hidden layer, including a first and last hidden layer, of the first neural network as speech features used to train the second neural network (Ali, numerical features, basic features about the speech signal, 2.1 e-WER features) for the purpose of estimating WER which does not require a gold-standard transcription of a test set (Ali, Abstract).
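
For illustration, extracting outputs from the first and last hidden layers of a network and using them together as features can be sketched as follows. The small dense network, its weights and the tanh activation are hypothetical; the sketch does not reproduce the networks of Tang, Zhang or the application.

```python
import math


def forward_with_hidden(x, layers):
    """Run x through dense tanh layers and return every hidden layer's output.

    layers: list of (weights, biases) pairs, where weights is a list of rows.
    """
    hidden = []
    for weights, biases in layers:
        x = [
            math.tanh(b + sum(w * xi for w, xi in zip(row, x)))
            for row, b in zip(weights, biases)
        ]
        hidden.append(x)
    return hidden


def first_and_last_features(x, layers):
    """Concatenate the outputs of the first and last hidden layers."""
    hidden = forward_with_hidden(x, layers)
    return hidden[0] + hidden[-1]


# Hypothetical network: 2 inputs -> 3 units -> 2 units -> 2 units.
layers = [
    ([[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4]], [0.0, 0.1, -0.1]),
    ([[0.5, -0.5, 0.2], [0.1, 0.1, 0.1]], [0.0, 0.0]),
    ([[1.0, -1.0], [0.5, 0.5]], [0.1, -0.1]),
]
features = first_and_last_features([0.7, -0.3], layers)
print(len(features))  # 5 (3 from the first hidden layer, 2 from the last)
```

Using only `hidden[-1]` instead would correspond to the last-hidden-layer variant addressed for claims 6 and 15.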

Claims 8 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Ali (“Word Error Rate Estimation for Speech Recognition: e-WER”) (“Ali”) in view of Sak et al. (US 2018/0053500) (“Sak”), and further in view of Nir (US 2018/0047387), and further in view of Min (US 2017/0018270) and further in view of Nguyen et al. (US 2018/0322394) (“Nguyen”).
For claims 8 and 17, the combination of Ali, Sak and Nir fails to teach wherein the one or more processors are further configured to use an autoencoder neural network to reduce a number of input features from each segment such that a number of outputs from the first neural network are reduced.
However, Min discloses a speech recognition system and method (Abstract), wherein a feature vector of speech data used to perform speech recognition is generated by an autoencoder ([0009 – 0013] [0048] [0049]).
Additionally, Nguyen discloses a system and method for analyzing sequence data using neural networks (Abstract), wherein an autoencoder generates a feature vector representation of an input sequence which reduces a number of input features of an input sequence (the neural network 200 is an autoencoder which receives an input sequence and generates a feature vector representation that is a compressed version of the input sequence, [0031] [0032] [0038] [0040]).
Therefore, it would have been obvious to one of ordinary skill in the art at the time of applicant’s filing to modify the combined teachings of Ali, Sak and Nir with Min’s teachings so that the feature vectors are further extracted from the media segments using an autoencoder and provided as input to the first neural network (speech recognition model) for the purpose of improving the efficiency of speech recognition using a model which eliminates the need for a separate acoustic model and language model.
Additionally, it would have been obvious to one of ordinary skill in the art at the time of applicant’s invention to improve the invention disclosed by the combination of Ali, Sak, Nir and Min in the same way that Nguyen’s invention has been improved to achieve the predictable results of the feature vectors generated by the autoencoder further comprising feature representations which are a reduced number of input features from each segment such that a number of outputs from the first neural network are reduced for the purpose of improving the efficiency and resource usage of speech recognition using a model which eliminates the need for a separate acoustic model and language model.

Claims 9 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Ali (“Word Error Rate Estimation for Speech Recognition: e-WER”) (“Ali”) in view of Sak et al. (US 2018/0053500) (“Sak”), and further in view of Nir (US 2018/0047387), and further in view of Chiu et al. (US 11,210,475) (“Chiu”), and further in view of Nguyen et al. (US 2018/0322394) (“Nguyen”), and further in view of Perez et al. (US 2018/0203848) (“Perez”).
For claims 9 and 18, the combination of Ali, Sak, Nir, Chiu and Nguyen fails to teach wherein the autoencoder comprises approximately 256 channels.
However, Perez discloses a system and method for the purpose of generating embeddings (Abstract), wherein an RNN-based model comprises approximately 256 channels (each hidden state may be a vector of at least 256 dimensions, [0060 – 0062]).
Therefore, it would have been obvious to one of ordinary skill in the art at the time of applicant’s invention to modify the combined teachings of Ali, Sak, Nir and Nguyen with Perez’s teachings so that the autoencoder (Nguyen, [0038] [0040]) comprises approximately 256 channels for the purpose of efficiently analyzing the sequence data used as input into the system, wherein the use of the feature vector enables efficient comparison of two sequences that may have different numbers of elements (Nguyen, [0002 – 0004] [0040]).


Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SONIA L GAY whose telephone number is (571)270-1951. The examiner can normally be reached Monday-Friday 9-5 ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel Washburn, can be reached at 571-272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/SONIA L GAY/
Primary Examiner, Art Unit 2657