DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This Office Action is in response to the submission filed December 16, 2022.  Claims 1-20 are pending.

Information Disclosure Statement
The information disclosure statements (IDS) submitted on December 16, 2022 and June 14, 2022 are being considered by the examiner.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1 and 11 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more. The claims recite limitations for generating augmented sets of training data. The feature of storing a set of audio training data that includes a plurality of audio segments and metadata indicating a word or phrase associated with each audio segment is a data gathering and data organization step that can be achieved by a human mentally remembering spoken phrases and metadata associated with the phrases; the step of receiving a set of structured text data that includes one or more target training statements that each include a plurality of text segments comprising a word or phrase is a data gathering step that can be achieved by the human reading a corpus; for a target training statement of the set of structured text data, generating a concatenated audio signal that matches a word content of the target training statement by: comparing the words or phrases of the plurality of text segments of the target training statement to respective words or phrases of audio segments of the stored set of audio training data can be achieved by the human mentally comparing text in the corpus to the remembered phrases to identify any matches; selecting a plurality of audio segments from the set of audio training data based on a match in the words or phrases between the plurality of text segments of the target training statement and the selected plurality of audio segments can be achieved by the human identifying the matches and, using pen and paper, making a list of the matches using the text and phrase ID; concatenating the selected plurality of audio segments into the concatenated audio signal can be achieved by the human speaking the selected segments as a combined phrase, remembering the combined phrase, and associating the combined phrase with a new ID; and generating an augmented set of training data that includes the set of structured text data paired with respective concatenated audio signals can be achieved by the human, using pen and paper, creating a new list that links the text with the combined phrase ID. The recited limitations are directed to a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of the generic computing device and processor. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claims recite an abstract idea.
This judicial exception is not integrated into a practical application because the recited generic computing device and processor amount to no more than mere instructions to apply the exception using generic computer components. Accordingly, the elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claims are directed to an abstract idea. The claims are not patent eligible.
The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception because, as indicated with respect to integration of the abstract idea into a practical application, the additional elements of the computing device and processor to perform the various steps amount to no more than mere instructions to apply the exception using generic computer components. Mere instructions to apply an exception using generic computer components cannot provide an inventive concept. The claims are not patent eligible.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1 and 11 are rejected under 35 U.S.C. 102(a)(1)/(a)(2) as being anticipated by Weber et al. (US Patent No. 9,495,955), hereinafter Weber.
Weber discloses a system and method for acoustic model training. Regarding claims 1 and 11, Weber discloses a computer system and method [Figs. 1, 2, 5; col. 3, lines 36-46; col. 4, lines 5-37 and 60-64; col. 5, line 66 continuing to col. 6, line 2; col. 9, lines 14-35; col. 10, lines 7-51] comprising a processor of a computer device [col. 10, lines 7-51 – processing unit 502] configured for storing a set of audio training data that includes a plurality of audio segments and metadata indicating a word or phrase associated with each audio segment [corpus 105; col. 2, lines 10-13 -- obtaining training data from a pre-existing corpus of audio data and corresponding transcription data (such as audiobooks or movie soundtrack/script combinations); col. 3, lines 11-23 -- corpus data may include metadata; col. 4, lines 3-37 and 60-64; col. 5, line 66 to col. 6, line 2]; receiving a set of structured text data that includes one or more target training statements that each include a plurality of text segments comprising a word or phrase [col. 9, lines 6-18 – script of utterances desired to be trained]; for a target training statement of the set of structured text data, generating a concatenated audio signal that matches a word content of the target training statement [col. 9, lines 1-47 -- the utterance selection module 115 or another component executing the acoustic model training process 400 may isolate and concatenate the two matching fragments from the two separate phrases together to create a synthesized piece of training data] by: comparing the words or phrases of the plurality of text segments of the target training statement to respective words or phrases of audio segments of the stored set of audio training data [col. 9, lines 1-47 -- matching fragments or segments of the phrase may be identified from one or more sources in the corpus 105. For example, one source in the corpus 105 may include the phrase “I never received your phone call” and another source in the corpus 105 may include the phrase “I saw you open my package.” The two phrases do not include the complete desired training phrase. However, each of the two phrases includes a matching fragment of the desired training phrase (“I never received” and “my package,” respectively)]; selecting a plurality of audio segments from the set of audio training data based on a match in the words or phrases between the plurality of text segments of the target training statement and the selected plurality of audio segments [col. 9, lines 1-47 -- the utterance selection module 115 or another component executing the acoustic model training process 400 may isolate and concatenate the two matching fragments from the two separate phrases together to create a synthesized piece of training data]; concatenating the selected plurality of audio segments into the concatenated audio signal [col. 9, lines 1-47 – selection module isolates and concatenates the two matching fragments]; and generating an augmented set of training data that includes the set of structured text data paired with respective concatenated audio signals [col. 9, lines 1-47 – the utterance selection module 115 or another component executing the acoustic model training process 400 may isolate and concatenate the two matching fragments from the two separate phrases together to create a synthesized piece of training data -- where the structured text data “I never received my package” has the accompanying audio training data generated from the concatenation of the two matching fragments].
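For illustration only, the compare, select, concatenate, and pair steps mapped above can be sketched as follows. This is a minimal sketch under stated assumptions and is not code taken from Weber or from the application: the names AudioSegment, select_segments, concatenate, and augment are hypothetical, audio is represented by short lists of placeholder sample values, and the greedy longest-match strategy is an assumed choice that neither document is asserted to require.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class AudioSegment:
    """Hypothetical stored segment: metadata text plus placeholder samples."""
    text: str             # word or phrase associated with the audio segment
    samples: List[float]  # stand-in for real waveform data


def select_segments(target: str, corpus: List[AudioSegment]) -> Optional[List[AudioSegment]]:
    """Compare the target statement's words to segment metadata and select
    matching segments (greedy longest match). Returns None if the target
    cannot be fully covered by stored segments."""
    words = target.lower().split()
    # Index segments by their word sequence; the first segment seen wins.
    index = {}
    for seg in corpus:
        index.setdefault(tuple(seg.text.lower().split()), seg)
    chosen = []
    i = 0
    while i < len(words):
        # Try the longest remaining span first, then shorter ones.
        for j in range(len(words), i, -1):
            seg = index.get(tuple(words[i:j]))
            if seg is not None:
                chosen.append(seg)
                i = j
                break
        else:
            return None  # no stored segment starts at word i
    return chosen


def concatenate(segments: List[AudioSegment]) -> List[float]:
    """Join the selected segments' samples into one concatenated signal."""
    signal: List[float] = []
    for seg in segments:
        signal.extend(seg.samples)
    return signal


def augment(targets: List[str], corpus: List[AudioSegment]) -> List[Tuple[str, List[float]]]:
    """Pair each coverable target statement with its concatenated signal."""
    pairs = []
    for target in targets:
        segments = select_segments(target, corpus)
        if segments is not None:
            pairs.append((target, concatenate(segments)))
    return pairs


# Fragments corresponding to the example phrases quoted from col. 9.
corpus = [
    AudioSegment("I never received", [0.1, 0.2, 0.3]),
    AudioSegment("my package", [0.4, 0.5]),
    AudioSegment("your phone call", [0.6]),
]
augmented = augment(["I never received my package"], corpus)
```

In this sketch the target statement “I never received my package” is covered by the two stored fragments, so the augmented set holds one text and signal pair; a target with no matching fragment for some word is skipped rather than partially synthesized.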

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 2-10 and 12-20 are rejected under 35 U.S.C. 103 as being unpatentable over Weber in view of Datta et al. (US Patent Application Publication No. 2021/0233510), hereinafter Datta.
Regarding claims 2, 12, and 19, Weber teaches a computer device [col. 10, lines 7-51] comprising: a processor [col. 10, lines 7-51 – processing unit 502] configured to: determine a set of structured text data for training an artificial intelligence model that is used by an automatic speech recognition application [col. 2, line 19], the set of structured text data including one or more target training statements that each include a plurality of text segments comprising a word or phrase [corpus 105; col. 2, lines 10-13 -- obtaining training data from a pre-existing corpus of audio data and corresponding transcription data (such as audiobooks or movie soundtrack/script combinations); col. 3, lines 11-23 -- corpus data may include metadata; col. 4, lines 3-37 and 60-64; col. 5, line 66 to col. 6, line 2; col. 9, lines 6-18 – script of utterances desired to be trained]; send the set of structured text data to a server device [col. 10, lines 63-65 – servers for distributed computing environment] to cause the server device to generate an augmented set of training data that includes the set of structured text data paired with respective concatenated audio signals [col. 9, lines 1-47 – the utterance selection module 115 or another component executing the acoustic model training process 400 may isolate and concatenate the two matching fragments from the two separate phrases together to create a synthesized piece of training data -- where the structured text data “I never received my package” has the accompanying audio training data generated from the concatenation of the two matching fragments], wherein a concatenated audio signal that matches a word content of a target training statement of the set of structured text data is generated by: comparing the words or phrases of the plurality of text segments of the target training statement to respective words or phrases of audio segments of a stored set of audio training data that includes audio segments and metadata indicating a word or phrase associated with each audio segment [col. 9, lines 1-47 -- matching fragments or segments of the phrase may be identified from one or more sources in the corpus 105. For example, one source in the corpus 105 may include the phrase “I never received your phone call” and another source in the corpus 105 may include the phrase “I saw you open my package.” The two phrases do not include the complete desired training phrase. However, each of the two phrases includes a matching fragment of the desired training phrase (“I never received” and “my package,” respectively)]; selecting a plurality of audio segments from the set of audio training data based on a match in the words or phrases between the plurality of text segments of the target training statement and the selected plurality of audio segments [col. 9, lines 1-47 -- the utterance selection module 115 or another component executing the acoustic model training process 400 may isolate and concatenate the two matching fragments from the two separate phrases together to create a synthesized piece of training data]; and concatenating the selected plurality of audio segments into the concatenated audio signal [col. 9, lines 1-47 – selection module isolates and concatenates the two matching fragments].
Weber fails to teach that the training is implemented within an end-to-end artificial intelligence model. In a similar field of endeavor, Datta teaches obtaining a plurality of training data sets for an end-to-end speech recognition model [Abstract; para 0007; para 0011; para 0036] and specifically teaches that end-to-end (E2E) models have shown great promise for ASR, exhibiting improved word error rates (WERs) and latency metrics as compared to conventional on-device ASR systems [para 0005]. One having ordinary skill in the art before the effective filing date of the claimed invention would have recognized the advantages of implementing the end-to-end speech recognition model of Datta with the recognition training system of Weber, for the purpose of providing a speech recognition system with improved WERs and latency, as suggested by Datta.
Regarding claims 3 and 13, the combination of Weber and Datta teaches the set of audio training data is for one or more domains, and the set of structured text data is for a target domain that is different than the one or more domains of the set of audio training data [Weber’s domain characteristics -- col. 2, lines 31-47; col. 3, lines 32-35; col. 4, lines 12-16 – multiple corpus sources; col. 4, lines 51-54].
Regarding claims 4 and 14, the combination of Weber and Datta teaches the set of audio training data and the set of structured text data are for a same domain [Weber’s domain characteristics -- col. 2, lines 31-47; col. 3, lines 32-35; col. 4, lines 12-16 – multiple corpus sources; col. 4, lines 51-54 – where utilizing audio and text data from the same domain is an obvious step so as to ensure the keywords and phrases associated with the domain can be detected within the data sets so as to generate the augmented training data].
Regarding claims 5 and 15, the combination of Weber and Datta teaches the set of audio data includes a plurality of subsets of audio data for different acoustic parameters, wherein the plurality of subsets of audio data include a plurality of audio segments that are associated with a same word or phrase and different acoustic parameters [Weber’s characteristics determination – col. 2, lines 31-47 – pitch/intonation/accent/inflection/prosody characteristics; col. 4, lines 38-59; col. 8, lines 14-32 -- utterance selection module 115 identifies utterances or portions of utterances in the corpus data having the one or more desired training characteristics].
Regarding claim 6, the combination of Weber and Datta teaches the different acoustic parameters are selected from the group consisting of a background noise parameter, an audio quality parameter, and a speech accent parameter [Weber’s characteristics determination – col. 2, lines 31-47 – pitch/intonation/accent/inflection/prosody characteristics…noise environment characteristics; col. 4, lines 38-59; col. 8, lines 14-32 -- utterance selection module 115 identifies utterances or portions of utterances in the corpus data having the one or more desired training characteristics].
Regarding claims 7 and 16, the combination of Weber and Datta teaches the plurality of subsets of audio data are recorded from a plurality of different speakers [Weber col. 3, lines 16-23 – corpus of movie soundtrack of different cast members speaking various lines; col. 9, lines 36-38].
Regarding claim 8, the combination of Weber and Datta teaches training the end-to-end artificial intelligence model [Datta’s E2E model training – Fig. 2, training process 200] using the generated augmented set of training data that includes concatenated audio signals comprising concatenated audio segments recorded from the plurality of different speakers [Weber – col. 3, lines 16-23 – corpus of movie soundtrack of different cast members speaking various lines -- where the data from the multiple speakers provide different acoustic parameters; col. 8, lines 40-44 -- Once the data from the corpus 105 having the desired training characteristic is selected or identified, the process 400 proceeds to generate an acoustic model or update an existing acoustic model using the identified corpus data; col. 9, lines 1-47 -- the utterance selection module 115 or another component executing the acoustic model training process 400 may isolate and concatenate the two matching fragments from the two separate phrases together to create a synthesized piece of training data; col. 9, lines 36-38].
Regarding claims 9 and 17, the combination of Weber and Datta teaches generating the concatenated audio signal that matches the word or phrase of the target training statement by selecting the plurality of audio segments from the set of audio training data to include at least two audio segments that are selected from different subsets of audio data for different acoustic parameters [Weber – col. 3, lines 16-23 – corpus of movie soundtrack of different cast members speaking various lines -- where the data from the multiple speakers provide different acoustic parameters; col. 8, lines 40-44 -- Once the data from the corpus 105 having the desired training characteristic is selected or identified, the process 400 proceeds to generate an acoustic model or update an existing acoustic model using the identified corpus data; col. 9, lines 1-47 -- the utterance selection module 115 or another component executing the acoustic model training process 400 may isolate and concatenate the two matching fragments from the two separate phrases together to create a synthesized piece of training data; col. 9, lines 36-38 -- different sources/different speakers; Datta’s multiple sets of data from different speakers of different languages – para 0036].
Regarding claims 10 and 18, the combination of Weber and Datta teaches selecting the plurality of audio segments from the set of audio training data based on a distribution parameter that biases the selection for a target acoustic parameter [Weber col. 6, lines 11-41 – list of desired characteristics… the characteristic determination module 110 identifies data from the corpus 105 having one or more characteristics from the received list].
Regarding claim 20, the combination of Weber and Datta teaches receiving an updated end-to-end artificial intelligence model [Datta’s E2E model training – Fig. 2, training process 200] that has been trained using the augmented set of training data [Weber col. 5, lines 25-29 -- acoustic model generator 120 may be configured to generate one or more acoustic models 130 or adapt an existing acoustic model based on the utterances from the corpus 105 identified as having the one or more desired characteristics 125; col. 9, line 66 to col. 10, line 5]; and causing the automatic speech recognition application to execute using the updated end-to-end artificial intelligence model [Weber col. 9, line 66 to col. 10, line 5 -- specialized acoustic model and the generic acoustic model may be packaged together and used in combination to perform speech recognition].


Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Thomson et al. (US Patent Application Publication No. 2020/0175961) teaches training of speech recognition systems.



Any inquiry concerning this communication or earlier communications from the examiner should be directed to ANGELA A ARMSTRONG whose telephone number is (571)272-7598. The examiner can normally be reached M,T,TH,F 11:30-8:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre Desir, can be reached at 571-272-7799. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

ANGELA A. ARMSTRONG
Primary Examiner
Art Unit 2659



/ANGELA A ARMSTRONG/Primary Examiner, Art Unit 2659