DETAILED ACTION
This office action is in response to correspondence filed on 3/2/2021 in reference to application.  
The Amendment filed on 3/2/2021 has been entered.  
Claims 2-21 remain pending in the application, of which claims 2, 9, and 16 are independent.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Examiner’s Amendment
An examiner’s amendment to the record appears below. Should the changes and/or additions be unacceptable to applicant, an amendment may be filed as provided by 37 CFR 1.312. To ensure consideration of such an amendment, it MUST be submitted no later than the payment of the issue fee.

Authorization for this examiner’s amendment was given telephonically as well as electronically by the Applicant's Attorney, KEAT A. QUINALTY (Reg. No. 46,426) on 4/14/2021.  The attorney accepted the examiner-suggested amendments to the language of independent claims 2, 9, and 16, to better explain the inventive concept and to overcome the prior art of record.
The application has been amended as follows:
Replace claims 2, 9, and 16 with the claims below, with strikethrough (--) and double square brackets ([[ ]]) representing deletions and underlining representing additions.

2. (Currently Amended) A computer system, comprising: 
a processor operable: 
to differentiate between multiple speakers in an audio stream of speech based in part on frequencies in the audio stream, wherein said audio stream of speech comprises audible speech of at least a first speaker and a second speaker, 
to convert the audio stream into text, and 
to generate time stamps in the audio stream to associate the text with the audio stream; and 
a machine learning module implemented by one or more processors 
to access pre-learned phonemes, 
to identify the first speaker in the audio stream based on the pre-learned phonemes, 
to locate a portion of the text associated with the first speaker based on the time stamps, 
to segment the text associated with the first speaker into text phonemes, [[and]]
to correct the text associated with the first speaker in real-time by comparing the text phonemes with phonetically-similar letter pairs of the pre-learned phonemes and applying one or more filters to the text to generate a clean transcript, and 
to execute a transaction based on the clean transcript.



9. (Currently Amended) A method, comprising: 

converting the audio stream into text; 
generating time stamps in the audio stream to associate the text with the audio stream; 
accessing pre-learned phonemes; 
identifying the first speaker in the audio stream based on the pre-learned phonemes; 
locating a portion of the text associated with the first speaker based on the time stamps; 
segmenting the text associated with the first speaker into text phonemes; [[and]] 
correcting the text associated with the first speaker in real-time by comparing the text phonemes with phonetically-similar letter pairs of the pre-learned phonemes and applying one or more filters to the text to generate a clean transcript; and
executing a transaction based on the clean transcript.



16. (Currently Amended) A non-transitory computer readable medium comprising instructions that, when executed by a multi-core processor, direct the multi-core processor to: 
differentiate between multiple speakers in an audio stream of speech based in part on frequencies in the audio stream, wherein said audio stream of speech comprises audible speech of at least a first speaker and a second speaker; 
convert the audio stream into text; 
generate time stamps in the audio stream to associate the text with the audio stream; 

locate a portion of the text associated with the first speaker based on the time stamps; 
segment the text associated with the first speaker into text phonemes; [[and]] 
correct the text associated with the first speaker in real-time by comparing the text phonemes with phonetically-similar letter pairs of the pre-learned phonemes and applying one or more filters to the text to generate a clean transcript; and 
execute a transaction based on the clean transcript.
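
For illustration only, the locating, correcting, and filtering steps recited in claims 2, 9, and 16 can be sketched as follows. The claims do not specify an algorithm, so everything here is an assumption: the `Word` record, the `SIMILAR_PAIRS` table, the vocabulary, and the filler-word filter are hypothetical, and individual letters stand in for the claimed text phonemes.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float   # time stamp in seconds (assumed unit)
    end: float
    speaker: str   # label produced by an earlier diarization step


# Hypothetical table of phonetically similar letter pairs.
SIMILAR_PAIRS = {("b", "p"), ("d", "t"), ("m", "n"), ("f", "v")}


def similar(a, b):
    """True if two letters are equal or listed as phonetically similar."""
    return a == b or (a, b) in SIMILAR_PAIRS or (b, a) in SIMILAR_PAIRS


def locate(words, speaker, t0, t1):
    """Select one speaker's words whose time stamps fall inside [t0, t1]."""
    return [w for w in words
            if w.speaker == speaker and w.start >= t0 and w.end <= t1]


def correct_word(word, vocabulary):
    """Replace a word with a known word that differs only by similar letters."""
    if word in vocabulary:
        return word
    for cand in vocabulary:
        if len(cand) == len(word) and all(similar(x, y) for x, y in zip(word, cand)):
            return cand
    return word


def clean_transcript(words, speaker, t0, t1, vocabulary, fillers=("um", "uh")):
    """Locate, correct, and filter one speaker's text."""
    tokens = [correct_word(w.text.lower(), vocabulary)
              for w in locate(words, speaker, t0, t1)]
    return " ".join(t for t in tokens if t not in fillers)


words = [
    Word("um", 0.0, 0.2, "A"),
    Word("bay", 0.2, 0.5, "A"),
    Word("hello", 0.5, 0.9, "B"),
    Word("tom", 0.9, 1.2, "A"),
]
result = clean_transcript(words, "A", 0.0, 2.0, ["pay", "tom"])
```

In this sketch the misrecognized word "bay" is mapped to the vocabulary word "pay" because "b" and "p" are listed as a similar pair, the second speaker's word is excluded by the speaker label, and the filler "um" is removed by the filter.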


Allowable Subject Matter
Claims 2-21 are allowed over the prior art of record.  The following is the examiner’s statement of reasons for allowance:
The closest relevant prior art (which is discussed in further detail below), either taken individually or in combination, fails to explicitly teach or reasonably suggest the invention as represented by the independent claims 2, 9, and 16.

Most pertinent prior art:
MCLAREN (US 2016/0248768 A1) discloses a computer system, comprising: 
a processor (MCLAREN Fig. 6 – “Processor 612”; Par 52 – “The illustrative computing device 610 includes at least one processor 612 (e.g. a microprocessor, microcontroller, digital signal processor, etc.), memory 614, and an input/output (I/O) subsystem 616.”) operable 
to differentiate between multiple speakers (MCLAREN Par 44 – “The system 100 may generate a biometric score by computing the similarity between the current model 124 and the stored models 126 mathematically, e.g., using PLDA. If there are multiple stored models 126, the system 100 may analyze the similarity of the current model 124 to each of the stored models or a subset of the stored models 126 and make a match determination based on the similarity of the current model 124 to all or a subset of the stored models 126.”; Figs.5A-5C – “Speaker 1”, “Speaker 2”, and “Speaker 3”; Par 30 – “A DNN is trained to discriminate between different output classes such as senones, speakers, conditions, etc.”) in an audio stream of speech based in part on frequencies in the audio stream (MCLAREN Par 22 – “A neural network-based acoustic model 116 of the speech recognizer 114 generates a bottleneck feature 117 output that is combined with cepstral features 118 separately derived from the current speech sample 130, which combined features are used to create a joint speaker and content model of the current speech, 124. The combination of bottleneck features 117 and cepstral features (e.g., the known Mel frequency cepstral coefficient (MFCC) or pcaDCT) allows for generation from the combination of a phonetic model (such as an i-vector) capable of analysis for both speaker identification and phonetic or text identification.”), <wherein said audio stream of speech comprises audible speech of at least a first speaker and a second speaker> (MCLAREN Par 36 – “Additionally, the analysis of the data may include separation of the authorized user's speech data from contemporaneously captured speech from other human speakers. The phonemes associated with the other speaker's voices will result in i-vectors sufficiently different from those trained on the authorized user's voice to not result in a match, even if the other speakers speak registered commands for the associated device.”),
to convert the audio stream into text (MCLAREN Fig. 3 – “[t], [eh], [en]”; Par 33 – “Referring now to FIG. 3, a simplified illustration of a phonetic representation of a speech sample 300, which may be created by the computing system 100, is shown. The speech sample 300 is divided into time slices, and each time slice contains a speech segment, e.g., a portion of the acoustic signal, 310, 312, 314.”; Par 47 – “In the flow chart of FIG. 5A, speaker 1 says “drive to work.” The car's spoken command analyzer receives the audio of the spoken command and determine that the short phase includes the text command of “drive to work” and determines the identity of the speaker as speaker 1.”), and 
to generate time stamps in the audio stream to associate the text with the audio stream (MCLAREN Fig. 3; Par 21 – “A speech segment may be referred to as a “time slice” or “frame” of the audio (speech) signal. The illustrative speech recognizer 114 aligns time with the phone-level content of the speech sample 130 so that the phonemic or phonetic content of each speech segment can be determined in the context of the temporally preceding and/or subsequent phonemic or phonetic content.”; Par 33 – “The speech sample 300 is divided into time slices, and each time slice contains a speech segment, e.g., a portion of the acoustic signal, 310, 312, 314. … The notation “b” refers to the beginning pronunciation of the phone, “m” refers to the middle portion of the phone, and “e” refers to the end of the phone 318.”; Par 41 – “At block 414, the system 100 identifies the temporal speech segments (or “time slices” or “frames”) of the current speech sample. The temporal speech segments may correspond to, for example, the sampling rate of the ADC or a multiple of the sampling rate.”); and 
a machine learning module implemented by one or more processors (MCLAREN Fig. 1; Par 8 – “FIG. 1 comprises a simplified module diagram of an environment of at least one embodiment of a computing system for performing phonetically-aware command/speaker recognition as disclosed herein in accordance with various embodiments of the invention;”)
to access pre-learned phonemes (MCLAREN Par 19 – “A front end module 112 of the spoken command analyzer system 110 uses training data (one or more speech samples collected from the user) to create and store one or more joint content and speaker models 126 of the training data. This can be done during an enrollment process or passively during normal use of the user's device, for example. The stored joint content and speaker model 126 models both content specific and speaker specific features (e.g., acoustic properties) extracted from the user's training data. The stored joint content and speaker model 126 may be referred to herein as, for example, a phonetic model.”; Par 23 – “To generate the joint command and speaker the stored phonetic model(s) 126 using speaker-specific production of the phonemic, phonetic, or lexical content (e.g., at the phone or tri-phone level), rather than simply relying on the traditional acoustic features alone.”; Par 43 – “At block 420, the computing system 100 retrieves one or more stored speaker model(s) (e.g., stored speaker models 126) from, for example, memory or data storage. The illustrative stored speaker model(s) were created using a process substantially matching that which created the phonetic model of the current speech 124.”), 
to identify the first speaker in the audio stream based on the pre-learned phonemes (MCLAREN Par 22 – “The combination of bottleneck features 117 and cepstral features (e.g., the known Mel frequency cepstral coefficient (MFCC) or pcaDCT) allows for generation from the combination of a phonetic model (such as an i-vector) capable of analysis for both speaker identification and phonetic or text identification. These features are also provided to a statistics generator 119, which generates statistics 136 relating to the various features that can be further used for creation of the phonetic model. The statistics generator 119 may rely on the universal background model (UBM) to generate the described statistics.”; Par 34 – “Because the compared i-vectors includes the described feature sets with both text and speaker depending features, this i-vector comparison and analysis jointly determines a match both for substantive content and for speaker identification. Thus, the i-vector creation and comparison inherently includes confirmation of both the substantive content and the speaker's identity without performing separates analyzes of the speech data.”; Par 36 – “The phonemes associated with the other speaker's voices will result in i-vectors sufficiently different from those trained on the authorized user's voice to not result in a match, even if the other speakers speak registered commands for the associated device.”; Par 44 – “For example, if both the current speech sample and a stored speech sample include the word “cat” (a tri-phone), the system 100 may analyze the similarity of the phonetic content of the word “cat” in the current model with the phonetic content of the word “cat” in the stored model. 
… If there are multiple stored models 126, the system 100 may analyze the similarity of the current model 124 to each of the stored models or a subset of the stored models 126 and make a match determination based on the similarity of the current model 124 to all or a subset of the stored models 126.”), 
to locate a portion of the text associated with the first speaker based on the time stamps (MCLAREN Par 41 – “If the system 100 determines that command/speaker recognition is to be performed, the system 100 proceeds to block 414. At block 414, the system 100 identifies the temporal speech segments (or “time slices” or “frames”) of the current speech sample. The temporal speech segments may correspond to, for example, the sampling rate of the ADC or a multiple of the sampling rate. In some cases, overlapping speech segments may be used to ensure that an important feature of the signal is not missed.”), 
to segment the text associated with the first speaker into text phonemes (MCLAREN Fig. 3; Par 33 – “The speech sample 300 is divided into time slices, and each time slice contains a speech segment, e.g., a portion of the acoustic signal, 310, 312, 314. Each speech segment is associated with corresponding speech content, in this case, the phones 316, 318, 320. Additionally, with reference to the phone 318, the phonetic state or context is illustrated in more detail (although the same analysis applies to any of the phones 316, 318, 320). The notation “b” refers to the beginning pronunciation of the phone, “m” refers to the middle portion of the phone, and “e” refers to the end of the phone 318.”).  
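
For illustration only, the division of a speech sample into time slices, optionally overlapping, as described in the cited MCLAREN Par 33 and Par 41, can be sketched as follows. The function name, its parameters, and the use of a plain list of samples are assumptions and are not taken from MCLAREN.

```python
def frame_signal(samples, sample_rate, frame_len, hop):
    """Split samples into frames of frame_len samples, advancing by hop.

    A hop smaller than frame_len yields overlapping frames.
    Returns a list of (start_time_in_seconds, frame) pairs.
    """
    if sample_rate <= 0 or frame_len <= 0 or hop <= 0:
        raise ValueError("sample_rate, frame_len, and hop must be positive")
    frames = []
    last_start = max(len(samples) - frame_len, 0)
    for start in range(0, last_start + 1, hop):
        frames.append((start / sample_rate, samples[start:start + frame_len]))
    return frames


frames = frame_signal(list(range(10)), sample_rate=2, frame_len=4, hop=2)
```

Each frame carries its start time, which is one simple way a later step could associate recognized content with a position in the audio stream.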

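MCLAREN teaches using the user-specific pre-learned model to accurately authenticate the user.  However, MCLAREN does not explicitly teach correcting the text associated with the first speaker in real-time by comparing the text phonemes with phonetically-similar letter pairs of the pre-learned phonemes and applying one or more filters to the text to generate a clean transcript, as recited in the independent claims.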
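
For illustration only, the match determination described in the cited MCLAREN Par 44 (scoring a current model against each stored model) can be sketched as follows. MCLAREN names PLDA as one example scoring method; this sketch substitutes cosine similarity as a simpler stand-in, and the vector values, the threshold, and the function names are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_match(current, stored, threshold=0.8):
    """Score the current model against every stored model.

    Returns (name, score) for the best-scoring stored model, or
    (None, score) when the best score is below the threshold.
    """
    best_name, best_score = None, -1.0
    for name, vec in stored.items():
        score = cosine(current, vec)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score
    return best_name, best_score


stored = {"speaker1": [1.0, 0.0, 0.0], "speaker2": [0.0, 1.0, 0.0]}
name, score = best_match([0.9, 0.1, 0.0], stored)
no_name, no_score = best_match([1.0, 1.0, 0.0], stored)
```

A vector close to the stored model for speaker 1 is accepted as a match, while a vector equally distant from both stored models falls below the assumed threshold and is rejected.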




Any comments considered necessary by Applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee.  Such submissions should be clearly labeled "Comments on Statement of Reasons for Allowance." 


Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.  Please see attached form PTO-892.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to JONATHAN C. KIM, whose telephone number is (571)272-3327.  The examiner can normally be reached Monday through Friday, 9:00 AM to 5:30 PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Pierre-Louis Desir, can be reached at 571-272-7799.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR 






/JONATHAN C KIM/Primary Examiner, Art Unit 2659