DETAILED ACTION
Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

	Continued Examination Under 37 CFR 1.114

A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 9/16/2022 has been entered.
 
Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-15 are rejected under 35 U.S.C. 103 as being unpatentable over Weinstein (20150039299) in view of Biadsy (20150287405) in view of Olson (20080146895).

As per claim 1, Weinstein (20150039299) teaches a modeling method for speech recognition, comprising:
processing first speech data of Mandarin (as operating on Mandarin and Cantonese regions – para 0081) and first speech data of P 
and counting the obtained tags and performing deduplication on tags of each type to determine N types of tags, N being a positive integer and P being a positive integer (as, reducing the number of vectors (which includes the label/tag information) by using pre-computed vectors so as to save computation time – para 0061); 
training a neural network according to second speech data of Mandarin (as operating on accents including Mandarin – para 0081), 
and generating a recognition model when the neural network converges, wherein outputs of the recognition model are the N types of tags (as, the output of the neural network, as well as the output of the classifier can identify different languages and/or accents – para 0037; in view of para 0045, 0025, there would be 3 types of accent tags and 5 types of language tags, as an example from para 0037);
inputting second speech data of the P dialects into the recognition model for processing respectively to obtain an output tag of each frame of the second speech data of each dialect (as part of the ‘speech data’ – see para 0021 – the speech data includes the acoustic audio signal and context information including geographic, IP address, and accent information, as inputs to the neural network);
determining, according to the output tags and tagged true tags of the second speech data of each dialect, an error rate of each type of the N types of tags for each of the P dialects (as the output of the neural network is used to measure an error rate, which is backpropagated throughout the network to update the weighting – para 0100; the weights operate on the finite state transducers, which include phonetic units, grammar, and a language model – para 0036, including accent information – para 0026; also see the i-vector representing all features of the speech signal – para 0049, 0050; hence, the disclosed error calculation backpropagates on all of these labels/tags of the speech data),
and generating M types of target tags according to M types of tags whose error rates are greater than a preset threshold, M being an integer greater than or equal to zero (as measuring an error rate through the neural network – para 0100; these error rates are calculated on the label for the feature vector – para 0100; the labels are up to 39 dimensions per acoustic feature vector – para 0045; and one of the feature vectors is the number of accents/dialects – see para 0025, wherein accent data applies to different types of English – para 0025, and Mandarin/Cantonese – para 0081);
and training an acoustic model according to third speech data of Mandarin and third speech data of the P dialects, wherein outputs of the acoustic model are the N types of tags and the M types of target tags corresponding to each of the P dialects (as, explained in detail above, the number of accents/dialects and labels of the audio data; using these audio sections to train the audio/speech models as well as the context classification via classifiers, and other categories – para 0038, and the example given shows calculated probabilities for certain languages – end of para 0038). 
	As noted above, Weinstein (20150039299) discusses, as part of the labels/tags input into the neural network, speech features such as derived languages/accents, but does not explicitly label these as ‘dialects’.  Biadsy (20150287405) teaches labeling input acoustic information for dialects as well (Fig. 5, subblock 504, part of the language model; also see para 0006, 0010).  Therefore, it would have been obvious to one of ordinary skill in the art of language modeling to modify the feature space of Weinstein (20150039299) to include dialects, as taught by Biadsy (20150287405), because it would advantageously improve speech recognition in languages that have multiple dialects (Biadsy (20150287405), para 0005).  The combination of Weinstein (20150039299) in view of Biadsy (20150287405) teaches an error rate, as noted above in Weinstein, but does not explicitly teach that the error rate is determined by a ratio of misclassifications of the data to the total amount of classification data.  Olson (20080146895) teaches, for detection of correct data, a ratio of the mismatches to the correct matches, to generate a probability of a low-risk or high-risk match (para 0085).  Therefore, it would have been obvious to one of ordinary skill in the art of data analysis to improve upon the error correction of Weinstein to include a ratio calculation of mismatches to correct matches, as taught by Olson (20080146895), because it would advantageously improve the accuracy of classification of the result (para 0049, end of paragraph); in Olson, an accurate measure of the result is critical in determining to which category the data belongs (low, medium, or high risk).
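For illustration only, the error-rate and thresholding limitations of claim 1 can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Weinstein, Biadsy, Olson, or the application: the function names, the sample frames, the threshold value, and the "<tag>_<dialect>" naming of target tags are all hypothetical. The error rate is assumed here to be the ratio of misclassified frames to total frames for each (dialect, true tag) pair.

```python
from collections import defaultdict


def tag_error_rates(frames):
    """Return {(dialect, true_tag): error_rate}.

    frames: iterable of (dialect, true_tag, output_tag), one entry per frame.
    The rate is misclassified frames / total frames for that dialect and tag
    (an assumed definition, used only for this illustration).
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for dialect, true_tag, output_tag in frames:
        key = (dialect, true_tag)
        totals[key] += 1
        if output_tag != true_tag:
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}


def target_tags(error_rates, threshold):
    """Create one dialect-specific target tag per tag whose rate exceeds threshold.

    The "<tag>_<dialect>" naming is hypothetical.
    """
    return sorted(
        f"{tag}_{dialect}"
        for (dialect, tag), rate in error_rates.items()
        if rate > threshold
    )


# Hypothetical frames: (dialect, tagged true tag, recognition-model output tag)
frames = [
    ("dialect_A", "a1", "a1"),
    ("dialect_A", "a1", "a2"),
    ("dialect_A", "a2", "a2"),
    ("dialect_B", "a1", "a1"),
]
rates = tag_error_rates(frames)
selected = target_tags(rates, threshold=0.4)
print(rates)
print(selected)
```

In this sketch M is simply the number of entries in `selected` and may be zero when no error rate exceeds the threshold, consistent with the claim language that M is greater than or equal to zero.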
       
As per claim 2, the combination of Weinstein (20150039299) in view of Biadsy (20150287405) in view of Olson (20080146895) teaches the method of claim 1, wherein inputting the second speech data of the P dialects into the recognition model for processing respectively, to obtain the output tag of each frame of the second speech data of each dialect comprises:
extracting a filter bank coefficient characteristic of the second speech data (Weinstein 20150039299, filterbanks operating on the speech data – para 0044) of the P dialects, 
and determining N posterior probabilities of each frame of the second speech data of each dialect according to the filter bank coefficient characteristic (Weinstein 20150039299, operating on speech acoustic data, the neural network generating posterior probabilities – para 0056, 0057); 
and determining a tag corresponding to a maximum posterior probability in the N posterior probabilities as an output tag of a frame of the second speech data corresponding to the N posterior probabilities (Weinstein 20150039299, as posterior probabilities calculated on the feature vectors – para 0073; as explained above, the features in the feature vectors include values/tags on how many hits are associated with a differing language and/or accent – para 0083; encoding vectors into the neural network that track the language/accent – para 0084; and tracking the relationship/alignment with the acoustic features of the input speech – para 0031, along with the word spoken, speaking style, speaker's gender and speaker's accent – para 0032). 
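For illustration only, the final limitation of claim 2 (selecting, for each frame, the tag with the maximum of the N posterior probabilities) can be sketched as below. The tag names and probability values are hypothetical, and the sketch assumes the posteriors have already been computed from the filter bank features; it does not represent the method of any cited reference.

```python
def output_tags(posteriors, tags):
    """Pick the tag with the largest posterior probability for each frame.

    posteriors: list of per-frame lists, each holding N probabilities.
    tags: list of the N tag names, in the same order as the probabilities.
    """
    result = []
    for frame in posteriors:
        best_index = max(range(len(tags)), key=lambda i: frame[i])
        result.append(tags[best_index])
    return result


# Hypothetical example with N = 3 tags and two frames
tags = ["t0", "t1", "t2"]
posteriors = [
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
]
frame_tags = output_tags(posteriors, tags)
print(frame_tags)
```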

As per claim 3, the combination of Weinstein (20150039299) in view of Biadsy (20150287405) in view of Olson (20080146895) teaches the method of claim 1, wherein training the acoustic model according to the third speech data of Mandarin and the third speech data of the P dialects comprises:
generating training samples according to the third speech data of Mandarin, first tagged tags corresponding to the third speech data of Mandarin, the third speech data of the P dialects and second tagged tags corresponding to the third speech data of the P dialects (Weinstein 20150039299, as, the number of accents/dialects and labels of the audio data; using these audio sections to train the audio/speech models as well as the context classification via classifiers, and other categories – para 0038, and the example given shows calculated probabilities for certain languages – end of para 0038); 
for the third speech data of each of the P dialects, replacing the M types of tags originally tagged whose error rates are greater than the preset threshold with corresponding M types of target tags to obtain updated training samples (Weinstein 20150039299, as measuring an error rate through the neural network – para 0100; these error rates are calculated on the label for the feature vector – para 0100; the labels are up to 39 dimensions per acoustic feature vector – para 0045; and one of the feature vectors is the number of accents/dialects – see para 0025, wherein accent data applies to different types of English – para 0025, and Mandarin/Cantonese – para 0081); 
and training a processing parameter of a preset model according to a preset objective function and the updated training samples to obtain the acoustic model (Weinstein 20150039299, as, explained in detail above in claim 1, the number of accents/dialects and labels of the audio data; using these audio sections to train the audio/speech models as well as the context classification via classifiers, and other categories – para 0038, and the example given shows calculated probabilities for certain languages – end of para 0038).  
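For illustration only, the tag-replacement limitation of claim 3 can be sketched as a relabeling pass over the training samples. This is a minimal sketch under assumed data structures: the sample layout, the mapping from (dialect, original tag) to target tag, and all names are hypothetical and are not drawn from the application or the cited references.

```python
def relabel(samples, targets):
    """Replace high-error tags in dialect samples with their target tags.

    samples: list of (dialect, [tag for each frame]).
    targets: dict mapping (dialect, original_tag) -> target tag, containing
             only the tags whose error rates exceeded the threshold.
    Tags without an entry in targets are left unchanged.
    """
    updated = []
    for dialect, frame_tags in samples:
        new_tags = [targets.get((dialect, tag), tag) for tag in frame_tags]
        updated.append((dialect, new_tags))
    return updated


# Hypothetical training samples and target-tag mapping
samples = [
    ("mandarin", ["a1", "a2"]),
    ("dialect_A", ["a1", "a2"]),
]
targets = {("dialect_A", "a1"): "a1_dialect_A"}
updated = relabel(samples, targets)
print(updated)
```

The Mandarin sample is untouched in this sketch because the mapping holds entries only for dialect tags, which mirrors the claim language that the replacement applies to the third speech data of each of the P dialects.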

As per claim 4, the combination of Weinstein (20150039299) in view of Biadsy (20150287405) in view of Olson (20080146895) teaches the method of claim 1, before processing the first speech data of Mandarin and the first speech data of the P dialects respectively based on the pre-trained alignment model, further comprising: 
obtaining fourth speech data of Mandarin and corresponding text information (Weinstein 20150039299, as noted above, in one example, applying the system to Mandarin and Cantonese – para 0081);
and extracting a MFCC characteristic of each frame of the fourth speech data (Weinstein 20150039299, as using MFCC features – para 0044), 
and generating, according to the MFCC characteristic and the text information, the alignment model by training a parameter of a Gaussian mixture model based on maximum likelihood estimation (Weinstein 20150039299, as using a Gaussian mixture model in the acoustic space – para 0033, which uses the MFCC characteristics derived in the acoustic space – para 0044). 

As per claim 5, the combination of Weinstein (20150039299) in view of Biadsy (20150287405) in view of Olson (20080146895) teaches the method of claim 1, after generating the M types of target tags according to the M types of tags whose error rates are greater than the preset threshold, further comprising: 
updating a decoding dictionary according to the M types of target tags (Weinstein 20150039299, para 0030 teaches that part of the encoding side is the audio signal, context information, and tag identifiers – see fig. 4, subblock 408; these are inputs into the neural network, fig. 4, subblock 410, and the neural network ‘decodes’ the audio/context/tags into a transcription (fig. 4, subblock 410)). 

Claims 6-10 are device claims that perform the method steps of claims 1-5 and as such, claims 6-10 are similar in scope and content to method claims 1-5 and therefore, claims 6-10 are rejected under similar rationale as presented against claims 1-5 above.  Furthermore, Weinstein 20150039299 teaches a processor/memory executing code (para 0109 – hardware, memory device, and computer readable medium).
 
Claims 11-15 are computer readable medium claims executing code via processor, performing the method steps of claims 1-5 and as such, claims 11-15 are similar in scope and content to method claims 1-5 and therefore, claims 11-15 are rejected under similar rationale as presented against claims 1-5 above.  Furthermore, Weinstein 20150039299 teaches a processor/memory executing code (para 0109 – hardware, memory device, and computer readable medium).

Response to Arguments

Applicant's arguments filed 09/16/2022 have been fully considered but they are not persuasive.  As per Applicant's arguments against the weighting in Weinstein (20150039299), the examiner maintains that the windowing function of the i-vectors, as part of the neural network, constitutes a weighting effect (see Weinstein, para 0050-0058).  As to the ratio calculation for error rates, see the newly applied Olson reference.


Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
The following references were found to perform language modeling processing dialects, with a combination of the claimed acoustic processing features, as well as context features:
Thomson (20200175961) teaches dialect/accent (para 0205) along with acoustic features (para 0255 – acoustic models), and a neural network based rescoring language model (para 0256-0258).

Zadeh (20180204111) teaches dialect listings (para 1327) as part of the feature space for acoustic modeling (para 2296)
 
See also 20180366112, 20170092268, 20160247501, and 20050055209, teaching various elements of language modeling, including labeling, tagging, neural networks, context driven feature spaces, etc.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Michael Opsasnick, telephone number (571)272-7623, who is available Monday-Friday, 9am-5pm. 
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Mr. Richemond Dorvil, can be reached at (571)272-7602.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).
/Michael N Opsasnick/
Primary Examiner, Art Unit 2658
10/07/2022