Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Compact Prosecution
The Examiner would like to propose amending the independent claims to include the limitations “a key system component that includes automated threshold detection and forwarding of media data and metrics which require validation and the creation of a near real-time data pipeline for data validation, wherein an offline processing takes weeks or months per iteration.” This amendment will overcome the current rejection.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-6 and 8-20 are rejected under 35 U.S.C. 103 as being unpatentable over Garera et al. (US20140297570) in view of Pallath et al. (US20170011111).
	Claim 1, Garera discloses a system for training a model (Section 0054, lines 9-12 classifier 404 are trained using training data) comprising:
at least one receiver for receiving media data; (Section 0054, lines 14-16- prior to training the model, records (training data) are input into the machine learning algorithm) and
at least one processor (Processor 202 in Fig. 2) and at least one memory (Memory device (s) 204) containing instructions that, when executed, cause the at least one processor to:
separate the media data into one or more clusters, (Section 0054, lines 11-13- “all records in a record corpus may be classified (separated) using the trained model”) 
(The secondary reference, Pallath (US20170011111), also addresses the limitation “separate the media data into one or more clusters”; see Section 0033.)
 each cluster of the one or more clusters based on a feature from a first model; (Section 0054, lines 4-9- (text) with classification value pairings indicates the common features within each class or cluster- also see Section 0061, lines 1-2)
based on an analysis of the media data determine an accuracy of the media data of each cluster the accuracy associated with the feature; (Section 0055, lines 1-3- Thus “the machine learning algorithm may associate a confidence score with a classification output as a result of the classification of records” where the score reads on the accuracy of the records classified) 
based on a subset dataset (data sent to the crowdsourcing forum because its score is outside the threshold; see Section 0056) of the media data being outside a threshold accuracy automatically forward the subset dataset to a crowd source service; (Section 0056, lines 1-3: the classifications that are not identified as high confidence read on the subset dataset whose accuracy is outside the threshold)
receive verification of the subset dataset from the crowd source service; (Section 0057, lines 2-3 “a validation decision may be received from the crowdsourcing forum”) and add the verified subset dataset to the first model. (Section 0060, lines 1-2- thus “added to the training set-”)
(the training set reads on the model) 
Garera does not disclose extracting a sample of the media data of each cluster for further analysis.
Pallath discloses a clustering system in which a sample of the media data of each cluster is extracted for further consideration. (Fig. 4A: a number of samples per cluster is extracted for further analysis.)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Garera to include Pallath's teaching of considering a sample of a dataset for further examination. The motivation is that considering only a sample saves time and memory.
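For illustration only, the threshold-based forwarding mapped above can be sketched as follows. This is a minimal sketch and is not taken from Garera, Pallath, or the application; the names `Record`, `route_by_confidence`, and `augment_training_set`, and the 0.8 threshold, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Record:
    text: str
    label: str
    confidence: float  # hypothetical classifier confidence score in [0.0, 1.0]


def route_by_confidence(records, threshold):
    """Split records into high-confidence ones and ones needing validation."""
    accepted = [r for r in records if r.confidence >= threshold]
    to_validate = [r for r in records if r.confidence < threshold]
    return accepted, to_validate


def augment_training_set(training_set, accepted, validated):
    """Return a new training set extended with accepted and validated records."""
    return list(training_set) + list(accepted) + list(validated)


if __name__ == "__main__":
    records = [
        Record("red running shoe", "footwear", 0.95),
        Record("usb hub", "footwear", 0.40),
    ]
    accepted, to_validate = route_by_confidence(records, threshold=0.8)
    # In a real system, to_validate would be forwarded to a crowdsourcing
    # forum; here the validation step is assumed to return them unchanged.
    training_set = augment_training_set([], accepted, to_validate)
    print(len(accepted), len(to_validate), len(training_set))
```

In this sketch the records below the threshold correspond to the "subset dataset" forwarded for validation, and the extended training set corresponds to the updated first model.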
Claim 2, Garera in view of Pallath discloses wherein the at least one processor is further configured to generate a second model based on the received verified subset dataset; (Garera: Section 0066, lines 8-10 “augmenting 506 training data which is received from an analyst workstation with high confidence data” reads on the second model) and upon determining that an accuracy of the second model exceeds the first model, update the first model with the second model. (Garera: Section 0055, lines 2-9: a classification with a confidence score above a specified threshold may be added to the training set, which is thereby updated)
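For illustration only, the claimed step of replacing the first model when the second model is more accurate can be sketched as below. This is a hypothetical sketch that assumes models are callables and that accuracy is measured on a held-out labelled set; neither assumption comes from the cited references.

```python
def accuracy(model, labelled):
    """Fraction of (input, expected_label) pairs the model labels correctly."""
    correct = sum(1 for x, y in labelled if model(x) == y)
    return correct / len(labelled)


def update_if_better(first_model, second_model, held_out):
    """Keep the first model unless the second is strictly more accurate."""
    if accuracy(second_model, held_out) > accuracy(first_model, held_out):
        return second_model
    return first_model
```

The strict comparison means that a tie leaves the first model in place, which is one possible reading of "exceeds".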
Claim 3, Garera in view of Pallath (Section 0045, lines 1-4 “score of a feature” reads on feature metric) discloses wherein the at least one processor that determines the accuracy of the subset dataset (Garera: Section 0051, lines 1-2 training data…generated by the analysts) is configured to determine a feature metric associated with the feature from an analysis of the first model; (Garera: Section 0051, lines 1-5- thus Classification value represents a common character among the product records)
define a centroid based on the feature metric; (Pallath: Section 0065, lines 8-10 “a centroid of a cluster can be determined by using k-means clustering”) and after separating the media data into the one or more clusters (Classified data) determine a measured metric associated with the media data in each cluster. (Garera: Section 0038, lines 1-4- thus classification data or records with a confidence score (measured metric) above a threshold)
Claim 4, Garera in view of Pallath discloses wherein the at least one processor is further configured to select the subset dataset (Garera: Section 0051, lines 1-2 training data…generated by the analysts) of each cluster based on the measured metric (confidence score (measured metric)) matching the feature metric (Characteristic value) associated with the centroid within a threshold. (Garera: Section 0038, lines 8-10- thus the confidence score is in between the first and the second threshold)
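For illustration only, the centroid-based selection recited in claims 3 and 4 can be sketched as below. This is a hypothetical sketch that assumes each item is represented by a numeric feature vector and that "matching within a threshold" means Euclidean distance to the cluster centroid; the cited references do not necessarily use this measure.

```python
import math


def centroid(points):
    """Component-wise mean of equal-length feature vectors."""
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[i] for p in points) / n for i in range(dims))


def within_threshold(points, center, threshold):
    """Select the points whose distance to the centroid is at most threshold."""
    return [p for p in points if math.dist(p, center) <= threshold]
```

For a cluster with points (0, 0), (2, 0), and (1, 3), the centroid is (1, 1), and a threshold of 1.5 selects the first two points only.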
Claim 5, Garera in view of Pallath discloses wherein the at least one processor is further configured to receive a labelled subset dataset from the crowd source service; (Garera: Section 0041, lines 6-8- “substitute classification” means a new class or label has been outputted by the crowdsource forum)
(also see Section 0052, lines 3-4 “add more descriptive data to the one or more records”)
add the labelled subset dataset to the first model to create a combined model dataset; (Garera: Section 0042, lines 3-5- thus the new classification designated as valid by the crowdsource forum are added/combined to the training data (model) to create a new model)
generate the second model based on the combined model dataset; (Garera: Section 0042, lines 6-8- thus the combined training dataset reads on the new model or training dataset) and determine the accuracy of the second model. (Section 0043, lines 8-9 “evaluation of the correctness of the validation decision”)

Claim 6, Garera in view of Pallath discloses wherein the at least one processor is configured to generate the second model on an ongoing basis. (Garera: Section 0047, lines 6-9: “the analyst generate training data when appropriate” means that the training data is generated on an ongoing basis)
Claim 8, Garera in view of Pallath discloses wherein the media data is comprised of multiple data types the multiple data types including audio, visual, and text data. (Garera: Section 0051, lines 3-5 product records of a product catalog means the product data is a text or image data) 
Claim 9, Garera in view of Pallath discloses wherein the media data is configured to be separated into the one or more clusters based on an unsupervised machine learning technique. (Garera: Section 0036, lines 5-7- “machine learning algorithm including unsupervised learning algorithm”) 
Claim 10, Garera in view of Pallath discloses wherein the at least one processor is further configured to automatically  forward the subset dataset to the crowd source service based on a volume  (Garera: Section 0056, lines 1-2 “Some or all of the Classification” reads on volume or part of the data) of the subset dataset being above the threshold accuracy. (Garera: Section 0047, lines 1-9- thus “an analyst module may select classification values or categories of classification values on the basis of a percentage of classification (threshold accuracy) … that were marked as invalid to generate a prompt transmitted or displayed to analysts (the analyst offers crowdsourcing services)”)
Claim 11, Garera in view of Pallath discloses wherein the volume of the subset dataset that initiates forwarding to the crowd source service (Garera: Section 0056, crowdsource services) is based on a volume heuristics model that is configured to determine an amount of data predicted to successfully update the first model. (Garera: Section 0049, lines 1-5: if the classification value percentage is above the threshold, then the crowdsourcing is initiated to generate training data)
Claim 12, Garera discloses a non-transitory computer-readable medium comprising instructions, (Section 0027, lines 8-10- thus Processor which includes various types of computer readable media) when executed by a computing system, (Section 0034, lines 1-3 executable program) the instructions cause the computing system to:
separate media data into one or more clusters, (Section 0054, lines 11-13- “all records in a record corpus may be classified (separated) using the trained model”) 
(The secondary reference, Pallath (US20170011111), also addresses the limitation “separate the media data into one or more clusters”; see Section 0033.)
each cluster of the one or more clusters based on a feature from a first model; (Section 0054, lines 4-9- (text) with classification value pairings indicates the common features within each class or cluster- also see Section 0061, lines 1-2)
based on an analysis of the sampled media data, determine an accuracy of the media data of each cluster, the accuracy associated with the feature; (Section 0055, lines 1-3- Thus “the machine learning algorithm may associate a confidence score with a classification output as a result of the classification of records” where the score reads on the accuracy of the records classified)
based on a subset dataset (data sent to the crowdsourcing forum because its score is outside the threshold; see Section 0056) of the media data being outside a threshold accuracy, automatically forward the subset dataset to a crowd source service; (Section 0056, lines 1-3: the classifications that are not identified as high confidence read on the subset dataset whose accuracy is outside the threshold)
receive verification of the subset dataset from the crowd source service; (Section 0057, lines 2-3 “a validation decision may be received from the crowdsourcing forum”) and add the verified subset dataset to the first model. (Section 0060, lines 1-2- thus “added to the training set-”) (the training set reads on the model)
Garera does not disclose extracting a sample of the media data of each cluster for further analysis.
Pallath discloses a clustering system in which a sample of the media data of each cluster is extracted for further consideration. (Fig. 4A: a number of samples per cluster is extracted for further analysis.)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Garera to include Pallath's teaching of considering a sample of a dataset for further examination. The motivation is that considering only a sample saves time and memory.

Claim 13, Garera in view of Pallath discloses the instructions further configured to generate a second model based on the received verified subset dataset; (Garera: Section 0066, lines 8-10 “augmenting 506 training data which is received from an analyst workstation with high confidence data” reads on the second model) and upon determining that an accuracy of the second model exceeds the first model, update the first model with the second model. (Garera: Section 0055, lines 2-9: a classification with a confidence score above a specified threshold may be added to the training set, which is thereby updated)

Claim 14, Garera in view of Pallath (Section 0045, lines 1-4 “score of a feature” reads on feature metric) discloses wherein the instructions that determine the accuracy of the subset dataset (Garera: Section 0051, lines 1-2 training data…generated by the analysts) are configured to determine a feature metric associated with the feature from an analysis of the first model; (Garera: Section 0051, lines 1-5- thus Classification value represents a common character among the product records) 
define a centroid based on the feature metric; (Pallath: Section 0065, lines 8-10 “a centroid of a cluster can be determined by using k-means clustering”)
after separating the media data into the one or more clusters, (Classified data) determine a measured metric associated with the media data in each cluster; (Garera: Section 0038, lines 1-4- thus classification data or records with a confidence score (measured metric) above a threshold)
and select the subset dataset of each cluster based on the measured metric matching the feature metric associated with the centroid within a threshold. (Garera: Section 0038, lines 8-10- thus the confidence score is in between the first and the second threshold)

Claim 15, Garera in view of Pallath discloses wherein automatically forwarding the subset dataset to the crowd source service is further based on a volume (Garera: Section 0056, lines 1-2 “Some or all of the Classification” reads on volume or part of the data) of the subset dataset being above the threshold accuracy. (Garera: Section 0047, lines 1-9- thus “an analyst module may select classification values or categories of classification values on the basis of a percentage of classification (threshold accuracy) … that were marked as invalid to generate a prompt transmitted or displayed to analysts (the analyst offers crowdsourcing services)”)

Claim 16, Garera discloses a method of training a model (Section 0054, lines 9-12 classifier 404 are trained using training data) comprising:
separating media data into one or more clusters, (Section 0054, lines 11-13- “all records in a record corpus may be classified (separated) using the trained model”) 
(The secondary reference, Pallath (US20170011111), also addresses the limitation “separate the media data into one or more clusters”; see Section 0033.)
each cluster of the one or more clusters based on a feature from a first model; (Section 0054, lines 4-9- (text) with classification value pairings indicates the common features within each class or cluster- also see Section 0061, lines 1-2)
based on an analysis of the sampled media data, determining an accuracy of the media data of each cluster, the accuracy associated with the feature; (Section 0055, lines 1-3- Thus “the machine learning algorithm may associate a confidence score with a classification output as a result of the classification of records” where the score reads on the accuracy of the records classified) 
based on a subset dataset (data sent to the crowdsourcing forum because its score is outside the threshold; see Section 0056) of the media data being outside a threshold accuracy, automatically forwarding the subset dataset to a crowd source service; (Section 0056, lines 1-3: the classifications that are not identified as high confidence read on the subset dataset whose accuracy is outside the threshold)
receiving verification of the subset dataset from the crowd source service; (Section 0057, lines 2-3 “a validation decision may be received from the crowdsourcing forum”) and adding the verified subset dataset to the first model. (Section 0060, lines 1-2- thus “added to the training set-”)
(the training set reads on the model)
Garera does not disclose extracting a sample of the media data of each cluster for further analysis.
Pallath discloses a clustering system in which a sample of the media data of each cluster is extracted for further consideration. (Fig. 4A: a number of samples per cluster is extracted for further analysis.)
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Garera to include Pallath's teaching of considering a sample of a dataset for further examination. The motivation is that considering only a sample saves time and memory.

Claim 17, Garera in view of Pallath discloses further comprising generating a second model based on the received verified subset dataset; (Garera: Section 0066, lines 8-10 “augmenting 506 training data which is received from an analyst workstation with high confidence data” reads on the second model) and upon determining that an accuracy of the second model exceeds the first model, update the first model with the second model. (Garera: Section 0055, lines 2-9: a classification with a confidence score above a specified threshold may be added to the training set, which is thereby updated)
Claim 18, Garera in view of Pallath (Section 0045, lines 1-4 “score of a feature” reads on feature metric) discloses wherein determining the accuracy of the subset dataset (Garera: Section 0051, lines 1-2 training data…generated by the analysts) comprises determining a feature metric associated with the feature from an analysis of the first model; (Garera: Section 0051, lines 1-5- thus Classification value represents a common character among the product records)
defining a centroid based on the feature metric; (Pallath: Section 0065, lines 8-10 “a centroid of a cluster can be determined by using k-means clustering”)
after separating the media data into the one or more clusters, (Classified data) determining a measured metric (confidence score (measured metric)) associated with the media data in each cluster; (Garera: Section 0038, lines 1-4- thus classification data or records with a confidence score (measured metric) above a threshold)
and selecting the subset dataset of each cluster based on the measured metric matching the feature metric (Characteristic value) associated with the centroid within a threshold. (Garera: Section 0038, lines 8-10- thus the confidence score is in between the first and the second threshold)


Claim 19, Garera in view of Pallath discloses wherein automatically forwarding the subset dataset to the crowd source service is further based on a volume (Garera: Section 0056, lines 1-2 “Some or all of the Classification” reads on volume or part of the data) of the subset dataset being above the threshold accuracy. (Garera: Section 0047, lines 1-9- thus “an analyst module may select classification values or categories of classification values on the basis of a percentage of classification (threshold accuracy) … that were marked as invalid to generate a prompt transmitted or displayed to analysts (the analyst offers crowdsourcing services)”)

Claim 20, Garera in view of Pallath discloses wherein the volume of the subset dataset that initiates forwarding to the crowd source service is based on a volume heuristics model that determines an amount of data predicted to successfully update the first model. (Garera: Section 0049, lines 1-5: if the classification value percentage is above the threshold, then the crowdsourcing is initiated to generate training data)


Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over Garera et al. (US20140297570) in view of Pallath et al. (US20170011111) and further in view of Senior (US20150269931).
Claim 7, Garera in view of Pallath does not disclose wherein the at least one processor is configured to determine the feature from the first model based on at least one of an accent, gender, or environmental background noise in the media data.
Senior discloses a system wherein the at least one processor is configured to determine the feature from the first model based on at least one of an accent, gender, or environmental background noise in the media data. (Section 0043, lines 10-15- demographic characteristic (e.g. gender or accent)). 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the teaching of considering gender or accent as a feature associated with a cluster. The motivation is that considering a characteristic or feature such as gender or accent will make the system more effective.


Cited Art
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.

Lev et al. 20170018269 discloses a method for generating topics, clusters, and grammars for use in configuring the self-help system. In operation, a speech recognition module recognizes speech in the recorded calls stored in call recording storage to generate recognized text, which is stored in recognized text storage. The topic detection and tracking system then detects phrases in the recognized text to generate a plurality of phrases and clusters those phrases to generate a plurality of clusters or topics, where each of the topics includes a plurality of phrases corresponding to that topic.
Mones et al. 20170154314 discloses a system that transforms unstructured data into at least one structured dataset. That is, the machine learning algorithms of the system may support one or more data structure formats for inputting data. Consequently, the system may process the unstructured data in order to transform the data into a structure supported by the machine learning algorithm. In some examples, this process involves visual recognition processing on visual data, speech recognition processing on audio data, or natural language processing on text data in order to produce a set of values or labels that are used in a data set that conforms to the data structure supported by the machine learning algorithm.



Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Akwasi M Sarpong whose telephone number is (571)270-3438. The examiner can normally be reached Mon-Fri. 8:00am-4:00pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, KING D POON can be reached on 571-272-7440. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/AKWASI M SARPONG/
Primary Examiner, Art Unit 2675
05/05/2022