DETAILED ACTION
This is the response to applicant’s amendment action regarding application number 16/545,708, filed August 20, 2019.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendments
The amendment filed September 7, 2022 has been entered. Examiner acknowledges receipt of Amendments to Application 16/545,708, which include: Amendments to the Claims, and Remarks containing Applicant’s amendments. 
Regarding Applicant’s Remarks, Examiner acknowledges Claims 21-22, 32-33, and 37-38 have been amended, with Claims 1-20 previously cancelled. Claims 21-40 remain pending in the application. 

Response to Arguments
Examiner acknowledges receipt of Arguments to Application 16/545,708, which include: Remarks containing Applicant’s arguments. 
Regarding Applicant’s Remarks for Claims 21-25, 28-29, 31-34, 36-38 and 40 under 35 U.S.C. 103 as being unpatentable over Honda et al., U.S. PGPUB 2019/0277913, filed 3/8/2019 [hereafter referred to as Honda], in view of Bilenko et al., U.S. PGPUB 2014/0337096, published 11/13/2014 [hereafter referred to as Bilenko], in further view of Graefe, Goetz, Query Evaluation Techniques for Large Databases, June 1993 [hereafter referred to as Graefe], in even further view of Brownlee, Jason, Bagging and Random Forest Ensemble Algorithms for Machine Learning, retrieved from web.archive.org dated June 25, 2019 [hereafter referred to as Brownlee], in even further view of Kaempf, Ulrich, The Binomial Test: A Simple Tool to Identify Process Problems, May 1995 [hereafter referred to as Kaempf]; for Claim 30 under 35 U.S.C. 103 as being unpatentable over Honda in view of Bilenko, in further view of Graefe, in even further view of Brownlee, in even further view of Kaempf as applied to Claim 21; in even further view of Won et al., Random Forest Model for Silicon-to-SPICE Gap and FinFET Design Attribute Identification, October 2016 [hereafter referred to as Won]; and for Claims 26-27, 35, and 39 under 35 U.S.C. 103 as being unpatentable over Honda in view of Bilenko, in further view of Graefe, in even further view of Brownlee, in even further view of Kaempf as applied to Claims 21, 32, and 37; in even further view of Chen, Hongge, Novel Machine Learning Approaches for Modeling Variations in Semiconductor Manufacturing (Masters Thesis), June 2017 [hereafter referred to as Chen], Examiner acknowledges Applicant’s arguments, has considered them, and has found them to be not persuasive. Examiner notes that the majority of Applicant’s arguments are directed to newly amended limitations in the amended claims which have not been previously presented, and thus necessitate further examination and re-evaluation of the amended and related original claims. 
The updated claim mappings according to the Applicant’s amended claims are provided in the relevant sections indicated below. However, Examiner has noted Applicant’s arguments contain certain broad assertions, which will be addressed in the following paragraphs.
Regarding Applicant’s Remarks:
“Regarding claim 21, the cited references, taken singly or in combination, fail to teach or suggest at least "dividing the sets of data into a plurality of groups, wherein each set of data is assigned to one group, wherein all sets of data for which the feature values meet at least one similarity criterion are in the same group, wherein the at least one similarity criterion comprises feature values for at least one feature not differing by more than a threshold amount between sets of data of the group." Support for the claim amendments may be found at, e.g., page 13, line 11. In rejecting similar features of previously presented claim 22, the OA cites paragraph [0053] of Bilenko: "the training system can group aspect values on the basis of shared aspects values ... the training system can ensure that all entries in a particular partition have at least one common aspect value (such as a particular user ID)". Applicant submits that the cited paragraph and Bilenko in general are silent on the amended features of claim 21, wherein the similarity criterion comprises feature values for at least one feature not differing by more than a threshold amount between sets of data of the group, in the manner recited.”
Examiner has considered this argument, and finds the argument to be not persuasive. Examiner points out that Applicant’s argument is directed to a newly introduced limitation that was not previously presented (“wherein the at least one similarity criterion comprises feature values for at least one feature not differing by more than a threshold amount between sets of data of the group”), with a broad assertion that the Bilenko reference does not teach that particular limitation by merely indicating that the aspect ID value taught in Bilenko does not teach the amended limitation. Examiner reminds Applicant that MPEP 2111 requires that during patent examination, the pending claims must be given their broadest reasonable interpretation consistent with the specification, and an Examiner must construe claim terms in the broadest reasonable manner during prosecution as is reasonably allowed in an effort to establish a clear record of what applicant intends to claim. Examiner points out that under its broadest reasonable interpretation, the newly amended limitation broadly recites applying a similarity criterion where a feature does not differ by more than a threshold amount between data entries containing that feature. As indicated in the Non-Final Office Action mailed July 7, 2022, Bilenko [0049]-[0053] teaches other similarity criteria that can be used for grouping data entries into a same group, in addition to a shared or common aspect value (i.e., aspect ID). Bilenko [0050] teaches grouping aspect values based on a measure of the frequency with which different values occur within the master dataset ([0050]: “… the training system can group aspect values based on a frequency measure … the training system can assess the frequency at which each aspect values occurs within the master dataset 110. 
The training system can then group together aspect values that have similar frequency values.”), while Bilenko [0051] teaches using hashing techniques to associate different partitions/groups with different hash buckets, where a hashing function can be applied to route a data entry containing a particular aspect value to a corresponding hash bucket ([0051]: “… the training system can group aspect values using a hashing technique … the training system can associate different partitions with different hash buckets. The training system can then apply a hashing function to route a particular aspect value to at least one of the hash buckets.”). A person having ordinary skill in the art would understand that a grouping technique such as calculating the frequency of a feature value and grouping data entries based on similar frequency values requires determining frequency ranges, such that data entries whose features exhibit frequency values within a given frequency range are placed together in a same group, and hence represents a grouping technique based on a similarity criterion of grouping together features that do not differ by more than a threshold amount (i.e., a frequency range). Similarly, a hashing technique can also be thought of as another type of similarity criterion that applies a hash function that produces a result based on a calculation that involves applying a threshold range for the feature value, and computing a hash function result that identifies and groups different data entries into different assigned hash buckets representing the hash function result. 
Examiner also points out that Applicant’s specification p.13 lines 20-26 teach an analogous form of similarity criterion that is consistent with the teachings found in Bilenko, where different data entries are grouped together based on different frequency percentages of a selected feature value (Applicant’s specification p.13 lines 20-26: “… in particular when the feature comprises continuous numerical values, identification of similar feature values can comprise binning the values into a limited number of groups (bins). … a division of the feature values into 10 bins can comprise Bin1 (=less than or equal to 10percentile), Bin2 (=greater than 10percentile, but less than or equal to 20percentile) …, and Bin10 (=greater than 90percentile, but less than or equal to 100percentile) …”). Given the above evidence in view of the amended limitation, the Bilenko reference is still within scope of the Applicant’s claimed invention and still teaches the amended limitation as recited under its broadest reasonable interpretation. Hence, Applicant’s argument is not persuasive, and the existing prior art rejection is maintained.
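For illustration only, the percentile binning quoted above from Applicant’s specification can be sketched as follows. This is a minimal sketch and is not taken from the Bilenko reference or from Applicant’s specification; the function name group_by_decile and the choice of ten equal percentile ranges are assumptions made purely for the example.

```python
from bisect import bisect_right
from collections import defaultdict


def group_by_decile(values):
    """Group numeric feature values into percentile bins 1..10.

    Hypothetical helper: Bin1 holds values at or below the 10th percentile,
    Bin2 holds values above the 10th and at or below the 20th, and so on.
    """
    ordered = sorted(values)
    n = len(ordered)
    groups = defaultdict(list)
    for v in values:
        # Number of values less than or equal to v (its percentile rank * n / 100).
        count_le = bisect_right(ordered, v)
        # Integer ceiling of (count_le * 10 / n) avoids floating-point edge cases.
        bin_index = (count_le * 10 + n - 1) // n
        groups[bin_index].append(v)
    return dict(groups)


groups = group_by_decile(list(range(1, 21)))
```

In this sketch, two values fall in the same group only when their percentile ranks lie within the same ten-point range, which is one way of reading a criterion of values not differing by more than a threshold amount.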
Regarding Applicant’s Remarks:
“Further regarding claim 21, the cited references, taken singly or in combination, fail to teach or suggest at least "an aggregated representation of feature values for the sets of data of the group, wherein the aggregated representation of features values comprises an average of the feature values for the sets of data of the group. " The OA relies on paragraphs [0102]-[0109] and [0116]-[0119] of Honda as allegedly teaching related features of previously presented claim 21. Applicant respectfully disagrees. The cited paragraphs and Honda in general teach a process for binning chips into numeric bins, but Honda is silent on an aggregated representation of feature values comprising an average of the feature values for the sets of data of the group, in the manner recited.”
Examiner has considered this argument, and finds the argument to be not persuasive.
Examiner points out that Applicant’s argument is directed to a newly introduced limitation that was not previously presented (“wherein the aggregated representation of feature values comprises an average of the feature values for the sets of data of the group”), with a broad assertion that the Honda reference does not teach an average of feature values as recited in the amended limitation. Examiner points out that under its broadest reasonable interpretation, the newly amended limitation broadly recites that the data entries within a group contain an average value as an aggregated representation of feature values. As indicated in the Non-Final Office Action mailed July 7, 2022, Honda [0102]-[0109], [0116]-[0119] teaches binning chips into numeric bins ranging from hardbin=1 to hardbin=n at the WS (wafer sort) and FT (final test) test levels, including the aggregation of WS and FT data features at the lot level, and assigning lot-level averages as features to individual chips (Honda [0104]-[0105]: “Lot Level Aggregation at WS … All the above WS features can also [be] aggregated at the lot level and the lot-level averages are assigned as features to individual chips.”; and [0118]-[0119]: “Lot Level Aggregation at FT … All the above FT features can also be aggregated at the lot level and the lot-level averages can be assigned as features to individual chips.”). A person having ordinary skill in the art would understand that the term “lot level” broadly refers to another grouping involving sets of wafers containing a plurality of individual chips, such that the WS and FT measurement data collected at a chip level (collectively representing a data entry) can also be grouped as a set of data entries represented at a lot level, where each different lot level represents an aggregated set of data entries representing a subset of wafers containing individual chips. 
These lot-level averages are based on WS and FT features including the fractions of total counts of passed/failed chips per hardbin taught earlier by the Honda reference (Honda [0100]-[0103]: “… Fraction of Passing Chips Per Wafer at WS … Hardbin represents categorization of the health of the chip … assignment of hardbin=1 at WS can indicate that a chip passed testing at WS. The fraction of the chips passed at WS can be an indicator of high wafer quality. If majority are at hardbin > 1 at WS, it can indicate poor wafer quality. … Fraction of Each Hardbin Label Per Wa[f]er at WS … At WS chips can be binned into numeric bins ranging from WS hardbin=1 to WS hardbin=n … Hardbin is a code typically applied to a particular test result … hardbin=1 typically means that the chip passed the test, while hardbin > 1 typically means that the chip failed that test, for reasons which are indicated by the particular hardbin code … We can count the fraction of each hardbin label grouped by wafer, and assign that fraction as a feature to each chip on the wafer …”; and [0109]-[0117]: “At final test, chips can be assigned a FT hardbin. Hardbin=1 can indicate a pass at FT and Hardbin > 1 can indicate a FT fail. … Fraction of Passing Chips Per Wafer at FT … Just as with WS, an assignment of hardbin=1 can indicate a pass at FT and assignment of hardbin > 1 can indicate a FT fail. The fraction of the chips passed at FT can be an indicator of good wafer health … Fraction of Each Hardbin Label Per Wa[f]er at FT … At FT chips can be binned into numeric bins ranging from FT hardbin=1 to FT hardbin=n. We can count the fraction of each hardbin label grouped by wafer, and typically assign that fraction as a feature to each chip on the wafer …”). 
Hence, these lot-level averages are based on a plurality of data features associated with subsets containing individual chips (representing feature values for the sets of data of a respective group), where these lot-level averages are further assigned as additional features to the corresponding individual chip data entries, such that the creation and assignment of these lot-level average values as additional features (representing the lot-level aggregation of the respective features at the WS and FT test levels) correspond to storing an average value as a feature/attribute stored within the data entries belonging to a respective group, where this average value represents an aggregated representation of feature values. Given the above evidence in view of the amended limitation, the Honda reference still teaches the amended limitation as recited under its broadest reasonable interpretation. Hence, Applicant’s argument is not persuasive, and the existing prior art rejection is maintained.
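For illustration only, the lot-level aggregation described above (computing a lot-level average of a feature and assigning it back to each individual chip) can be sketched as follows. This is a minimal sketch under stated assumptions and is not code from the Honda reference; the function name add_lot_level_average and the field names "lot" and "ws_pass_fraction" are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def add_lot_level_average(chips, feature):
    """Return chip records with the lot-level average of `feature` added.

    Hypothetical helper: each chip is a dict with a "lot" key and a numeric
    value stored under `feature`.
    """
    # Collect the feature values of all chips belonging to each lot.
    values_by_lot = defaultdict(list)
    for chip in chips:
        values_by_lot[chip["lot"]].append(chip[feature])

    # One average per lot (the aggregated representation of the feature).
    lot_average = {lot: mean(vals) for lot, vals in values_by_lot.items()}

    # Assign the lot-level average as an additional feature of every chip.
    return [
        {**chip, f"lot_avg_{feature}": lot_average[chip["lot"]]}
        for chip in chips
    ]


chips = [
    {"lot": "A", "ws_pass_fraction": 0.5},
    {"lot": "A", "ws_pass_fraction": 1.0},
    {"lot": "B", "ws_pass_fraction": 0.25},
]
enriched = add_lot_level_average(chips, "ws_pass_fraction")
```

In this sketch every chip keeps its own record and gains one extra field holding the average for its lot, which mirrors the reading of Honda relied upon above.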
Regarding Applicant’s Remarks:
“Further regarding claim 21, the cited references, taken singly or in combination, fail to teach or suggest at least "an aggregated representation of at least one label value of the sets of data of the group, wherein the aggregated representation of the at least one label value comprises a sum of label values for the sets of data of the group. " The OA relies on paragraphs [0102]-[0109] and [0116]-[0119] of Honda as allegedly teaching related features of previously presented claim 21. Applicant respectfully disagrees, as there is no mention or suggestion in Honda of an aggregated representation of a label value comprising a sum of label values for the sets of data of the group, in the manner recited.”
Examiner has considered this argument, and finds the argument to be not persuasive.
Examiner points out that Applicant’s argument is directed to a newly introduced limitation that was not previously presented (“wherein the aggregated representation of the at least one label value comprises a sum of label values for the sets of data of the group”), with a broad assertion that the Honda reference does not teach a sum of label values as recited in the newly introduced limitation. Examiner points out that under its broadest reasonable interpretation, the newly introduced limitation broadly recites that the data entries within a group contain a label value representing a sum of label values as an aggregated representation of the label. As indicated in the Non-Final Office Action mailed July 7, 2022, Honda teaches pass/fail labels associated with the WS and FT features collected for individual chips, where these pass/fail labels determine good chips with a “pass” label and bad chips with a “fail” label (Honda [0102]-[0109]: “… At WS chips can be binned into numeric bins ranging from WS hardbin=1 to WS hardbin=n … Hardbin is a code typically applied to a particular test result … hardbin=1 typically means that the chip passed the test, while hardbin > 1 typically means that the chip failed that test … At final test, chips can be assigned a FT hardbin. Hardbin=1 can indicate a pass at FT and Hardbin > 1 can indicate a FT fail.”). Honda additionally teaches determining a number of chips that passed at the WS level as an indicator of the wafer quality, where this wafer quality indicator (representing poor or healthy wafer quality) that counts the number of chips that have an associated “pass” label at the WS test level represents a label value (i.e., poor or healthy wafer quality) that contains a value representing a sum of label values (i.e., the number of passed WS level chips) (Honda [0112]-[0113]: “Chip Count Per Wafer at FT … The number of chips that passed at the WS level can be an indicator of the wafer quality. 
If data for only a few chips passed from WS, it can be an indication of poor wafer quality … if most or all chips passed at WS, it can be an indication of a healthy wafer.”). Honda [0124] also teaches that the dataset modeling the measured features from each individual chip can be enhanced with the described enriched/engineered features and indicators, and hence this wafer quality indicator (representing poor or healthy wafer quality) is also included as a feature for each individual chip, such that this process of identifying and assigning the number of chips passed at the WS level as an indicator of poor or healthy wafer quality corresponds to a label value representing a sum of label values as an aggregated representation of the label (as recited in the Applicant’s newly introduced claim limitation). Given the above evidence in view of the newly introduced limitation, the Honda reference is still relevant with respect to Applicant’s newly introduced limitation, and is shown to teach the newly introduced limitation as recited under its broadest reasonable interpretation. Hence, Applicant’s argument is not persuasive, and the existing prior art rejection is maintained.
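For illustration only, counting the chips that passed at the WS level per wafer and attaching that count to each chip can be sketched as follows. This is a minimal sketch under stated assumptions and is not code from the Honda reference; the function name add_ws_pass_count and the field names "wafer" and "ws_hardbin" are hypothetical, and a pass is taken to be hardbin=1 as in the passages quoted above.

```python
from collections import Counter


def add_ws_pass_count(chips):
    """Return chip records with the per-wafer count of WS passes added.

    Hypothetical helper: each chip is a dict with "wafer" and "ws_hardbin"
    keys, where ws_hardbin == 1 is treated as a pass label.
    """
    # Sum the pass labels per wafer (each passing chip contributes 1).
    passed = Counter(
        chip["wafer"] for chip in chips if chip["ws_hardbin"] == 1
    )
    # A Counter returns 0 for wafers with no passing chips.
    return [{**chip, "ws_pass_count": passed[chip["wafer"]]} for chip in chips]


chips = [
    {"wafer": "W1", "ws_hardbin": 1},
    {"wafer": "W1", "ws_hardbin": 3},
    {"wafer": "W1", "ws_hardbin": 1},
    {"wafer": "W2", "ws_hardbin": 2},
]
enriched = add_ws_pass_count(chips)
```

In this sketch the per-wafer count is a sum over the pass labels of the chips on that wafer, stored as an additional field on each chip record.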
Regarding Applicant’s Remarks:
“Further regarding claim 21, the cited references, taken singly or in combination, fail to teach or suggest at least "randomizing the number of sets of data of each of the plurality of groups of the reduced training set of data a plurality of times in order to obtain a plurality of randomized reduced training sets of data, wherein randomizing the number of sets of data is performed based on a respective probability distribution characterizing a respective probability to obtain different values of the data representative of the number of sets of data of the respective group". The OA acknowledges that Honda in view of Bilenko, Graefe and Brownlee fail to explicitly teach this feature, and relies on pp 160-161 and 162-165 of Kaempf as allegedly teaching these features. Applicant respectfully disagrees. The cited pages and Kaempf in general teach to analyze random defects in a batch of wafers to determine if the defects are systematic or randomly distributed, but fails to teach or suggest randomizing a number of sets of data of a group of a reduced training set of data, in the manner recited. Specifically, claim 21 describes a process to randomize the number of sets of data associated with each of a plurality of groups a plurality of times according to a probability distribution, to obtain a plurality of randomized reduced trainings sets of data, which is entirely unrelated to the teachings of Kaempf.”
Examiner has considered this argument, and finds the argument to be not persuasive.
Examiner points out that Applicant’s above argument is directed to the newly introduced limitation in independent Claim 21 that was not previously entered. Examiner also points out that Applicant has also removed the earlier recited limitation in independent Claim 21 where the Kaempf reference was applied, and hence Applicant’s above arguments are also moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument. The updated claim mapping for this newly introduced limitation in independent Claim 21 is provided in the relevant sections indicated below.
Regarding Applicant’s Remarks:
“Thus, for at least the above reasons, Applicant submits that the cited art fails to anticipate or disclose all the features and limitations of claim 21, and so claim 21, and those claims respectively dependent therefrom, are patentably distinct and non-obvious over the cited art, and are thus allowable. Independent claims 32 and 37 recite similar features to claim 21, and for the same reasons these claims, as well as their respective dependent claims, are also allowable.”
Examiner has considered this argument, and finds the argument to be not persuasive.
Examiner notes that Applicant does not provide any additional arguments other than referencing Applicant’s previous set of arguments made for the limitations recited in independent Claim 21. As established in response to the previous set of arguments in the above paragraphs, Applicant’s arguments concerning the identified limitations in independent Claim 21 were not persuasive, and hence Applicant’s arguments for the same limitations present in independent Claims 32 and 37 are also not persuasive, and thus the prior art rejections are maintained.
As noted above, Applicant’s amended claim limitations that were not presented earlier necessitate further examination and re-evaluation of the amended and related original claims. The updated claim mappings according to the Applicant’s amended claims are provided in the relevant sections indicated below. 

Specification
The disclosure is objected to because of the following informalities: 
Applicant’s specification p.21 lines 9-10: The following lines contain a typographical error: “… Other algorithms include regression model, bagging model, decision tree model, associate rule model, neural network model, etc.”. Appropriate correction is required.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 21-40 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Regarding Claims 21, 32, and 37,
All three claims recite the term “reduced training set of data” in the context of the following amended and newly introduced limitations:
“… build[ing] a reduced training set of data which comprises an aggregated representation of the training set of data, the building comprising: 
dividing the sets of data into a plurality of groups, wherein each set of data is assigned to one group, wherein all sets of data for which the feature values meet at least one similarity criterion are in the same group, wherein the at least one similarity criterion comprises feature values for at least one feature not differing by more than a threshold amount between sets of data of the group; 
storing in the reduced training set of data for each group an aggregated set of data comprising:
an aggregated representation of feature values for the sets of data of the group, wherein the aggregated representation of feature values comprises an average of the feature values for the sets of data of the group,
an aggregated representation of at least one label value of the sets of data of the group, wherein the aggregated representation of the at least one label value comprises a sum of label values for the sets of data of the group,
a number of sets of data of the respective group; …”, 
where Applicant’s usage of the term “reduced training set of data” renders the claim as being indefinite with respect to the recited amended and newly introduced limitations describing the building or generation of the training set, as the latest set of newly introduced and amended limitations do not exhibit any reduction, deletion, or removal of data features/attributes or data entries being performed during the building or storing of this training set of data that would support the recited limitation that a reduced training set of data is being built. Examiner points out that the subsequent amended and newly introduced limitations that further describe the building of this reduced training set of data (“dividing the sets of data into a plurality of groups, … wherein all sets of data for which the feature values meet at least one similarity criterion are in the same group, wherein the at least one similarity criterion comprises feature values for at least one feature not differing by more than a threshold amount between sets of data of the group”) broadly indicate grouping all sets of data (i.e., data entries) into a plurality of groups based on a similarity criterion, and do not recite any reduction, deletion, or removal of data features or data entries that would result in a reduced training set of data. 
Similarly, Examiner further points out that the terms “aggregated representation of the training set of data” and “aggregated set of data” broadly indicate a grouping of data entries, while the terms “aggregated representation of feature values” and “aggregated representation of at least one label value” broadly indicate that one or more data entries within a group include respective data features/attributes representing an average value or a sum of label values, respectively, and hence Examiner also finds that these recited terms do not describe any reduction, deletion, or removal of data features or data entries that would result in the generation of a reduced training set of data. Examiner points out that Applicant’s specification p.15 lines 12-16 and Figure 4 indicates examples of aggregated representations (average, median, sum, variance, most frequently observed values, etc.), where these aggregated representations are stored in a “reduced” training set of data (Applicant’s specification p.15 lines 12-16: “… for a given group G, an aggregated set of data can be stored in the reduced training set of data, which comprises an aggregation of the feature values of the sets of data belonging to this given group. The aggregation can rely on a mathematical function (e.g., statistical functions) such as average, median, sum, variance, mode i.e., the most frequently observed value, or other functions.”). The storage of these aggregated representations is interpreted as adding these additional calculated features/attributes to a “reduced” training set of data (thereby increasing the amount of data stored for each data entry), and therefore does not recite any reduction, deletion, or removal of data features or data entries that would result in the generation of a reduced training set of data. 
As a further example, Examiner also points out that Applicant’s Figure 4 contains an aggregated field “Data count” that also exists in Applicant’s Figure 14A, where Applicant’s specification p.17 lines 10-11 describe this field as a calculated feature representing a number of sets of data corresponding to Applicant’s amended limitation “a number of sets of data of the respective group” (Applicant’s specification p.17 lines 10-11: “A counter 416 is determined in this particular example as the number of sets of data based on which the aggregated set of data was built for a group.”), where this feature is added into each data entry shown in Figures 4 and 14A (thereby increasing the amount of data stored for each data entry), and therefore does not demonstrate any reduction, deletion, or removal of data features that would result in the generation of a reduced training set of data. Therefore, Examiner notes it is unclear where any reduction is being performed in the recited limitations that would produce a “reduced training set of data”, given that none of the recited limitations describe any action or series of steps that would represent reduction, removal, or deletion of feature data or data entries under its broadest reasonable interpretation, thus rendering the term “reduced training set of data” as being indefinite. Examiner requests that the Applicant further amend the limitations in the respective claims to make it clear if the intention is to claim a “reduced training set of data” or not. If the intention is to claim an “aggregated representation of the training set of data”, Applicant should remove the term “reduced training set of data” from the independent and related dependent claims. 
If the intention is to claim a “reduced training set of data”, Applicant should amend the respective limitations such that they recite steps that reduce or remove/delete data features/attributes such that they correspond to steps of building or generating a “reduced training set of data”. For the purposes of examination, Examiner will treat the term “reduced training set of data” recited in the claims as an alternate name for an “aggregated representation of the training set of data”, where these aggregated representations of data are provided as additional features to the training set.
Additionally, all three claims recite the following amended and newly introduced limitations (“dividing the sets of data into a plurality of groups, wherein each set of data is assigned to one group, wherein all sets of data for which the feature values meet at least one similarity criterion are in the same group, …”), which further render the claims indefinite, as the first limitation (“dividing the sets of data into a plurality of groups”) appears to broadly indicate all data entries associated with the training set are grouped in a plurality of groups, but the second and third limitations (“wherein each set of data is assigned to one group”, and “wherein all sets of data for which the feature values meet at least one similarity criterion are in the same group”) appear to broadly indicate that each data entry is assigned to a singular group according to a similarity criterion that is used for all sets of data entries in the training set, hence contradicting the first limitation of determining a plurality of groups containing respective data entries. Examiner points out that this contradiction makes it unclear whether the training set contains one large singular group or a plurality of groups. In other words, it is unclear whether these amended limitations collectively are reciting a training set containing a plurality of groups (with each group containing respective data entries that are grouped together based on at least one similarity criterion applied to each of the plurality of groups), or a training set containing a single group (with all data entries grouped together based on at least one similarity criterion applied to the same group). Applicant’s Remarks point to Applicant’s specification p.13 line 11 as providing support for the amended limitations indicating a plurality of groups, where lines 10-11 contain the description based on Applicant’s Figure 3A (“… dividing the sets of data into a plurality of groups. 
Each set of data is assigned to a group.”). However, Examiner also points out that Applicant’s Figure 3A only shows an example of a set of data entries grouped into a single group (as denoted by the bracket notation in Figure 3A), thus still making it unclear whether the Applicant intends to apply the at least one similarity criterion to form a single group of data entries, or apply respective similarity criterions to form a plurality of respective groups of data entries. Examiner requests that the Applicant amend the limitations to make it clear whether the claims are reciting a training set containing a plurality of groups, with each group containing respective data entries that are grouped together based on at least one similarity criterion applied to each of the plurality of groups, or reciting a training set containing a single group, with all data entries grouped together based on at least one similarity criterion applied to the same group. For the purposes of examination, Examiner will interpret this limitation in the context of the prior art.
Claims 22-31 are dependent claims tracing back to parent independent Claim 21, and as such inherit the same indefiniteness issues established in Claim 21. Hence, Claims 22-31 are also rejected as being indefinite by virtue of dependency.
Claims 33-36 are dependent claims tracing back to parent independent Claim 32, and as such inherit the same indefiniteness issues established in Claim 32. Hence, Claims 33-36 are also rejected as being indefinite by virtue of dependency.
Claims 38-40 are dependent claims tracing back to parent independent Claim 37, and as such inherit the same indefiniteness issues established in Claim 37. Hence, Claims 38-40 are also rejected as being indefinite by virtue of dependency.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 21-27, 29, and 31-40 are rejected under 35 U.S.C. 103 as being unpatentable over 
Honda et al., U.S. PGPUB 2019/0277913, filed 3/8/2019 [hereafter referred as Honda], in view of Bilenko et al., U.S. PGPUB 2014/0337096, published 11/13/2014 [hereafter referred as Bilenko], in further view of Brownlee, Jason, Bagging and Random Forest Ensemble Algorithms for Machine Learning, retrieved from web.archive.org dated June 25, 2019 [hereafter referred as Brownlee], in even further view of Chen, Hongge, Novel Machine Learning Approaches for Modeling Variations in Semiconductor Manufacturing (Masters Thesis), June 2017 [hereafter referred as Chen].
Regarding amended Claim 21, 
Honda teaches
(Currently Amended) A method comprising, by a processing unit and a memory coupled to a non-transitory computer-readable memory medium (Examiner’s note: Honda teaches preparing a dataset based on measurements involving testing individual chips located on wafers in a semiconductor manufacturing process, where these measurements include measured feature data and engineered/enriched data produced from the different WAT/PCM, WS, CP, FT test levels, where the WS and FT test levels produce hardbin labels indicating passed/good chips and failed/bad chips. Honda teaches this dataset is prepared as part of a machine learning pipeline that generates and stores this dataset as training data for one or more predictive models. A person having ordinary skill in the art would understand that performing the set of process steps and machine learning pipeline requires a computing system containing a processor and associated memory (e.g., RAM and disk storage) coupled to each other, where the associated memory stores computer instructions representing these process steps and associated machine learning pipeline to execute the process steps, where the process steps include storing the input data as well as all outputs resulting from the process steps and machine learning pipeline, and where the outputs include predictions produced from a machine learning model (Honda Figure 3 and [0035]-[0041]; Figure 7 and [0059]; [0124]; and Figures 12-14).): …
… obtaining a training set of data comprising a plurality of sets of data each representative of an electronic item, each set comprising feature values for a plurality of features, and for at least one label (Examiner’s note: Under its broadest reasonable interpretation, the phrase “plurality of sets of data each representative of an electronic item, each set comprising feature values for a plurality of features” broadly indicates a plurality of data entries, with each data entry containing a plurality of features/attributes collectively representing an electronic item, and hence this limitation broadly recites generating a training data set including a plurality of data entries, with each data entry containing a plurality of features/attributes and labels collectively representing an electronic item. Honda teaches preparing a dataset based on measurements involving testing individual chips located on wafers in a semiconductor manufacturing process, where these measurements include measured feature data and engineered/enriched data produced from the different WAT/PCM, WS, CP, FT test levels, where the WS and FT test levels produce hardbin labels indicating passed/good chips and failed/bad chips. 
Honda teaches this dataset is prepared as part of a machine learning pipeline that generates and stores this dataset as training data for one or more predictive models, and hence this process of preparing a dataset containing semiconductor chip measurements for use in one or more predictive models corresponds to a process of generating a training data set including a plurality of data entries, with each data entry containing a plurality of features/attributes and labels collectively representing an electronic item (Honda Figure 1; [0022]-[0025]; [0088]-[0089]; [0096]-[0097]: “WS Data Enrichment … The WS data can contain hundreds or thousands of measured fields (a typical example is 250 measurement fields) which can be persistent week-over-week.”, [0098]-[0105], [0106]-[0107]: “FT Data Enrichment … The FT data can contain up to hundreds or thousands of measurements (a typical example is about 50 measurement fields) which are persistent week-over-week.”; [0124]: “The dataset can be prepared for modeling by assigning to each chip the raw measurement fields from PCM, WS and FT as well as augmenting the raw fields with the engineered/enriched features as described above. This can be about 1500-2000 features (or predictor variables) per chip.”; and [0136]-[0144]).) …
… building a reduced training set of data which comprises an aggregated representation of the training set of data (Examiner’s note: As indicated earlier, the term “reduced training set of data” exhibits a 112(b) indefiniteness issue, and hence for the purposes of examination, this limitation will be interpreted as building an aggregated representation of the training set of data, which is then defined as a “reduced training set of data”, where these aggregated representations of data are provided as additional features to the training set. As indicated earlier, Honda teaches a machine learning pipeline for generating and storing a training data set for one or more predictive models. The machine learning pipeline includes various stages for performing feature data cleansing, feature selection, and feature engineering on the WS and FT measurements collected for individual chips, where the set of measurements for each individual chip represents a set of data/data entry, and the set of measurements for a plurality of individual chips represents a plurality of sets of data/plurality of data entries. Honda further teaches the feature engineering on the data entries involves enriching and augmenting the existing measurements by performing statistical methods on the feature data (minimum, maximum and 10-90 percentile range), as well as performing lot level aggregation of the chips at the WS and FT test levels, identifying various numeric bins (hardbins) in which each chip can be categorized (with hardbin=1 indicating a passed/good chip, and hardbin > 1 indicating a failed/bad chip, with the assigned hardbin number corresponding to a failure reason code), and determining poor or healthy wafer quality based on a number of passed chips at the WS test level. 
A person having ordinary skill in the art would understand that performing statistical methods to determine lot level aggregation, identify different bins for categorizing passed/good and failed/bad chips, and determine poor or healthy wafer quality based on a number of passed chips are various ways of grouping the set of measurements associated with each individual chip, and hence these techniques of grouping the different data entries representing the set of chips and enriching and augmenting the existing measurements with additional feature data corresponds to a process for building an aggregated representation of the training set of data, which is defined as a “reduced training set of data” (Honda Figure 3, elements 310, 320, 330; and [0035]-[0041], [0047]-[0051]; Figure 7 and [0059]; [0091]-[0122]; and [0123]-[0130]).) …
… storing in the reduced training set of data, for each group, an aggregated set of data comprising: 
an aggregated representation of feature values for the sets of data of the group, wherein the aggregated representation of feature values comprises an average of the feature values for the sets of data of the group (Examiner’s note: Under its broadest reasonable interpretation, this limitation broadly recites storing an average value as a feature/attribute stored within the data entries belonging to a respective group, where this average value represents an aggregated representation of feature values. As indicated earlier, Honda teaches enriching and augmenting existing WS and FT measurements by identifying numeric bins ranging from hardbin=1 to hardbin=n at the WS and FT test levels, where each bin contains a count of respective chips, and each bin represents a label indicating passed/good chips (hardbin=1) or failed/bad chips with an assigned hardbin failure reason code (hardbin > 1). These associated hardbin counts are further used to identify additional features such as fractions of total counts of passed/failed chips per hardbin, which can be further aggregated at the lot level at WS and FT test levels. These lot-level averages (based on the features present in the corresponding chip data entries) are then assigned as additional features to the corresponding individual chips, such that the creation and assignment of these lot-level average values as an additional feature (representing the lot-level aggregation of the respective features at the WS and FT test levels) correspond to storing an average value as a feature/attribute stored within the data entries belonging to a respective group, where this average value represents an aggregated representation of feature values (Honda [0100]-[0105]: “… Fraction of Passing Chips Per Wafer at WS … Hardbin represents categorization of the health of the chip … assignment of hardbin=1 at WS can indicate that a chip passed testing at WS. The fraction of the chips passed at WS can be an indicator of high wafer quality. 
If majority are at hardbin > 1 at WS, it can indicate poor wafer quality. … Fraction of Each Hardbin Label Per Wa[f]er at WS … At WS chips can be binned into numeric bins ranging from WS hardbin=1 to WS hardbin=n … Hardbin is a code typically applied to a particular test result … hardbin=1 typically means that the chip passed the test, while hardbin > 1 typically means that the chip failed that test, for reasons which are indicated by the particular hardbin code … We can count the fraction of each hardbin label grouped by wafer, and assign that fraction as a feature to each chip on the wafer …  Lot Level Aggregation at WS … All the above WS features can also [be] aggregated at the lot level and the lot-level averages are assigned as features to individual chips.”; and [0109]-[0119]: “At final test, chips can be assigned a FT hardbin. Hardbin=1 can indicate a pass at FT and Hardbin > 1 can indicate a FT fail. … Fraction of Passing Chips Per Wafer at FT … Just as with WS, an assignment of hardbin=1 can indicate a pass at FT and assignment of hardbin > 1 can indicate a FT fail. The fraction of the chips passed at FT can be an indicator of good wafer health … Fraction of Each Hardbin Label Per Wa[f]er at FT … At FT chips can be binned into numeric bins ranging from FT hardbin=1 to FT hardbin=n. We can count the fraction of each hardbin label grouped by wafer, and typically assign that fraction as a feature to each chip on the wafer … Lot Level Aggregation at FT … All the above FT features can also be aggregated at the lot level and the lot-level averages can be assigned as features to individual chips.”).),
an aggregated representation of at least one label value of the sets of data of the group, wherein the aggregated representation of the at least one label value comprises a sum of label values for the sets of data of the group (Examiner’s note: Under its broadest reasonable interpretation, this limitation broadly recites storing a label feature/attribute represented by a sum of label values for a group of data entries. As indicated earlier, Honda teaches enriching and augmenting existing WS and FT measurements by identifying numeric bins ranging from hardbin=1 to hardbin=n at the WS and FT test levels, where each bin contains a count of respective chips, and each bin represents a label indicating passed/good chips (hardbin=1) or failed/bad chips with an assigned hardbin failure reason code (hardbin > 1). These associated hardbin counts are further used to identify additional features such as fractions of total counts of passed/failed chips per hardbin, where these fractions of total counts of passed/failed chips per hardbin represent a sum of total chips for each hardbin label, which corresponds to a label feature/attribute represented by a sum of label values (i.e., the count of passed/failed chips per hardbin label). Honda additionally teaches determining a number of chips that passed at the WS level as an indicator of the wafer quality, where this wafer quality indicator (representing poor or healthy wafer quality) that counts the number of chips that have an associated “pass” label at the WS test level represents a label value (i.e., poor or healthy wafer quality) that contains a label feature/attribute represented by a sum of label values (i.e., the number of passed WS level chips). 
As indicated earlier, Honda [0124] teaches that the dataset modeling the measured features from each individual chip can be enhanced with the described enriched/engineered features and indicators, and hence this wafer quality indicator (representing poor or healthy wafer quality) is also included as a feature for each individual chip. Hence, this process of identifying and assigning the fraction of a total number of chips at each hardbin label and the number of chips passed at the WS level as an indicator of poor or healthy wafer quality corresponds to a process for storing different types of label feature/attributes represented by a sum of label values for a group of data entries (Honda [0100]-[0105]; [0112]-[0113]: “Chip Count Per Wafer at FT … The number of chips that passed at the WS level can be an indicator of the wafer quality. If data for only a few chips passed from WS, it can be an indication of poor wafer quality … if most or all chips passed at WS, it can be an indication of a healthy wafer.”; [0109]-[0119]; and [0124]).),
a number of sets of data of the respective group (Examiner’s note: Under its broadest reasonable interpretation, this limitation broadly recites storing a feature/attribute represented by a number of data entries for a respective group. As indicated earlier, Honda teaches recording chip counts per wafer at both WS and FT test levels, where these chip counts represent the number of chips at each WS and FT test level per wafer (where a wafer represents a grouping of individual chips), and thus this process that records the chip counts per wafer at both WS and FT test levels corresponds to storing a feature/attribute represented by a number of data entries for a respective group (Honda [0098]-[0099] and [0112]-[0113]).) …
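As a purely illustrative sketch of the three per-group quantities discussed in the preceding limitations (an average of feature values, a sum of label values, and a number of data entries per group), the following Python fragment aggregates chip-level entries by wafer. It is not taken from Honda or from Applicant’s specification; the function name `aggregate_groups` and the field names `wafer`, `vth`, and `failed` are hypothetical and chosen only for the example.

```python
from collections import defaultdict


def aggregate_groups(rows, group_key, feature_keys, label_key):
    """Return, per group, the mean of each feature, the sum of labels, and the entry count."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[group_key]].append(row)

    reduced = {}
    for group, members in buckets.items():
        n = len(members)
        reduced[group] = {
            # Average of each feature over the entries of the group.
            "mean_features": {k: sum(m[k] for m in members) / n for k in feature_keys},
            # Sum of the label values over the entries of the group.
            "label_sum": sum(m[label_key] for m in members),
            # Number of entries in the group.
            "count": n,
        }
    return reduced


# Hypothetical chip-level entries: one feature ("vth") and one binary label ("failed").
chips = [
    {"wafer": "W1", "vth": 0.50, "failed": 0},
    {"wafer": "W1", "vth": 0.70, "failed": 1},
    {"wafer": "W2", "vth": 0.40, "failed": 0},
]
reduced = aggregate_groups(chips, "wafer", ["vth"], "failed")
```

In this sketch each wafer yields one aggregated entry, so the aggregated table has fewer rows than the chip-level table; whether the aggregates replace the original entries or are appended to them as additional features is exactly the distinction raised in the indefiniteness discussion above.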
… randomizing … of each of the plurality of groups of the reduced training set of data … in order to obtain a plurality of randomized reduced training sets of data (Examiner’s note: Under its broadest reasonable interpretation in light of Applicant’s specification p.25 lines 7-23, the term “randomizing” each of the plurality of groups in order to obtain a randomized plurality of groups broadly indicates a random sampling process, and hence this limitation broadly recites performing random sampling on each plurality of groups to generate the randomized plurality of groups. Honda teaches applying machine learning techniques including bagging on a dataset to avoid having a skewed dataset that exhibits a bias towards a majority class. A person having ordinary skill in the art would understand the term “bagging” to refer to “bootstrap aggregation”, which is a known term of art, and is a process that involves randomizing a dataset with replacement to produce a plurality of randomized training sets, where this aspect of performing random sampling with replacement using a bagging algorithm produces a plurality of randomized training sets of data (Honda [0132]-[0135]: “… Oversampling methods can be the following: 1. Random oversampling. 2. Synthetic minority oversampling technique (SMOTE). 3. Bagging.”).) …
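For illustration only, random sampling with replacement of the kind commonly associated with the term “bagging” can be sketched as follows. The sketch assumes that each randomized set has the same size as the input and that sampling is uniform; neither assumption is drawn from Honda or from the claims, and the function name `bootstrap_samples` is hypothetical.

```python
import random


def bootstrap_samples(rows, n_samples, seed=0):
    """Draw n_samples randomized sets, each sampled with replacement from rows."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [[rng.choice(rows) for _ in rows] for _ in range(n_samples)]


# Hypothetical group identifiers standing in for aggregated entries.
groups = ["g1", "g2", "g3", "g4"]
samples = bootstrap_samples(groups, n_samples=3)
```

Because sampling is with replacement, a given group may appear more than once, or not at all, in any one randomized set.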
… using the plurality of randomized reduced training sets of data in a classification algorithm implementing a plurality of decision trees for determining a relationship between the at least one label and the features of the electronic items (Examiner’s note: As indicated earlier, the term “reduced training set of data” exhibits a 112(b) indefiniteness issue, and hence for the purposes of examination, this limitation will be interpreted as referencing the earlier recited aggregated representation of the training set of data, where these aggregated representations of data are provided as additional features to the training set. Thus, under its broadest reasonable interpretation in light of Applicant’s specification p.21 lines 5-10, this limitation broadly recites usage of a classification algorithm (including bagging) to implement a plurality of decision trees, using the aggregated representation of the training set of data, where these aggregated representations of data are provided as additional features to the training set. As indicated earlier, Honda teaches preparing a dataset as part of a machine learning pipeline for use in one or more predictive models, where this dataset is generated based on measurements that include measured and engineered/enriched semiconductor chip feature data produced from the WAT/PCM, WS, CP, FT test levels, where the WS and FT test levels produce hardbin labels indicating passed/good chips and failed/bad chips (Honda Figure 1; [0022]-[0025]; [0088]-[0089]; and [0091]-[0122]; Honda Figure 3, elements 310, 320, 330; and [0035]-[0041], [0047]-[0051]; Figure 7 and [0059]; [0123]-[0130]). As indicated earlier, Honda teaches a bagging algorithm, where a person having ordinary skill in the art would understand that a bagging algorithm generates a plurality of randomized training sets of data (Honda [0132]-[0135]). 
Honda Figures 12-14 further teach applying the generated training data set to various model architectures implementing various machine learning algorithms (including random forests and decision trees) to generate classification results for the RMA’ed chips (representing failures during SLT) that correspond to failure codes indicating the subsystem/stage in which the failure occurred. A person having ordinary skill in the art would understand that a random forest algorithm represents a classification algorithm containing a plurality of decision trees, and hence the usage of a random forest algorithm on a training data set containing measured and engineered/enriched semiconductor chip feature data produced from the WAT/PCM, WS, CP, FT test levels in order to identify an association between those features present in RMA’ed chips and associated failure codes (representing at least one label) corresponds to a usage of a classification algorithm to implement a plurality of decision trees, using the generated plurality of randomized training sets of data to determine relationships between at least one label and associated features of the electronic items (Honda [0123]-[0137], in particular [0136]-[0137]: “… Classification of RMA Error Codes … An RMA can be viewed as a failure of an independent SLT … Some of the failures can be captured from PCM, WS and FT data … It therefore can be necessary to classify the RMA’ed chips into failure codes indicating the subsystem that failed …”; and Figures 12-14, [0140]: “… One possible single-level model architecture … inputs the raw and engineered features into a machine learning model and output RMA probability. The machine learning algorithms used for this step can be parametric or nonparametric. Logistic regression, support vector machines, and neural networks are examples of parametric models that can be used. 
Decision trees, Random Forests, Gradient Boosted Machines and nearest neighbor classifiers are examples of non-parametric model that can be used.”).) …
… wherein each randomized reduced training set of data is used in … the classification algorithm for determining the relationship between the at least one label and the features of the electronic items (Examiner’s note: As indicated earlier, Honda teaches preparing a dataset as part of a machine learning pipeline for use in one or more predictive models, where this dataset is generated based on measurements that include measured and engineered/enriched semiconductor chip feature data produced from the WAT/PCM, WS, CP, FT test levels, where the WS and FT test levels produce hardbin labels indicating passed/good chips and failed/bad chips (Honda Figure 1; [0022]-[0025]; [0088]-[0089]; and [0091]-[0122]; Honda Figure 3, elements 310, 320, 330; and [0035]-[0041], [0047]-[0051]; Figure 7 and [0059]; [0123]-[0130]). Honda Figures 12-14 further teach applying the generated training data set to various model architectures implementing various machine learning algorithms (including random forests and decision trees) to generate classification results for the RMA’ed chips (representing failures during SLT) that correspond to failure codes indicating the subsystem/stage in which the failure occurred. 
A person having ordinary skill in the art would understand that a random forest algorithm represents a classification algorithm containing a plurality of decision trees, and hence the usage of a random forest algorithm on a training data set containing measured and engineered/enriched semiconductor chip feature data produced from the WAT/PCM, WS, CP, FT test levels in order to identify an association between those features present in RMA’ed chips and associated failure codes (representing at least one label) corresponds to a usage of a classification algorithm to implement a plurality of decision trees, using the generated plurality of randomized training sets of data to determine relationships between at least one label and associated features of the electronic items (Honda [0123]-[0137], in particular [0136]-[0137]; and Figures 12-14, [0140]).) …
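For illustration only, the general idea of a classification algorithm built from a plurality of decision trees, each trained on its own randomized sample, can be sketched with one-level trees (“stumps”) and a majority vote. This is a deliberately simplified stand-in and not a random forest as implemented in any cited reference: it omits per-split feature subsampling and deeper trees, and all function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def train_stump(sample):
    """Fit a one-level decision tree: the (feature, threshold) split with fewest training errors."""
    best = None
    n_features = len(sample[0][0])
    for f in range(n_features):
        for x, _ in sample:
            t = x[f]
            left = [y for xs, y in sample if xs[f] <= t]
            right = [y for xs, y in sample if xs[f] > t]
            left_label = Counter(left).most_common(1)[0][0]
            # If nothing falls to the right, reuse the left label for that side.
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, left_label, right_label)
    _, f, t, left_label, right_label = best
    return lambda x: left_label if x[f] <= t else right_label


def train_bagged_stumps(data, n_trees, seed=0):
    """Train each stump on a different bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    return [train_stump([rng.choice(data) for _ in data]) for _ in range(n_trees)]


def predict(trees, x):
    """Combine the respective outputs of the trees by majority vote."""
    return Counter(tree(x) for tree in trees).most_common(1)[0][0]


# Hypothetical one-feature data set: (feature tuple, label).
data = [((0.1,), 0), ((0.2,), 0), ((0.3,), 0), ((0.8,), 1), ((0.9,), 1), ((1.0,), 1)]
trees = train_bagged_stumps(data, n_trees=25)
```

Each element of `trees` has seen a different randomized sample of `data`, and `predict` bases its result on the respective outputs of all trees.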
… storing an output of the classification algorithm in the non-transitory computer-readable memory medium (Examiner’s note: As indicated earlier, Honda teaches a machine learning pipeline for processing a training data set collected from data sources during a semiconductor manufacturing process, and applying the generated training data set to multiple predictive models for failure detection and classification. A person having ordinary skill in the art would understand that performing the set of steps in the machine learning pipeline requires a computing system containing a processor and associated memory (e.g., RAM and disk storage) coupled to each other, where the associated memory contains computer instructions representing these process steps and associated machine learning pipeline to execute the process steps, and where the steps include storing the training data set as well as all outputs resulting from the process steps and machine learning pipeline in order to produce the outputs for a machine learning model (with the outputs marked as “predictions” in Honda Figures 12-14; Figure 3 and [0035]-[0041]; Figure 7 and [0059]).).
While Honda teaches feature engineering and bagging techniques to produce a plurality of randomized groupings of training data, Honda does not explicitly teach
… the building comprising: dividing the sets of data into a plurality of groups, wherein each set of data is assigned to one group, wherein all sets of data for which the feature values meet at least one similarity criterion are in the same group, wherein the at least one similarity criterion comprises feature values for at least one feature not differing by more than a threshold amount between sets of data of the group …
Bilenko teaches
… the building comprising: dividing the sets of data into a plurality of groups, wherein each set of data is assigned to one group, wherein all sets of data for which the feature values meet at least one similarity criterion are in the same group, wherein the at least one similarity criterion comprises feature values for at least one feature not differing by more than a threshold amount between sets of data of the group (Examiner’s note: As indicated earlier, this limitation exhibits a 112(b) indefiniteness issue, and hence for purposes of examination, this limitation is interpreted as broadly reciting a training set containing a plurality of groups, with each group containing respective data entries that are grouped together based on at least one similarity criterion applied to each of the plurality of groups, where the at least one similarity criterion is used to group features that do not differ by more than a threshold amount between data entries containing that feature. Bilenko teaches a master data set containing a plurality of training examples, where each training example contains aspect values and a label, where these aspect values describe different event characteristics, thus representing features for each training example (Bilenko [0032]-[0033]). Bilenko further teaches a partitioning process to produce training set instances that represent different clusters or partitions, where the partitioning process divides the master data set into multiple partitions according to the feature/attribute values identified within each training example. 
Bilenko further teaches different techniques for grouping data entries into a group, such as grouping aspect values based on a measure of the frequency at which different values occur within the master dataset, as well as grouping aspect values using hashing techniques to associate different partitions/groups with different hash buckets, where a hashing function can be applied to route a data entry containing a particular aspect value to a corresponding hash bucket. A person having ordinary skill in the art would understand that techniques such as calculating the frequency of a feature value and grouping data entries based on similar frequency values require a determination of various frequency ranges, such that data entries containing features whose frequency values fall within a certain frequency range can be placed together in the same group, and hence represent grouping techniques based on a similarity criterion of grouping together features that do not differ by more than a threshold amount (i.e., a frequency range). Similarly, a hashing technique can also be thought of as another type of similarity criterion that applies a hash function producing a result based on a calculation that involves applying a threshold range for the feature value, where the hash function result identifies and groups different data entries into the different assigned hash buckets representing the hash function result (Bilenko Figure 1, elements 110, 112; [0033], [0036]-[0037]: “… a partitioning process 112 produces a plurality of partitions (also referred to as bins) for each aspect under consideration. Each partition is associated with a set of aspect values. 
… The partitioning process performs a similar partitioning process for other aspects, include both aspects associated with single attributes and aspects associated with combinations of attributes.”; and [0049]-[0053]: “… the training system produces clusters of aspect values, where the aspect values in each cluster exhibit similar label-conditioned statistical profiles. … the training system can use other partitioning strategies to produce the partitions … the training system can group aspect values based on a frequency measure … the training system can assess the frequency at which each aspect values occurs within the master dataset 110. The training system can then group together aspect values that have similar frequency values. … the training system can group aspect values using a hashing technique … the training system can associate different partitions with different hash buckets. The training system can then apply a hashing function to route a particular aspect value to at least one of the hash buckets. …”).) …
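For illustration only, the two grouping styles discussed above can be sketched in a few lines of Python. The first function bins a numeric value by a fixed width, so that two values placed in the same bin differ by less than that width; the second routes a value to one of a fixed number of hash buckets. Both function names, the bin width, and the use of CRC-32 as the hash are assumptions made for the sketch and are not drawn from Bilenko.

```python
import math
import zlib


def threshold_group(value, threshold):
    """Assign a numeric value to a fixed-width bin; values in one bin differ by less than threshold."""
    return math.floor(value / threshold)


def hash_bucket(aspect_value, n_buckets):
    """Route an aspect value to one of n_buckets buckets using a deterministic hash (CRC-32)."""
    return zlib.crc32(str(aspect_value).encode("utf-8")) % n_buckets


# Hypothetical feature values and aspect values.
same_bin = threshold_group(0.52, 0.1) == threshold_group(0.58, 0.1)
other_bin = threshold_group(0.71, 0.1)
bucket = hash_bucket("device=mobile", 8)
```

Note that fixed-width binning guarantees only that members of one bin are close to each other; two values that straddle a bin boundary can be close and still land in different bins. Hash bucketing guarantees that identical aspect values always land in the same bucket.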
Both Honda and Bilenko are analogous art since they both teach techniques performed on a training data set containing data entries to produce a plurality of groups of training data entries.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the feature engineering and bagging methods taught in Honda and incorporate the data grouping methods taught in Bilenko as a way to group similar training data entries into a plurality of groups. The motivation to combine is taught in Bilenko, since these generated groups represent groups of optimized training data entries that, when applied to a machine learning model, minimize the loss of predictive accuracy and reduce the computation time needed to process the data in the machine learning model, thereby allowing a system to provide prediction results without sacrificing prediction accuracy (Bilenko [0048]-[0049]: “The training system may use different partitioning strategies to define partitions. In one approach, the training system can define partitions in a manner that satisfies an objective relating to loss of predictive accuracy. … Predictive accuracy refers to an extent to which the statistical information accurately represents the labels associated with individual training examples which contribute to the statistical information. … The training system minimizes the loss of predictive accuracy by performing clustering in such a manner that the instances of statistical information accurately characterize the members in the respective clusters, while minimizing, overall, the loss of descriptive information pertaining to specific members of the clusters.”).
While Honda in view of Bilenko teaches applying classification algorithms such as random forest and bagging techniques to produce randomized groups of training data, Honda in view of Bilenko does not explicitly teach
… each randomized reduced training set of data is used in a different decision tree of the plurality of decision trees … based on respective outputs of the plurality of decision trees …
Brownlee teaches
… each randomized reduced training set of data is used in a different decision tree of the plurality of decision trees … based on respective outputs of the plurality of decision trees (Examiner’s note: Under its broadest reasonable interpretation, this limitation broadly recites applying a plurality of randomized training sets of data into a plurality of decision trees, where each randomized training set of data is applied to a different decision tree to determine an association/relationship between at least one label output and associated features. Brownlee teaches bootstrap aggregation (bagging) as a machine learning ensemble method applied to decision trees and random forests, where the bootstrap aggregation method applies samples that were randomly selected (with replacement) to a plurality of different decision trees (with each decision tree performing classification), with the number of decision trees corresponding to the number of generated random samples. Brownlee further teaches the respective outputs/predictions from the plurality of decision trees after applying the random samples are used to determine variable importances between a label output/prediction and the features, such that these variable importances represent the relationship between the label output and the features (Brownlee p.1 2nd bullet: “… The Bootstrap Aggregation algorithm for creating multiple different models from a single training dataset.”; p.2 1st-6th paragraphs: “Bootstrap Aggregation (or Bagging for short) is a simple and very powerful ensemble method … Bootstrap Aggregation is a general procedure that can be used to reduce the variance for those algorithm that have high variance. … Bagging is the application of the Bootstrap procedure to a high-variance machine learning algorithm, typically decision trees. … Bagging of the CART algorithm would work as follows. 1. Create many (e.g., 100) random sub-samples of our dataset with replacement. 2. Train a CART model on each sample. 3. Given a new dataset, calculate the average prediction from each model. … if we had 5 bagged decision trees that made the following class predictions for an input sample: blue, blue, red, blue and red, we would take the most frequent class and predict blue. … The only parameters when bagging decision trees is the number of samples and hence the number of trees to include.”; p.2 Random Forest: “Random Forests are an improvement over bagged decision trees. … The random forest algorithm changes this procedure so that the learning algorithm is limited to a random sample of features of which to search. …”; and p.2 Variable Importance: “As the Bagged decision trees are constructed, we can calculate how much the error function drops for a variable at each split point … These drops in error can be averaged across all decision trees and output to provide an estimate of the importance of each input variable. The greater the drop when the variable was chosen, the greater the importance. These outputs can identify subsets of input variables that may be most or least relevant to the problem and suggest at possible feature selection experiments you could perform where some features are removed from the dataset.”).) …
Both Honda in view of Bilenko and Brownlee are analogous art since both teach machine learning techniques involving the bagging/bootstrap aggregation method.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the bagging/bootstrap aggregation techniques taught in Honda in view of Bilenko and apply these techniques to a random forest model taught in Brownlee as a way to identify and determine important or relevant variables and features for further tuning and improvement to the classification model. The motivation to combine is taught in Brownlee, since determining variable importances is useful in identifying subsets of input variables that may be most or least relevant to the problem, thus allowing a way to identify and tune the model by removing certain features from the dataset, resulting in improved performance and accuracy of the classification model (Brownlee p.2 Variable Importance).
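For illustration only, the bagging procedure quoted from Brownlee (random sub-samples drawn with replacement, one model trained per sample, most frequent class taken as the prediction) can be sketched roughly as follows. This is a minimal sketch and not Brownlee's or Honda's implementation: the one-split `train_stump` is a hypothetical stand-in for a CART tree, and the data, function names, and the choice of 25 trees are assumptions made for the example.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows at random, with replacement."""
    return [rng.choice(rows) for _ in rows]

def train_stump(sample):
    """Hypothetical stand-in for a CART tree: a single split on one feature."""
    best = None
    for threshold in sorted({x for x, _ in sample}):
        left = [label for x, label in sample if x <= threshold]
        right = [label for x, label in sample if x > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = (sum(l != left_label for l in left)
                  + sum(l != right_label for l in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x <= threshold else right_label

def majority_vote(predictions):
    """Return the most frequent class among the per-tree predictions."""
    return Counter(predictions).most_common(1)[0][0]

def bagged_predict(trees, x):
    """Each tree votes; the ensemble predicts the most frequent class."""
    return majority_vote([tree(x) for tree in trees])

# Illustrative data (assumed): one numeric feature and a class label.
rng = random.Random(0)
data = [(0.1, "blue"), (0.2, "blue"), (0.3, "blue"),
        (0.7, "red"), (0.8, "red"), (0.9, "red")]

# One bootstrap sample per tree, so each tree sees a different randomized set.
trees = [train_stump(bootstrap_sample(data, rng)) for _ in range(25)]
```

Calling `majority_vote(["blue", "blue", "red", "blue", "red"])` reproduces the five-tree example in the quoted passage, and `bagged_predict(trees, x)` applies the same vote to the trained ensemble.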
While Honda in view of Bilenko, in further view of Brownlee teaches applying feature engineering methods to create new features to be added into each data entry within a respective group of data entries, Honda in view of Bilenko, in further view of Brownlee does not explicitly teach
… randomizing the number of sets of data … a plurality of times … wherein randomizing the number of sets of data is performed based on a respective probability distribution characterizing a respective probability to obtain different numbers of sets of data of the respective group …
Chen teaches
… randomizing the number of sets of data … a plurality of times … wherein randomizing the number of sets of data is performed based on a respective probability distribution characterizing a respective probability to obtain different numbers of sets of data of the respective group (Examiner’s note: Under its broadest reasonable interpretation, this limitation broadly recites obtaining different values based on performing random sampling based on a probability distribution characterizing a respective probability, where the different values are represented by a value corresponding to a number of sets of data (i.e., a count). Chen teaches performing random sampling on a large number of testing dies (“randomizing … a plurality of times”) to determine an expected number of good packages based on the probability of frequency of occurrence of good packages, where the random sampling involves randomly packaging the dies into packages or stacks with s number of dies into a stack to determine a probability distribution of good packages that is based on a binomial distribution and a conditional probability distribution of expected good packages containing s dies given the predicted number of good packages containing s dies $p(H=0 \mid y=0)^{s}$. This conditional probability represents the frequency of occurrence of good s dies within a set of ‘good packages’, and hence represents a number or count of sets of data for the group of data that is labeled as ‘good packages’. Thus, the calculated expected number of good packages represents the different values of the numbers of ‘good packages’ (representing randomized values of different numbers of sets of data) based on a binomial distribution, where these different values are obtained based on the frequency of occurrence of the number of good s dies in a ‘good package’ group (Chen pp.48-50 Section 4.4 Mathematical Formulation: “… With a large number of testing dies, we can estimate the underlying probability by the relative frequency. We denote “positive” or fail by 1, and “negative” or pass by 0. … Without any classifiers, if we randomly package the dies into packages or stacks with s die in the stack, the failure rate of the packages is $p_{\text{package fail}} = 1 - p(H=0)^{s}$. Then if n dies are produced, the expected number of good packages is $E[m_1(s)] = \frac{n}{s}\, p(H=0)^{s}$, where round-off error is neglected (leftover dies < s not packaged). … we can stack the dies predicted as fail together and the dies predicted as pass together. Using the law of total expectation, the expected number of good packages in this case is given by $E[m_2(s)] = E[E[m_2(s) \mid k]] = E[k\, p(H=0 \mid y=0)^{s}] = \frac{n}{s}\, p(H=0 \mid y=0)^{s}\, p(y=0)$, where k is the number of stacks packaged as high end products and k is subject to a binomial distribution $B(\frac{n}{s}, p(y=0))$.”).) …
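As a rough numerical illustration of the expressions quoted from Chen Section 4.4, the sketch below evaluates the expected number of good packages with and without a classifier, and checks the second expression with a small Monte Carlo run in which k is drawn from a binomial distribution. The probabilities, n, s, and all function names are assumptions chosen for the example and are not taken from Chen; the simulation also assumes that each of the k stacks is good independently with probability p(H=0|y=0)^s.

```python
import random

def expected_good_random(n, s, p_pass):
    """E[m1(s)] = (n/s) * p(H=0)^s: dies packaged at random, round-off neglected."""
    return (n / s) * p_pass ** s

def expected_good_sorted(n, s, p_pass_given_pred_pass, p_pred_pass):
    """E[m2(s)] = (n/s) * p(H=0|y=0)^s * p(y=0): predicted-pass dies stacked together."""
    return (n / s) * p_pass_given_pred_pass ** s * p_pred_pass

def simulate_good_sorted(n, s, p_pass_given_pred_pass, p_pred_pass, trials, rng):
    """Monte Carlo estimate of E[m2(s)] with k ~ B(n/s, p(y=0))."""
    stacks = n // s
    q = p_pass_given_pred_pass ** s  # assumed chance that one stack is all-good
    total = 0
    for _ in range(trials):
        # k: number of stacks packaged as high-end products (binomial draw).
        k = sum(rng.random() < p_pred_pass for _ in range(stacks))
        # Each of the k stacks is good with probability q (independence assumed).
        total += sum(rng.random() < q for _ in range(k))
    return total / trials

# Assumed illustrative values.
N_DIES, STACK = 1600, 4
P_PASS = 0.90            # p(H=0)
P_PASS_GIVEN_OK = 0.98   # p(H=0 | y=0)
P_PRED_OK = 0.85         # p(y=0)

m1 = expected_good_random(N_DIES, STACK, P_PASS)
m2 = expected_good_sorted(N_DIES, STACK, P_PASS_GIVEN_OK, P_PRED_OK)
m2_sim = simulate_good_sorted(N_DIES, STACK, P_PASS_GIVEN_OK, P_PRED_OK,
                              trials=200, rng=random.Random(0))
```

Under these assumed values the closed-form `m2` and the simulated `m2_sim` should agree to within sampling noise, which is the law-of-total-expectation step in the quoted derivation.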
Both Honda in view of Bilenko, in further view of Brownlee and Chen are analogous art since they both teach techniques for processing semiconductor manufacturing data to predict chip yield.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the reduced training set of data taught in Honda in view of Bilenko, in further view of Brownlee and further perform the steps of randomly selecting a package die pass/fail value (where the package die pass/fail value represents a label) based on a binomial distribution taught in Chen as a way to further generate a plurality of randomized training sets of data. The motivation to combine is taught in Chen, as a way to linearly approximate an overall yield improvement for a number of dies in a package that closely matches the predicted values when a classifier is trained on a training set of data (see Chen p.56 Figure 4-9), thus providing a reliable way to perform an aggregated approximation for a label and improving the robustness of the training set (Chen p.57 1st paragraph: “Recalling Equation (4.18), the expected yield improvement is a function of TPR and FPR. However, FPR and TPR are constrained by the ROC curve of the classifier. Then the optimal point (FPR*, TPR*) is where the contour plot is tangent to the ROC curve and the corresponding optimal threshold y* is determined for future prediction. Figure 4-9 gives the ROC curve and contour plots of expected yield improvement $E[m_2(s) - m_1(s)]/(n/s)$ with different s (number of dies in a package). From the contour plot we can see that our linearization in Equation (4.19) is a good approximation even at s = 16, which can be used as a fast estimation of the expected yield improvement.”). 
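To make the normalized quantity in the quoted passage concrete, the following sketch evaluates the expected yield improvement per package, E[m2(s) - m1(s)]/(n/s), for several stack sizes s. It relies only on the closed forms quoted from Chen Section 4.4, in which the n/s factor cancels. The probability values and names are assumptions for illustration; the sketch does not reproduce Chen's Equation (4.18) or the linearization in Equation (4.19).

```python
def improvement_per_package(s, p_pass, p_pass_given_pred_pass, p_pred_pass):
    """(E[m2(s)] - E[m1(s)]) / (n/s) = p(H=0|y=0)^s * p(y=0) - p(H=0)^s."""
    return p_pass_given_pred_pass ** s * p_pred_pass - p_pass ** s

# Assumed illustrative probabilities (not taken from Chen).
P_PASS = 0.90            # p(H=0)
P_PASS_GIVEN_OK = 0.98   # p(H=0 | y=0)
P_PRED_OK = 0.85         # p(y=0)

# Improvement per package for several stack sizes s.
table = {
    s: improvement_per_package(s, P_PASS, P_PASS_GIVEN_OK, P_PRED_OK)
    for s in (1, 2, 4, 8, 16)
}
```

With these assumed values the improvement is negative at s = 1, where setting aside predicted-fail dies can only lose good dies, and becomes positive for larger stacks, where a single bad die spoils a whole package.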
Regarding amended Claim 22, 
Honda in view of Bilenko, in further view of Brownlee, in even further view of Chen teaches
(Currently Amended) The method of claim 21, wherein, 
for a feature, feature values of different sets of data meet the similarity criterion when at least one of (a), (b), (c) and (d) is met (Examiner’s note: Under its broadest reasonable interpretation, this claim limitation in a method claim recites a contingent clause that effectively renders the subsequent claim language to not be performed because the condition precedent (“when at least one of (a), (b), (c), and (d) is met”) is not required to be met, and the claimed invention can be practiced without the condition occurring. See MPEP 2111.04(II). Applicant is advised to amend the claim to positively recite the condition as being fulfilled, since no patentable weight is given for the subsequent claim language following a contingent clause that does not require the condition to be fulfilled for practicing the claimed invention. However, for the purposes of examination, this contingent clause will be treated as if the condition were fulfilled.): 
(a) the feature values are equal (Examiner’s note: Bilenko teaches aspect values in a training set are attribute values (with each aspect corresponding to “a feature”, and the respective aspect values corresponding to “feature values”) and grouping common aspect values is a form of comparing those inspected aspect values to ensure they are equal to each other (Bilenko [0053]: “… the training system can group aspect values on the basis of shared aspect values. … the training system can ensure that all entries in a particular partition have at least one common aspect value (such as a particular user ID).”).);
(b) the feature values do not differ one from the other more than a threshold (Examiner’s note: As indicated earlier, Bilenko teaches different techniques for grouping data entries into a group, such as grouping aspect values based on a frequency measure with which different values occur within the master dataset, as well as grouping aspect values using hashing techniques to associate different partitions/groups with different hash buckets, where a hashing function can be applied to route a data entry containing a particular aspect value to a corresponding hash bucket. A person having ordinary skill in the art would understand that calculating the frequency of a feature value and grouping data entries based on similar frequency values requires determining various frequency ranges, such that data entries containing features whose frequency values fall within a given frequency range can be placed together in the same group, and hence represents a grouping technique based on a similarity criterion that groups together features that do not differ by more than a threshold amount (i.e., a frequency range). Similarly, a hashing technique can also be thought of as another type of similarity criterion that applies a hash function that produces a result based on a calculation that involves applying a threshold range for the feature value, and computing a hash function result that identifies and groups different data entries into different assigned hash buckets representing the hash function result (Bilenko Figure 1, elements 110, 112; [0033], [0036]-[0037]; and [0049]-[0053]).); 
(c) the feature values are equal after the feature values have been approximated; 
(d) at least one of (a), (b), (c) is met and label values of the different sets of data are similar according to a second similarity criterion.  
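For illustration, criteria (a) through (c) can each be read as a choice of grouping key applied to a feature value. The sketch below is one hypothetical reading and is not drawn from the claims, Honda, or Bilenko: the function names, bin width, and rounding precision are assumptions. Note that fixed-width binning only approximates criterion (b), since two values closer together than the width can still fall on opposite sides of a bin boundary.

```python
from collections import defaultdict

def key_equal(value):
    """(a) values group together only when they are exactly equal."""
    return value

def key_within_width(value, width):
    """(b) approximated by fixed-width bins: members differ by less than width."""
    return int(value // width)

def key_rounded(value, ndigits):
    """(c) values group together when equal after approximation (rounding)."""
    return round(value, ndigits)

def group_by(values, key_fn):
    """Collect values that share the same grouping key."""
    groups = defaultdict(list)
    for value in values:
        groups[key_fn(value)].append(value)
    return dict(groups)

# Assumed illustrative feature values.
values = [10.0, 10.2, 10.4, 13.1]

exact = group_by(values, key_equal)
binned = group_by(values, lambda v: key_within_width(v, 0.5))
rounded = group_by(values, lambda v: key_rounded(v, 0))
```

Here `exact` keeps four singleton groups, while `binned` and `rounded` each place the first three values in one group and leave 13.1 on its own.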
Regarding previously presented Claim 23, 
Honda in view of Bilenko, in further view of Brownlee, in even further view of Chen teaches
(Previously presented) The method of claim 21, wherein the features correspond to manufacturing data of the electronic item (Examiner’s note: As indicated earlier, Honda teaches preparing a dataset as part of a machine learning pipeline for use in one or more predictive models, where this dataset is generated based on measurements that include measured and engineered/enriched semiconductor chip feature data produced from the WAT/PCM, WS, CP, FT test levels, where the WS and FT test levels produce hardbin labels indicating passed/good chips and failed/bad chips, and identifies poor or healthy wafer quality based on a number of passed chips at the WS test level (Honda Figure 1; [0022]-[0025]; [0088]-[0089]; and [0091]-[0122]).) and 
the at least one label corresponds to at least one quality attribute of the electronic item (Examiner’s note: As indicated earlier, Honda teaches preparing a dataset as part of a machine learning pipeline for use in one or more predictive models, where this dataset is generated based on measurements that include measured and engineered/enriched semiconductor chip feature data produced from the WAT/PCM, WS, CP, FT test levels, where the WS and FT test levels produce hardbin labels indicating passed/good chips and failed/bad chips, and identifies poor or healthy wafer quality based on a number of passed chips at the WS test level, and hence these labels indicating passed/failed chips or poor/healthy wafer quality correspond to at least one quality attribute of an electronic item (Honda Figure 1; [0022]-[0025]; [0088]-[0089]; and [0091]-[0122]).).  
Regarding previously presented Claim 24, 
Honda in view of Bilenko, in further view of Brownlee, in even further view of Chen teaches
(Previously presented) The method of claim 21, wherein for a majority of the groups comprising a plurality of sets of data, a number of aggregated set of data is less than a number of the sets of data of the group by a magnitude of at least ten (Examiner’s note: Under its broadest reasonable interpretation, the term “magnitude” means size or extent, and hence the term “a number of aggregated set of data is less than a number of the sets of data in the group by a magnitude of at least ten” is interpreted to mean that the number of aggregated set of data is at least ten less than the number of sets of data. Bilenko teaches selecting a representative data set instance within a cluster representing a collection of data set instances with a common aspect value (“a plurality of sets of data”), where the selection of the representative data set corresponds to “wherein for a majority of the groups comprising a plurality of sets of data, a number of aggregated set of data is less than a number of the sets of data of the group …”. As shown in Bilenko Figure 4, the representative data set instance (represented by a black dot) is among at least a group of 11 or 12 other representative data set instances, hence corresponding to “a number of aggregated set of data is less than … by a magnitude of at least ten”).  
Regarding previously presented Claim 25, 
Honda in view of Bilenko, in further view of Brownlee, in even further view of Chen teaches
(Previously presented) The method of claim 21, 
wherein the aggregated representation of at least one label value of the one or more sets of data of the group comprises a sum of all label values of the one or more sets of data of the group (Examiner’s note: Bilenko teaches examples of statistical feature information that include counts of label values, with the statistical feature information that includes counts of label values corresponding to “an aggregated representation of at least one label value of the one or more sets of data of the group comprises a sum of all label values of the one or more sets of data of the group” (Bilenko Figure 1, element 114; Bilenko Figure 3; Bilenko Figure 5, element 514; [0038]: “… different implementations may generate different kinds of statistical measures. … the training system can form a count of the label values associated with the training examples.”).).  
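As a small illustration of an aggregated label representation, the sketch below collapses each group of rows sharing the same feature values into a single entry that carries the group size and the sum of the label values. For binary labels encoded as 0/1, the sum equals the count of positive labels. The row layout, names, and data are assumptions for the example and are not taken from Bilenko or the application.

```python
from collections import defaultdict

def aggregate(rows):
    """Map each distinct feature tuple to (number_of_rows, sum_of_label_values)."""
    totals = defaultdict(lambda: [0, 0])
    for features, label in rows:
        totals[features][0] += 1      # how many sets of data fell in this group
        totals[features][1] += label  # running sum of the group's label values
    return {features: tuple(t) for features, t in totals.items()}

# Assumed illustrative rows: (feature_tuple, binary_label).
rows = [
    (("lot_A", 1), 0),
    (("lot_A", 1), 1),
    (("lot_A", 1), 1),
    (("lot_B", 2), 0),
]

reduced = aggregate(rows)
```

Four input rows reduce to two aggregated entries, each retaining enough information (count and label sum) to recover the group's label frequency.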
Regarding previously presented Claim 26, 
Honda in view of Bilenko, in further view of Brownlee, in even further view of Chen teaches
(Previously presented) The method of claim 21, wherein the randomizing further comprises randomizing, for each aggregated set of data of the reduced training set of data, the aggregated representation of at least one label value associated with the aggregated set of data (Examiner’s note: As indicated earlier, the term “reduced training set of data” exhibits a 112(b) indefiniteness issue, and hence for purposes of examination, this limitation will be interpreted as broadly reciting randomizing an aggregated label value associated within a training set. As indicated earlier, Chen teaches performing random sampling on a large number of testing dies to determine an expected number of good packages based on the probability of frequency of occurrence of good packages, where the random sampling involves randomly packaging the dies into packages or stacks with s number of dies into a stack to determine a probability distribution of good packages that is based on a binomial distribution and a conditional probability distribution of expected good packages containing s dies given the predicted number of good packages containing s dies $p(H=0 \mid y=0)^{s}$. This conditional probability represents the frequency of occurrence of good s dies within a set of ‘good packages’ (representing a label value associated with a group containing an aggregated set of data), and hence represents a number or count of sets of data for the group of data that is labeled as ‘good packages’. Thus, the calculated expected number of good packages represents the different values of the numbers of ‘good packages’ (representing randomized values of different numbers of sets of data) based on a binomial distribution, where these different values are obtained based on the frequency of occurrence of the number of good s dies in a ‘good package’ group (Chen pp.48-50 Section 4.4 Mathematical Formulation).).  
Regarding previously presented Claim 27, 
Honda in view of Bilenko, in further view of Brownlee, in even further view of Chen teaches
(Previously presented) The method of claim 26, wherein: 
randomizing the aggregated representation of at least one label value associated with the aggregated set of data is based on a probability distribution characterizing probability to obtain different values for the aggregated representation of the at least one label value (Examiner’s note: As indicated earlier, Chen teaches performing random sampling on a large number of testing dies to determine an expected number of good packages based on the probability of frequency of occurrence of good packages, where the random sampling involves randomly packaging the dies into packages or stacks with s number of dies into a stack to determine a probability distribution of good packages that is based on a binomial distribution and a conditional probability distribution of expected good packages containing s dies given the predicted number of good packages containing s dies $p(H=0 \mid y=0)^{s}$. This conditional probability represents the frequency of occurrence of good s dies within a set of ‘good packages’ (representing a label value associated with a group containing an aggregated set of data), and hence represents a number or count of sets of data for the group of data that is labeled as ‘good packages’. Thus, the calculated expected number of good packages represents the different values of the numbers of ‘good packages’ (representing randomized values of different numbers of sets of data) based on a binomial distribution, where these different values are obtained based on the frequency of occurrence of the number of good s dies in a ‘good package’ group (Chen pp.48-50 Section 4.4 Mathematical Formulation).).  
Regarding previously presented Claim 29, 
Honda in view of Bilenko, in further view of Brownlee, in even further view of Chen teaches
(Previously presented) The method of claim 21, comprising, by a processing unit (Examiner’s note: As indicated earlier, Honda teaches preparing a dataset based on measurements involving testing individual chips located on wafers in a semiconductor manufacturing process, where these measurements include measured feature data and engineered/enriched data produced from the different WAT/PCM, WS, CP, FT test levels, where the WS and FT test levels produce hardbin labels indicating passed/good chips and failed/bad chips. Honda teaches this dataset is prepared as part of a machine learning pipeline that generates and stores this dataset as training data for one or more predictive models. A person having ordinary skill in the art would understand that performing the set of process steps and machine learning pipeline requires a computing system containing a processor and associated memory (e.g., RAM and disk storage) coupled to each other, where the associated memory stores computer instructions representing these process steps and associated machine learning pipeline to execute the process steps (Honda Figure 3 and [0035]-[0041]; Figure 7 and [0059]; and Figures 12-14).):
providing at least one set of data comprising a plurality of feature values representative of at least one electronic item, for which at least one label value is to be predicted (Examiner’s note: As indicated earlier, Honda Figures 12-14 teach providing input data collected from the PCM, WS, FT testing stages into various model architectures to generate prediction results, where the model architectures represent single level and multiple-level models implemented with machine learning algorithms such as decision trees and random forests (Honda [0140]-[0144]).), and
predicting, based on the relationship, the label value associated with the set of data, thereby allowing prediction for the at least one electronic item (Examiner’s note: As indicated earlier, Honda Figures 12-14 teach providing input data collected from the PCM, WS, FT testing stages into various model architectures to generate prediction results, where the model architectures represent single level and multiple-level models implemented with machine learning algorithms such as decision trees and random forests, and where the prediction result represents the prediction of Returned Merchant Authorizations (RMAs) for packaged electronic chips, expressed as a probability. Honda further teaches that only chips that passed the FT testing are provided to chip users, and hence the prediction of RMAs for packaged electronic chips represents a prediction of whether that associated FT testing label is accurate or not (Honda [0087]-[0089]; [0140]-[0144]). Examiner notes that the claim language “… thereby allowing prediction for the at least one electronic item” recites an intended use of predicting the label value associated with the set of data, where this language is already reflected in the earlier claim limitations “providing at least one set of data comprising a plurality of feature values representative of at least one electronic item, for which at least one label value is to be predicted” and “predicting, based on the relationship, the label value associated with the set of data …”, and therefore is considered redundant claim language that does not further limit the claim limitation.).
Regarding previously presented Claim 31, 
Honda in view of Bilenko, in further view of Brownlee, in even further view of Chen teaches
(Previously presented) The method of claim 21, wherein:
the training set data is collected from at least operational data collected from at least a manufacturing line of one or more electronic items (Examiner’s note: Under its broadest reasonable interpretation, the term “operational data collected from at least a manufacturing line” is interpreted as data collected through a normal routine process, i.e., routine testing conducted during a manufacturing process. As indicated earlier, Honda teaches obtaining a training data set from a semiconductor manufacturing process, where this input data comes from the results of testing semiconductor wafers, and consists of computed and measured feature data from WAT/PCM, WS, CP, FT testing, where one of the FT test results includes a label indicating pass/fail (Honda Figure 1; [0022]-[0025]; [0088]-[0089]; and [0091]-[0122]).), 
wherein the method further comprises updating the relationship between the at least one label and the features of the electronic items based on an update of the operational data during manufacturing (Examiner’s note: Honda teaches a machine learning pipeline containing a feature selection pipe, where this feature selection pipe performs determinations as to which sensors that provide the collected input data and/or manufacturing steps may not be providing useful data for training the ML model, and to remove these steps/sensors from training the model (effectively identifying and removing the attributes/features from the manufacturing process that are not considered relevant to the model). Honda Figures 4-6 further teach various processes involving this feature selection functionality, where Figure 4 teaches a scenario where sensors (and their associated collected feature data) that provide the best cross-validation accuracy are retained, while those with the least cross-validation accuracy are removed, and where Figures 5 and 6 teach scenarios where sensors that identify key variables (i.e., relevant variables) in a model with the highest accuracy are retained, while those that do not identify key variables are removed (Honda Figures 4-6; and [0041], [0051]-[0054]).).
Regarding amended Claim 32,
Claim 32 recites a system, where the system comprises claim limitations that are similar in scope to the corresponding claim limitations recited in Claim 21, and hence is rejected under similar rationale and motivations provided by Honda, Bilenko, Brownlee, and Chen as indicated in Claim 21. In addition, Honda Figures 3 and 7 teach a process and a machine learning pipeline for processing input data collected from data sources during a production run for a semiconductor manufacturing process, and generating multiple predictive models for failure detection and classification, where the input data is for training the multiple predictive models used in different model architectures. A person having ordinary skill in the art would understand that performing the set of process steps and machine learning pipeline requires a computing system containing a processor and associated memory (e.g., RAM and disk storage) coupled to each other, where the associated memory stores computer instructions representing these process steps and associated machine learning pipeline to execute the process steps, where the process steps include storing the input data as well as all outputs resulting from the process steps and machine learning pipeline, and where the outputs include predictions produced from a machine learning model (Honda Figure 3 and [0035]-[0041]; Figure 7 and [0059]; and Figures 12-14).
Regarding amended Claim 33,
Claim 33 recites the system of claim 32, where the system further comprises claim limitations that are similar in scope to the corresponding claim limitations recited in Claim 22, and hence is rejected under similar rationale provided by Honda, Bilenko, Brownlee, and Chen as indicated in Claim 22, in view of the rejections applied to Claim 32.
Regarding previously presented Claim 34,
Claim 34 recites the system of claim 32, where the system further comprises claim limitations that are similar in scope to the corresponding claim limitations recited in Claim 23, and hence is rejected under similar rationale provided by Honda, Bilenko, Brownlee, and Chen as indicated in Claim 23, in view of the rejections applied to Claim 32.
Regarding previously presented Claim 35,
Claim 35 recites the system of claim 32, where the system further comprises claim limitations that are similar in scope to the corresponding claim limitations recited in Claim 26, and hence is rejected under similar rationale and motivations provided by Honda, Bilenko, Brownlee, and Chen as indicated in Claim 26, in view of the rejections applied to Claim 32.
Regarding previously presented Claim 36,
Claim 36 recites the system of claim 32, where the system further comprises claim limitations that are similar in scope to the corresponding claim limitations recited in Claim 29, and hence is rejected under similar rationale provided by Honda, Bilenko, Brownlee, and Chen as indicated in Claim 29, in view of the rejections applied to Claim 32.
Regarding amended Claim 37,
Claim 37 recites a non-transitory storage device readable by a machine, where the non-transitory storage device embodies a program of instructions executable by a machine to perform operations comprising claim limitations that are similar in scope to the corresponding claim limitations recited in Claim 21, and hence is rejected under similar rationale and motivations provided by Honda, Bilenko, Brownlee, and Chen as indicated in Claim 21. In addition, Honda Figures 3 and 7 teach a process and a machine learning pipeline for processing input data collected from data sources during a production run for a semiconductor manufacturing process, and generating multiple predictive models for failure detection and classification, where the input data is for training the multiple predictive models used in different model architectures. A person having ordinary skill in the art would understand that performing the set of process steps and machine learning pipeline requires a computing system containing a processor and associated memory (e.g., RAM and disk storage) coupled to each other, where the associated memory stores computer instructions representing these process steps and associated machine learning pipeline to execute the process steps, where the process steps include storing the input data as well as all outputs resulting from the process steps and machine learning pipeline, and where the outputs include predictions produced from a machine learning model (Honda Figure 3 and [0035]-[0041]; Figure 7 and [0059]; and Figures 12-14).
Regarding amended Claim 38,
Claim 38 recites the non-transitory storage device of claim 37, where the non-transitory storage device further comprises claim limitations that are similar in scope to the corresponding claim limitations recited in Claim 22, and hence is rejected under similar rationale provided by Honda, Bilenko, Brownlee, and Chen as indicated in Claim 22, in view of the rejections applied to Claim 37.
Regarding previously presented Claim 39,
Claim 39 recites the non-transitory storage device of claim 37, where the non-transitory storage device further comprises claim limitations that are similar in scope to the corresponding claim limitations recited in Claim 26, and hence is rejected under similar rationale and motivations provided by Honda, Bilenko, Brownlee, and Chen as indicated in Claim 26, in view of the rejections applied to Claim 37.
Regarding previously presented Claim 40,
Claim 40 recites the non-transitory storage device of claim 37, where the non-transitory storage device further comprises claim limitations that are similar in scope to the corresponding claim limitations recited in Claim 29, and hence is rejected under similar rationale provided by Honda, Bilenko, Brownlee, and Chen as indicated in Claim 29, in view of the rejections applied to Claim 37.
Claim 28 is rejected under 35 U.S.C. 103 as being unpatentable over 
Honda et al., U.S. PGPUB 2019/0277913, filed 3/8/2019 [hereafter referred as Honda], in view of Bilenko et al., U.S. PGPUB 2014/0337096, published 11/13/2014 [hereafter referred as Bilenko], in further view of Brownlee, Jason, Bagging and Random Forest Ensemble Algorithms for Machine Learning, retrieved from web.archive.org dated June 25, 2019 [hereafter referred as Brownlee], in even further view of Chen, Hongge, Novel Machine Learning Approaches for Modeling Variations in Semiconductor Manufacturing (Masters Thesis), June 2017 [hereafter referred as Chen] as applied to Claim 21; in even further view of Graefe, Goetz, Query Evaluation Techniques for Large Databases, June 1993 [hereafter referred as Graefe].
Regarding previously presented Claim 28, 
Honda in view of Bilenko, in further view of Brownlee, in even further view of Chen as applied to Claim 21 teaches
(Previously presented) The method of claim 21,
… wherein the reduced training set of data comprises a plurality of aggregated sets of data (Examiner’s note: As indicated earlier, the term “reduced training set of data” exhibits a 112(b) indefiniteness issue, and hence for purposes of examination, this limitation will be interpreted as broadly reciting a training set containing a plurality of groups containing respective data entries, where these plurality of groups represent a plurality of aggregated sets of data. As indicated earlier, Bilenko teaches a master data set containing a plurality of training examples, where each training example contains aspect values and a label, where these aspect values describe different event characteristics, thus representing features for each training example (Bilenko [0032]-[0033]). Bilenko further teaches a partitioning process to produce training set instances that represent different clusters or partitions, where the partitioning process divides the master data set into multiple partitions according to the feature/attribute values identified within each training example. Bilenko further teaches different techniques for grouping data entries into a group, such as grouping aspect values based on a frequency measure in which different values occur within the master dataset, as well as grouping aspect values using hashing techniques to associate different partitions/groups with different hash buckets, where a hashing function can be applied to route a data entry containing a particular aspect value to a corresponding hash bucket.
A person having ordinary skill in the art would understand that calculating the frequency of a feature value and grouping data entries based on similar frequency values requires determining various frequency ranges, such that data entries whose features exhibit frequency values within the same frequency range are placed together in the same group. These techniques hence represent grouping based on a similarity criterion that groups together features that do not differ by more than a threshold amount (i.e., a frequency range). Similarly, a hashing technique can also be thought of as another type of similarity criterion that applies a hash function that produces a result based on a calculation that involves applying a threshold range for the feature value, and computing a hash function result that identifies and groups different data entries into different assigned hash buckets representing the hash function result. With both techniques, the end result is a training set that contains a plurality of clusters or partitions containing groups of data entries that represent a plurality of aggregated sets of data (Bilenko Figure 1, elements 110, 112; [0033], [0036]-[0037]; and [0049]-[0053]).) …
However, Honda in view of Bilenko, in further view of Brownlee, in even further view of Chen does not explicitly teach 
… wherein the number of aggregated sets of data in the reduced training set of data does not increase if the training set of data is expanded with at least one set of data comprising feature values which are similar to feature values already present in the training set of data for at least one set of data, according to the similarity criterion …
Graefe teaches
… wherein the number of aggregated sets of data in the reduced training set of data does not increase if the training set of data is expanded with at least one set of data comprising feature values which are similar to feature values already present in the training set of data for at least one set of data, according to the similarity criterion (Examiner’s note: Under its broadest reasonable interpretation, this claim limitation in a method claim recites a contingent clause that effectively renders the subsequent claim language to not be performed because the condition precedent (“if the training set of data is expanded with at least one set of data …”) is not required to be met, and the claimed invention can be practiced without the condition occurring. See MPEP 2111.04(II). Applicant is advised to amend the claim to positively cite the condition as being fulfilled, since no patentable weight is given for the subsequent claim language following a contingent clause that does not require the condition to be fulfilled for practicing the claimed invention. However, for the purposes of examination, this contingent clause will be treated as if the condition were fulfilled. 
Graefe teaches introducing new sources of data through a read-ahead operation, and performing a merge operation that involves sorting and duplicate removal on a set of aggregated data receiving additional data, where the read-ahead operation is interpreted as “if the training set of data is expanded with at least one set of data”, and where the sorting and duplicate removal operations are interpreted as identifying “at least one set of data comprising feature values which are similar to feature values already present in the training set of data for at least one set of data, according to the similarity criterion”, with the duplicate removal producing a result in which “the number of aggregated sets of data in the reduced training set of data does not increase …”, i.e., the overall size of the reduced training data set does not grow (Graefe p.100 col.1 Section 4.2 Aggregation Algorithms Based on Sorting 1st – 3rd paragraphs: “Sorting will bring equal items together, and duplicate removal will then be easy. The cost of duplicate removal is dominated by the sort cost, and the cost of this naive duplicate removal algorithm based on sorting can be assumed to be that of the sort operation. For aggregation, items are sorted on their grouping attributes. This simple method can be improved by detecting and removing duplicates as early as possible, easily implemented in the routines that write run files during sorting. With such "early" duplicate removal or aggregation, a run file can never contain more items than the final output (because otherwise it would contain duplicates!), which may speed up the final merges significantly [Bitton and De Witt 1983]. … the operations discussed in the section on sorting, namely read-ahead using forecasting, merge optimizations, large cluster sizes, and reduced final fan-in for binary consumer operations, are fully applicable when sorting is used for aggregation and duplicate removal.”).).
Both Honda in view of Bilenko, in further view of Brownlee, in even further view of Chen and Graefe are analogous art since they both teach managing and evaluating data sets containing sets of values using data aggregation techniques involving mathematical functions.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the data cleansing/data pre-processing, feature selection, and feature engineering components taught in Honda in view of Bilenko, in further view of Brownlee, in even further view of Chen and enhance it to support other aggregation functions such as summation and counting, and perform additional reduction techniques such as duplicate removal taught in Graefe as a way to apply other techniques to generate a reduced training set of data comprising aggregated representations of the original training set of data. The motivation to combine is taught in Graefe, as aggregation techniques form the foundation of data processing techniques such as hashing and sorting, both of which help to optimize data storage when dealing with large amounts of data, thus allowing the system to be more memory efficient, and also allowing the concise representations of data to be stored and partitioned for further pipelining and parallelization operations in the system, in order to further optimize the performance and efficiency of the system (Graefe p.158 col.1 2nd-3rd paragraphs (Summary and Outlook): “A large set of query processing algorithms has been developed for relational systems. Sort- and hash-based techniques have been used for physical storage design, for associative index structures, for algorithms for unary and binary matching operations such as aggregation, duplicate removal, join, intersection, and division, and for parallel query processing using hash- or range partitioning. … Many of the existing algorithms will continue to be useful for extensible and object-oriented systems, and many can easily be generalized from sets of tuples to more general pattern-matching functions. … it allows algebraic optimizations of requests, i.e., optimizing transformations of algebra expressions and cost-sensitive translations of logical into physical expressions. Finally, it permits pipelining between operators to exploit parallel computer architectures and partitioning of stored data and intermediate results for most operators, in particular, for operators on sets but also for other bulk types such as arrays, lists, and time series.”).
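For illustration only, the sort-based aggregation with duplicate removal discussed above can be sketched as follows. This is a minimal sketch under stated assumptions, not an implementation from Graefe or Bilenko: the similarity criterion is assumed to be a fixed-width bucketing of each feature value, and the names `similarity_key`, `aggregate`, and `width`, as well as the sample values, are hypothetical.

```python
from itertools import groupby
from math import floor


def similarity_key(features, width):
    """Map a feature tuple to a bucket key.

    Assumed similarity criterion: two entries are "similar" when every
    feature falls into the same bucket of the given width.
    """
    return tuple(floor(value / width) for value in features)


def aggregate(records, width):
    """Sort on the grouping key, then collapse each run of equal keys
    into a single (key, count) pair, as in sort-based duplicate removal."""
    ordered = sorted(records, key=lambda r: similarity_key(r, width))
    return [
        (key, sum(1 for _ in group))
        for key, group in groupby(ordered, key=lambda r: similarity_key(r, width))
    ]


# Hypothetical training set with two similar entries and one distinct entry.
training = [(1.0, 2.0), (1.1, 2.1), (5.0, 5.0)]
before = aggregate(training, width=0.5)

# Expanding the set with an entry similar to existing ones changes a count,
# but not the number of aggregated groups.
after = aggregate(training + [(1.2, 2.2)], width=0.5)
```

In this sketch, adding an entry that meets the assumed similarity criterion only increments the count of an existing group, so the number of aggregated sets stays the same.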
Claim 30 is rejected under 35 U.S.C. 103 as being unpatentable over 
Honda et al., U.S. PGPUB 2019/0277913, filed 3/8/2019 [hereafter referred as Honda], in view of Bilenko et al., U.S. PGPUB 2014/0337096, published 11/13/2014 [hereafter referred as Bilenko], in further view of Brownlee, Jason, Bagging and Random Forest Ensemble Algorithms for Machine Learning, retrieved from web.archive.org dated June 25, 2019 [hereafter referred as Brownlee], in even further view of Chen, Hongge, Novel Machine Learning Approaches for Modeling Variations in Semiconductor Manufacturing (Masters Thesis), June 2017 [hereafter referred as Chen] as applied to Claim 21; in even further view of Won et al., Random Forest Model for Silicon-to-SPICE Gap and FinFET Design Attribute Identification, IEIE Transactions on Smart Processing and Computing Vol.5, No.5, October 2016 [hereafter referred as Won].
Regarding previously presented Claim 30, 
Honda in view of Bilenko, in further view of Brownlee, in even further view of Chen as applied to Claim 21 teaches
(Previously presented) The method of claim 21, further comprising …
… by a processing unit (Examiner’s note: As indicated earlier, Honda teaches preparing a dataset based on measurements involving testing individual chips located on wafers in a semiconductor manufacturing process, where these measurements include measured feature data and engineered/enriched data produced from the different WAT/PCM, WS, CP, FT test levels, where the WS and FT test levels produce hardbin labels indicating passed/good chips and failed/bad chips. Honda teaches this dataset is prepared as part of a machine learning pipeline that generates and stores this dataset as training data for one or more predictive models. A person having ordinary skill in the art would understand that performing the set of process steps and machine learning pipeline requires a computing system containing a processor and associated memory (e.g., RAM and disk storage) coupled to each other, where the associated memory stores computer instructions representing these process steps and associated machine learning pipeline to execute the process steps, where the process steps include storing the input data as well as all outputs resulting from the process steps and machine learning pipeline, and where the outputs include predictions produced from a machine learning model (Honda Figure 3 and [0035]-[0041]; Figure 7 and [0059]; and Figures 12-14).) …
… an importance of one or more features (Examiner’s note: As indicated earlier, Brownlee teaches applying bagging/bootstrap aggregation techniques to a plurality of decision tree/random forest trees to identify and determine important variables that lead to a particular prediction or outcome, in order to identify subsets of input variables that may be most or least relevant to the problem (Brownlee p.2 Variable Importance).) …
While Honda in view of Bilenko, in further view of Brownlee, in even further view of Chen teaches applying bagging/bootstrap aggregation techniques to a plurality of decision tree/random forest trees to determine variable importances, Honda in view of Bilenko, in further view of Brownlee, in even further view of Chen does not explicitly teach
… based on the relationship, determining at least one of:
… an importance … with respect to the at least one label, the importance being representative of a level of contribution of the features in the at least one label; and
… an impact of one or more features with respect to the at least one label, the impact being representative of whether the one or more features increase or decrease the at least one label. 
Won teaches
… based on the relationship, determining at least one of:
… an importance … with respect to the at least one label, the importance being representative of a level of contribution of the features in the at least one label (Examiner’s note: Won teaches calculating an importance index based on a ratio of the sum of nodes at which a particular design attribute is used to split the S2S gap data (representing a prediction result located at a particular terminal node) into the next nodes, and the sum of all nodes in the random forest model except for terminal nodes (Won p.363 Equation 8), where larger importance values indicate which design attributes have a larger contribution towards determining the selection of a particular S2S gap data in the model (Won p.363 col.1 3rd paragraph and col.2 3rd paragraph (Section 3.3 Significant Design Attributes 1st paragraph); and p.363 Figure 12).); and
… an impact of one or more features with respect to the at least one label, the impact being representative of whether the one or more features increase or decrease the at least one label (Examiner’s note: Won teaches calculating an impact index for each design attribute based on the mean value shifts in the path from one node to a next left node during random forest node traversal (Won p.363 col.2 Equations 9 and 10), where larger minus or plus impact values indicate which design attributes have more power (influence) to drive (effect) the selection of a particular S2S gap data (representing a prediction result located at a particular terminal node) in the minus or plus directions (Won p.363 col.2 2nd and 4th paragraphs (Section 3.3 Significant Design Attributes); and p.364 Figure 13).).
Honda in view of Bilenko, in further view of Brownlee, in even further view of Chen and Won are analogous art since they both teach processing semiconductor manufacturing data using bootstrap aggregation techniques.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the bootstrap aggregation techniques taught in Honda in view of Bilenko, in further view of Brownlee, in even further view of Chen and apply learning data related to semiconductor design attributes and S2S gap data taught in Won to further analyze and determine importance and influence of certain design attributes/features (represented by the nodes in a decision tree). The motivation to combine is taught in Won, since metrics such as importance and impact allow process engineers to identify those design attributes/features that have the most contribution to the S2S gap decision output, where this S2S gap is used as a measure of quality for improving chip yield. By identifying the most relevant or contributing design attributes through this analysis, process engineers can focus on either minimizing their influence (if it has a negative impact on the final output result or prediction) or maximizing their influence (if it has a positive impact on the final output result or prediction), and as such, this analysis provides valuable diagnostic information to improve overall chip yield in a manufacturing system (Won p.358 Section 1. Introduction 1st paragraph: “To accelerate product yield ramp-up, it is important to characterize a silicon device accurately by measuring a device-under-test (DUT) designed exactly the same as in real production chips. …”; p.359 col.1 1st paragraph (Section 1 Introduction): “S2S gap may come from incorrect modeling for particular design layouts, high layout sensitivity to process fluctuation or defects in layouts, etc. Finding design attributes that result in a large S2S gap and fixing the causes related to the design attributes, such as layout features, are crucial for timely yielding of ramp-up. But the number of design attributes is increasing significantly in the recent technology node, and the impacts of design attributes are sometimes interdependent. So it becomes more and more difficult to accurately analyze the impact of individual design attributes …”; and p.364 col.1 2nd paragraph-col.2 2nd paragraph: “As importance indicates, the S2S gap is clearly classified by the identified significant design attributes … This means that the design attributes identified by importance have an important role in determining the S2S gap. … As impact indicates, the S2S gap is verified to show a clear trend of a larger minus S2S gap under the following design attribute conditions … This means that the design attributes and values (i.e., conditions) identified by the minus value of impact, surely drive the S2S gap into the minus direction. Conversely, the design attributes and values identified by the plus value of impact have a driving force into the plus direction, as well.”).
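For illustration only, the importance index as characterized in the Examiner’s note above (the number of split nodes that use a given design attribute, divided by the number of all non-terminal nodes in the forest) can be sketched as follows. This is a minimal sketch based solely on that characterization and is not a reproduction of Won Equation 8: the forest representation (one list of split-attribute names per tree), the function name `importance_index`, and the attribute names are hypothetical.

```python
from collections import Counter


def importance_index(forest):
    """Share of non-terminal (split) nodes that use each attribute.

    Assumed input: each tree is given as a list holding the name of the
    attribute used at each of its non-terminal nodes.
    """
    split_counts = Counter(attr for tree in forest for attr in tree)
    total_splits = sum(split_counts.values())
    return {attr: count / total_splits for attr, count in split_counts.items()}


# Hypothetical forest of two trees with five split nodes in total.
forest = [
    ["fin_pitch", "gate_length", "fin_pitch"],
    ["gate_length", "fin_pitch"],
]
importance = importance_index(forest)
```

Under this reading, an attribute that is chosen at more split nodes receives a larger index, which corresponds to the larger contribution described in the note. The impact index is not sketched here because the note does not give enough detail of Won Equations 9 and 10 to do so without guessing.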

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to WILLIAM WAI YIN KWAN whose telephone number is 303-297-4332. The examiner can normally be reached Monday-Friday 8:00am - 4:30pm PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Li B Zhen can be reached on 571-272-3768. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/WILLIAM WAI YIN KWAN/Examiner, Art Unit 2121


/Li B. Zhen/Supervisory Patent Examiner, Art Unit 2121