Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment
This action is in response to the Preliminary Amendment filed on 3/20/2021. Claims 1-20 are cancelled and new claims 21-40 are added. The amendment has been entered. Claims 21-40 are pending in the application.

Specification
The lengthy specification has not been checked to the extent necessary to
determine the presence of all possible minor errors. Applicant's cooperation is
requested in correcting any errors of which applicant may become aware in the
specification.

Examiner Notes
Examiner cites particular columns, paragraphs, figures and line numbers in the references as applied to the claims below for the convenience of the applicant. Although the specified citations are representative of the teachings in the art and are applied to the specific limitations within the individual claim, other passages and figures may apply as well. It is respectfully requested that, in preparing responses, the applicant fully consider the references in their entirety as potentially teaching all or part of the claimed invention, as well as the context of the passage as taught by the prior art or disclosed by the examiner. The entire reference is considered to provide disclosure relating to the 

Claim Objections
Claim 39 is objected to because of the following informalities: Claim 39 recites “and j repeat procedures (d)-(f) until the comparison …”. Examiner proposes amending the limitation to read “and (j) repeat procedures (d)-(f) until the comparison …”. Appropriate correction is required.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:


Claims 21-40 are rejected under 35 U.S.C. 103 as being unpatentable over Maria et al. (NPL: DataGenCARS: A generator of synthetic data for the evaluation of context-aware recommendation systems, 2017), hereinafter Maria, in view of Veeramachaneni et al. (US 2018/0165475 A1), hereinafter Veeramachaneni.

Claim 21. (New) A non-transitory computer-accessible medium having stored thereon computer- executable instructions for evaluating at least one synthetic dataset, wherein, when a computer arrangement executes the instructions, the computer arrangement is configured to perform procedures comprising: 
Maria: (page 516 Introduction) “fields of mobile computing and recommendation systems” [corresponds to a computer system comprising a processor coupled with memory]
Maria discloses receiving at least one original dataset; receiving the at least one synthetic dataset; 
Maria: (page 529 section 4.2) “First, in Section 4.2.1, we present an experiment where we generate a synthetic dataset [correspond to receive synthetic dataset related to the original dataset] of items that exhibits features similar to those present in a pre-existing real dataset. [correspond to receive original dataset]”
Maria discloses training at least one model using the at least one original dataset and the at least one synthetic dataset; 

Maria discloses generating a statistical correlation score based on the at least one synthetic dataset and the at least one original dataset; generating an evaluation score by evaluating the at least one synthetic dataset based on the training of the least one model, wherein the evaluation score includes the statistical correlation score; 
Maria: (page 530 section 4.2.1) “With the original dataset, Weka obtained the following performance metrics: MAE = 0.0342, precision = 0.96, recall = 0.96, and F-measure = 0.96; with the synthetic dataset, it obtained:  MAE = 0.0177, precision = 0.987, recall = 0.987, and F-measure = 0.987. [correspond to generating a statistical correlation score based on the at least one synthetic dataset and the at least one original dataset] As it can be observed, the performance over the two datasets is similar. Moreover, we also evaluated the performance of the classification algorithm when the entire synthetic 

Maria does not appear to explicitly disclose the following limitations:
determining a region for the at least one synthetic dataset based on the evaluation score, wherein the region includes one of (i) a normal region where the at least one synthetic dataset is unlikely to contain synthetic data that is similar to original data within the at least one original dataset, (ii) a warning region where the at least one synthetic dataset at least one of (a) potentially contains the synthetic data that is similar to the original data or (b) the synthetic data does not substantially match a schema of the at least one original dataset, or (iii) a red flag region where the at least one synthetic dataset is likely to contain the synthetic data that is similar to the original data; and 
generating a suggestion based on the evaluation score and the determined region, wherein the suggestion includes one of (i) indicating that the at least one synthetic dataset is adequate or (ii) warning that the at least one synthetic dataset potentially contains information similar to the at least one original dataset.  

However, Veeramachaneni discloses determining a region for the at least one synthetic dataset based on the evaluation score, wherein the region includes one of (i) a normal region where the at least one synthetic dataset is unlikely to contain synthetic data that is similar to original data within the at least one original dataset, (ii) a warning region where the at least one synthetic dataset at least one of (a) potentially contains the synthetic data that is similar to the original data or (b) the synthetic data does not substantially match a schema of the at least one original dataset, or (iii) a red flag region where the at least one synthetic dataset is likely to contain the synthetic data that is similar to the original data; on [0301-0308] To test for this, following steps were performed for each subject's submitted work: [1) Let c be the original control dataset. Let v be the version of the dataset that this subject was given. 2) If c≠v, split v into a train set and validation set. 3) Use the train split to create a model using the submitted features, f. 4) Record the accuracy of f on the validation split. This is the synthetic score, A.sub.s(f). 5) Now use f to predict values in the original dataset, c. Record the accuracy as the real score, A.sub.r(f). Thus, for every subject who was not in the control group, a synthetic score can be calculated, A.sub.S(f), and a corresponding real score can be calculated, A.sub.r(f) for their features. The synthetic score simulates the data scientist's estimate of how accurate their work is. The real score is the actual accuracy. Hypothesis: There is a strong correlation between the synthetic score and the real score for each subject's work. A generally positive correlation means that the synthesized datasets give 
[0285-0289] “For each dataset, the SDV created four versions of data, each a condition for a within-subjects experiment with hired data scientists. These conditions were: 1) Control: The subject is presented with the original version of the dataset. 2) No Noise (Synthesized): The subject is presented with the synthesized output from the SDV's algorithm. [correspond to potentially contains the synthetic data that is similar to the original data or one synthetic dataset is likely to contain the synthetic data that is similar to the original data] 3) Table Noise (Synthesized): The subject is presented with synthesized noised output from the SDV's algorithm. The noise is introduced by taking every covariance value, σ.sub.ij, i≠j and halving it, effectively reducing the strength of the covariance. [correspond to one synthetic dataset is unlikely to contain synthetic data that is similar to original data within the at least one original dataset]  4) Key Noise (Synthesized): The subject is presented with synthesized noised output from the SDV's algorithm. The noise is introduced by randomly sampling a primary key for the foreign key relation instead of performing an inference.”
Examiner considers synthetic datasets with different noise to correspond to different regions. The synthetic score is used to determine the type of region. The performance metrics of Maria are used in combination with the synthetic score to determine the type of region.
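For illustration only, the Table Noise condition quoted above (taking every covariance value σ.sub.ij, i≠j, and halving it) can be sketched in Python as follows. The function name halve_off_diagonal and the example matrix are hypothetical and are not taken from Veeramachaneni; the sketch assumes the covariance matrix is supplied as a square list of lists.

```python
def halve_off_diagonal(covariance):
    """Return a copy of a square covariance matrix with every entry
    sigma_ij (i != j) halved; the diagonal (variances) is left unchanged."""
    size = len(covariance)
    if any(len(row) != size for row in covariance):
        raise ValueError("covariance matrix must be square")
    return [
        [value if i == j else value / 2.0 for j, value in enumerate(row)]
        for i, row in enumerate(covariance)
    ]


# Hypothetical 2x2 covariance matrix, used only to exercise the function.
example = [[4.0, 2.0], [2.0, 9.0]]
noised = halve_off_diagonal(example)
```

In this sketch the variances on the diagonal are preserved and only the strength of the pairwise covariances is reduced, which matches the wording of the quoted passage.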
However, Veeramachaneni discloses generating a suggestion based on the evaluation score and the determined region, wherein the suggestion includes one of (i) indicating that the at least one synthetic dataset is adequate on [0332-0339] “4. Demonstrate that the SDV meets its goals for usability and generalizability by using  gives effective feedback regarding its application to real data does not interfere with the data scientists' ability to make accurate predictions does not produce confusing data that impedes the data scientists' progress. Thus, the SDV successfully builds generative models for relational databases, and is a viable solution for synthesizing data.”
[0308-0310] Hypothesis: There is a strong correlation between the synthetic score and the real score for each subject's work. A generally positive correlation means that the synthesized datasets give feedback that reasonably estimates the correct feedback. This implies that the synthesized data can be used successfully for data science. … Afterwards, a 2-sample paired t-test was performed on each submission's synthetic and accuracy score. The result showed that there was no significant difference between the two scores (t=0.812, p=0.427). This enables to conclude that A.sub.r(f)≈A.sub.s(f), a tighter constraint than we had initially set out to prove. It supports the belief that 
[0327] “The synthetic output from SDV can replace original data for the purposes of data science. The results indicate that data scientists were able to work as effectively with the synthetic output as they were with the original data. In particular, a regression between the cross validation and test score showed that the synthetic data gave the correct feedback to data scientists when validating their models (p<0.001). A comparison in overall accuracies between the original and synthetic data showed no statistically significant effects between the type of data and the data scientist's ultimate performance on the test set.”
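For illustration only, the correlation and the 2-sample paired t-test between synthetic scores and real scores described in the passages quoted above can be sketched as follows. The score lists are hypothetical values and are not data from Veeramachaneni; the function names are assumptions. The sketch computes only the Pearson correlation and the paired t statistic and does not compute a p-value.

```python
import math
import statistics


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


def paired_t_statistic(xs, ys):
    """t statistic of a paired t-test: mean difference over its standard error."""
    differences = [x - y for x, y in zip(xs, ys)]
    standard_error = statistics.stdev(differences) / math.sqrt(len(differences))
    return statistics.mean(differences) / standard_error


# Hypothetical per-subject scores (assumed values, for illustration only).
synthetic_scores = [0.70, 0.80, 0.90, 0.60]  # plays the role of A_s(f)
real_scores = [0.72, 0.78, 0.91, 0.61]       # plays the role of A_r(f)

r = pearson_correlation(synthetic_scores, real_scores)
t = paired_t_statistic(synthetic_scores, real_scores)
```

A correlation near 1 together with a t statistic near 0 is the pattern the quoted hypothesis describes, namely that the synthetic score tracks the real score without a systematic offset.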
Maria and Veeramachaneni are analogous art because they are from the same field of endeavor, synthetic data analysis.
Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art, having the teachings of Maria and Veeramachaneni before him or her, to modify the method of Maria to include the validation and accuracy test feature of Veeramachaneni because this combination provides synthetic data that could replace original data for the purposes of data science while maintaining accuracy and performance.
The suggestion/motivation for doing so would have been Veeramachaneni [0325] “The SDV was successful for each of the goals for generalizability, usability, and accuracy. The SDV can be applied generally to a variety of relational datasets. During the experimentation phase, the SDV was applied to Biodegradability, Mutagenesis, 
Therefore, it would have been obvious to combine Maria and Veeramachaneni to obtain the invention as specified in the instant claim(s).

Claim 22. (New) The computer-accessible medium of claim 21, wherein the at least one model includes a first model and a second model, and wherein the computer arrangement is further configured to: Maria discloses train the first model using the at least one original dataset; and train the second model using the at least one synthetic dataset.  
Maria: (page 530 section 4.2.1) “First, we analyzed the main features of the Iris dataset and we observed that there is a strong correlation between some of its attributes. This implies that some attributes have to be generated at the same time to ensure consistent values. For that purpose, we defined an item profile for each of the three classes considered in the Iris dataset. The attributes considered relevant for the item profile are in this case ‘‘petallength’’ and ‘‘petalwidth’’, and for each of the three classes we determined the appropriate range of values for those attributes, based on the real dataset available. Then, we generated 150 instances (like in the original dataset) and we evaluated a Naïve Bayes classification algorithm on both the original (correspond to train first model using original dataset) and the synthetic datasets (correspond to train second model using synthetic dataset)”

Claim 23. (New) The computer-accessible medium of claim 22, Maria discloses wherein the computer arrangement is configured to evaluate the at least one synthetic dataset by comparing first results from the training of the first model to second results from the training of the second model.  
Maria: (page 530 section 4.2.1) “With the original dataset, Weka obtained the following performance metrics: MAE = 0.0342, precision = 0.96, recall = 0.96, and F-measure = 0.96; with the synthetic dataset, it obtained: MAE = 0.0177, precision = 0.987, recall =0.987, and F-measure =0.987. As it can be observed, the performance over the two datasets is similar. Moreover, we also evaluated the performance of the classification algorithm (correspond to evaluate the synthetic dataset by comparing first results to second result) when the entire synthetic dataset (correspond to second result from the training of the second model) is used for training and the entire real dataset (correspond to first result from the training of the first model) is used for testing, obtaining the following performance metrics, which are also similar: MAE = 0.0519, precision =0.949, recall =0.94, and F-measure =0.94. We also compared the attribute values themselves (average values, standard deviation, percentage of unique values, and number of items of each profile), and the results obtained are also similar.”
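For illustration only, the comparison of first results to second results relied upon above can be sketched as follows, using the metric values quoted from Maria. The function names and the tolerance of 0.05 are assumptions; Maria states only that the performance over the two datasets "is similar" and does not give a numeric cutoff.

```python
# Metric values quoted from Maria, page 530, section 4.2.1.
original_results = {"MAE": 0.0342, "precision": 0.96, "recall": 0.96, "F-measure": 0.96}
synthetic_results = {"MAE": 0.0177, "precision": 0.987, "recall": 0.987, "F-measure": 0.987}


def metric_gaps(first_results, second_results):
    """Absolute difference for every metric present in both result sets."""
    shared = first_results.keys() & second_results.keys()
    return {name: abs(first_results[name] - second_results[name]) for name in sorted(shared)}


def results_similar(first_results, second_results, tolerance):
    """True when no shared metric differs by more than `tolerance` (an assumed cutoff)."""
    return all(gap <= tolerance for gap in metric_gaps(first_results, second_results).values())


gaps = metric_gaps(original_results, synthetic_results)
similar = results_similar(original_results, synthetic_results, tolerance=0.05)
```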

Claim 24. (New) The computer-accessible medium of claim 23, Maria discloses wherein the computer arrangement is configured to compare the first results to the second results using an analysis of variance procedure.  
Maria: (page 530 section 4.2.1) “With the original dataset, Weka obtained the following performance metrics: MAE = 0.0342, precision = 0.96, recall = 0.96, and F-measure =  when the entire synthetic dataset (correspond to second result from the training of the second model) is used for training and the entire real dataset (correspond to first result from the training of the first model) is used for testing, obtaining the following performance metrics (correspond to analysis of variance procedure), which are also similar: MAE = 0.0519, precision =0.949, recall =0.94, and F-measure =0.94. We also compared the attribute values themselves (average values, standard deviation, percentage of unique values, and number of items of each profile) (correspond to analysis of variance procedure), and the results obtained are also similar.”

Claim 25. (New) The computer-accessible medium of claim 22, Maria discloses wherein the computer arrangement is configured to compare the first results to the second results using a threshold procedure.  
Maria: (page 534 section 4.4)

[Figure from Maria, page 534, section 4.4 (media_image1.png); image not reproduced]


Maria: (page 527 section 4) “In particular, we compare an approach of Contextual Modeling (CM) [11,12] as a classification algorithm based on Naïve Bayes [13] with a traditional user–user collaborative filtering algorithm based on SVD (Singular Value Decomposition) [14,15]; in both cases, the class of items to recommend contains the items whose predicted rating (from one to five) is above a threshold of three. (correspond to threshold procedure)”
Examiner considers “a threshold procedure” to correspond to the predicted rating being above a threshold of three. The predicted rating is included in the step “Comparable result?” in Fig. 24. Thus, examiner considers “Comparable result?” to correspond to “compare the first results to the second results using a threshold procedure”.


Claim 26. (New) The computer-accessible medium of claim 25, Maria discloses wherein the threshold procedure includes: summing first errors from the first results; summing second errors from the second results; and comparing the summed first errors to the summed second errors.  
Maria: (page 530 section 4.2.1) “With the original dataset, Weka obtained the following performance metrics: (Mean Absolute Error) MAE = 0.0342 (correspond to summing first errors from the first results), precision = 0.96, recall = 0.96, and F-measure = 0.96; with the synthetic dataset, it obtained: MAE = 0.0177 (correspond to summing second errors from the second results), precision = 0.987, recall =0.987, and F-measure =0.987. As it can be observed, the performance over the two datasets is similar. Moreover, we also evaluated the performance of the classification algorithm when the entire synthetic dataset is used for training and the entire real dataset is used for testing, obtaining the following performance metrics, which are also similar: MAE = 0.0519 precision =0.949, recall =0.94, and F-measure =0.94. We also compared the attribute values themselves (average values, standard deviation, percentage of unique values, and number of items of each profile) (correspond to comparing the summed first errors to the summed second errors), and the results obtained are also similar.”
Examiner considers the MAE (Mean Absolute Error) to correspond to summing errors.
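For illustration only, the relationship between a mean absolute error and a sum of errors can be sketched as follows: the MAE is conventionally the sum of the absolute errors divided by the number of predictions. The function names and the sample values are hypothetical and are not taken from Maria.

```python
def summed_absolute_error(actual, predicted):
    """Sum of absolute differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted))


def mean_absolute_error(actual, predicted):
    """MAE: the summed absolute error divided by the number of predictions."""
    return summed_absolute_error(actual, predicted) / len(actual)


# Hypothetical targets and predictions from two models (assumed values).
actual = [1.0, 0.0, 1.0, 1.0]
first_predictions = [0.9, 0.1, 0.8, 1.0]
second_predictions = [1.0, 0.0, 0.9, 0.9]

first_sum = summed_absolute_error(actual, first_predictions)
second_sum = summed_absolute_error(actual, second_predictions)
first_has_larger_error = first_sum > second_sum
```

Because both models are scored on the same number of predictions in this sketch, comparing the two sums gives the same ordering as comparing the two MAE values.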


Claim 27. (New) The computer-accessible medium of claim 26, Maria discloses wherein the computer arrangement is configured to compare the summed first errors to the summed second errors using a threshold criterion.  
Maria: (page 530 section 4.2.1) “With the original dataset, Weka obtained the following performance metrics: (Mean Absolute Error) MAE = 0.0342 (correspond to summing first errors from the first results), precision = 0.96, recall = 0.96, and F-measure = 0.96; with the synthetic dataset, it obtained: MAE = 0.0177 (correspond to summing second errors from the second results), precision = 0.987, recall =0.987, and F-measure =0.987. As it can be observed, the performance over the two datasets is similar. Moreover, we also evaluated the performance of the classification algorithm when the entire synthetic dataset is used for training and the entire real dataset is used for testing, obtaining the following performance metrics, which are also similar: MAE = 0.0519 precision =0.949, recall =0.94, and F-measure =0.94. We also compared the attribute values themselves (average values, standard deviation, percentage of unique values, and number of items 
Examiner considers the MAE (Mean Absolute Error) to correspond to summing errors.
Maria: (page 527 section 4) “In particular, we compare an approach of Contextual Modeling (CM) [11,12] as a classification algorithm based on Naïve Bayes [13] with a traditional user–user collaborative filtering algorithm based on SVD (Singular Value Decomposition) [14,15]; in both cases, the class of items to recommend contains the items whose predicted rating (from one to five) is above a threshold of three. (correspond to threshold procedure)”
Examiner considers “a threshold criterion” to correspond to the predicted rating being above a threshold of three. The predicted rating is included in the step “Comparable result?” in Fig. 24. Thus, examiner considers “Comparable result?” to correspond to “compare the summed first errors to the summed second errors using a threshold criterion”.

Claim 28. (New) The computer-accessible medium of claim 25, Maria discloses wherein the threshold procedure includes determining a further statistical correlation based on a plurality of covariance matrices.  
Maria: (page 531) “The first experiment generates a synthetic dataset that tries to replicate the original LDOS-CoMoDa dataset, by applying the workflow described in Fig. 10. After replicating the dataset, we compared the histograms of the different user attributes in the original and generated datasets, the distributions of the context variables in both, and the corresponding statistical properties of the ratings generated. 
See page 524 for description of matrix.
Maria: (page 524) “In our case, A is a rectangular matrix of dimension MxN, X is the vector of weights of the utility function that we want to determine (i.e., the unknowns of the system of linear equations), and B is the vector containing the ratings provided by the user. Each row of the matrix A represents the scores (on a scale of 1 to 5) of the different attributes characterizing the corresponding rating in the vector B, M is the number of ratings per user (i.e., the length of the vector B), and N is the number of attributes characterizing the rating (including item attributes and context attributes).”

Claim 29. (New) The computer-accessible medium of claim 22, Maria discloses wherein the first model is equivalent to the second model.  
Maria: (page 530 section 4.2.1) “First, we analyzed the main features of the Iris dataset and we observed that there is a strong correlation between some of its attributes. This implies that some attributes have to be generated at the same time to ensure consistent values. For that purpose, we defined an item profile for each of the three classes considered in the Iris dataset. The attributes considered relevant for the item profile are in this case ‘‘petallength’’ and ‘‘petalwidth’’, and for each of the three classes we determined the appropriate range of values for those attributes, based on the real dataset available. Then, we generated 150 instances (like in the original dataset) and we evaluated a Naïve Bayes classification algorithm on both the original (correspond to train first model using original dataset) and the synthetic datasets (correspond to train second model using synthetic dataset)”
Examiner considers the first model to be equivalent to the second model since both models use the Naïve Bayes classification algorithm.

Claim 30. (New) The computer-accessible medium of claim 21, Maria discloses wherein the at least one model is a classification model.  
Maria: (page 530 section 4.2.1) “First, we analyzed the main features of the Iris dataset and we observed that there is a strong correlation between some of its attributes. This implies that some attributes have to be generated at the same time to ensure consistent values. For that purpose, we defined an item profile for each of the three classes considered in the Iris dataset. The attributes considered relevant for the item profile are 

Claim 31. (New) The computer-accessible medium of claim 21, Maria discloses wherein the computer arrangement is further configured to generate the at least one synthetic dataset.  
Maria: (page 529 section 4.2) “First, in Section 4.2.1, we present an experiment where we generate a synthetic dataset of items that exhibits features similar to those present in a pre-existing real dataset.”

Claim 32. (New) The computer-accessible medium of claim 31, Maria discloses wherein the computer arrangement is configured to generate the at least one synthetic dataset based on the at least one original dataset.  
Maria: (page 529 section 4.2) “First, in Section 4.2.1, we present an experiment where we generate a synthetic dataset of items that exhibits features similar to those present in a pre-existing real dataset (correspond to original data).”

Claim 33. (New) The computer-accessible medium of claim 21, Maria discloses wherein the computer arrangement is further configured to generate at least one further synthetic dataset based on (i) the at least one synthetic dataset and (ii) the evaluation of the at least one synthetic dataset.  

[Figure from Maria (media_image2.png); image not reproduced]
Maria: (page 526) “As a summary, Fig. 24 shows some basic tasks that can be performed to evaluate synthetic datasets; it should be noted that, for simplicity, the performance metrics shown in the figure are related to the prediction accuracy of the recommendation algorithms, but any other relevant evaluation metric could be considered.”
Examiner considers that, when the result of the comparison is not good, the process returns to generating a (e.g., new) synthetic dataset based on the result of the comparison of original versus synthetic and on the previous synthetic dataset.
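For illustration only, the loop described in the examiner's reading of Fig. 24 above (evaluate the synthetic dataset and, when the comparison is not good, generate a further synthetic dataset from the previous one and its evaluation) can be sketched as follows. All names, the toy generator, and the 0.3 cutoff are hypothetical; Maria does not provide this code.

```python
import statistics


def regenerate_until_comparable(generate, evaluate, is_comparable, max_rounds=5):
    """Generate, evaluate, and regenerate until the evaluation is acceptable."""
    synthetic, evaluation = None, None
    for round_number in range(1, max_rounds + 1):
        # Each new dataset is derived from the previous dataset and its evaluation.
        synthetic = generate(synthetic, evaluation)
        evaluation = evaluate(synthetic)
        if is_comparable(evaluation):
            return synthetic, evaluation, round_number
    raise RuntimeError("no comparable synthetic dataset within max_rounds")


# Toy stand-ins (assumptions): a dataset is a list of numbers and the
# evaluation is the gap between the synthetic mean and the original mean.
original = [1.0, 2.0, 3.0]


def toy_generate(previous, gap):
    if previous is None:
        return [0.0, 0.0, 0.0]
    return [value + gap / 2.0 for value in previous]  # move halfway toward the original mean


def toy_evaluate(synthetic):
    return abs(statistics.mean(original) - statistics.mean(synthetic))


result_dataset, final_gap, rounds = regenerate_until_comparable(
    toy_generate, toy_evaluate, lambda gap: gap <= 0.3
)
```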

Claim 34. (New) The computer-accessible medium of claim 21, Maria discloses wherein the at least one original dataset and the at least one synthetic dataset include at least one of (i) biographical information regarding a plurality of customers or (ii) financial information regarding the plurality of customers.  
Maria: (page 527-528) “Specifically, we have synthetically generated a number of ratings (this number varies depending on the experiment), which are values in the range from one to five, for a scenario consisting of 943 users, 1682 restaurants, and 900 contexts. Each rating is tagged with a time and date in the range of the years 1980–2000. The schemas of users, types of items, and contexts considered, are defined as follows (in the case of categorical attributes, we indicate the possible values in brackets): • Users: age, gender, occupation. • Restaurants: web_name, address, province, country, phone, weekday_is_open, hour, type_of_food, card, outside, bar, parking, reservation, price, quality_food, quality_service, quality_price, global_rating. • Contexts: transport_way (walking, bicycle, car, public), mobility (stopped, moving), weekday (week, weekend), mood (happy, sad, active, lazy), season (spring, summer, autumn, winter), companion (alone, friends, family, girlfriend, children), temperature (warm, hot, cold), weather (sunny, cloudy, rainy, snowing), distance (near, far), time_of_day (morning, night, afternoon). …” (correspond to biographical information regarding a plurality of customers)

Claim 35. (New) A method for evaluating at least one synthetic dataset, comprising: 
 (a) receiving at least one original dataset; (b) generating the at least one synthetic dataset based on the at least one original dataset; 
Maria: (page 529 section 4.2) “First, in Section 4.2.1, we present an experiment where we generate a synthetic dataset (correspond to generate synthetic dataset related to the original dataset) of items that exhibits features similar to those present in a pre-existing real dataset. (correspond to receive original dataset)”
Maria discloses (c) training at least one first model using the at least one original dataset; (d) training at least one second model using the at least one synthetic dataset; 
Maria: (page 530 section 4.2.1) “First, we analyzed the main features of the Iris dataset and we observed that there is a strong correlation between some of its attributes. This implies that some attributes have to be generated at the same time to ensure consistent values. For that purpose, we defined an item profile for each of the three classes considered in the Iris dataset. The attributes considered relevant for the item profile are in this case ‘‘petallength’’ and ‘‘petalwidth’’, and for each of the three classes we determined the appropriate range of values for those attributes, based on the real dataset available. Then, we generated 150 instances (like in the original dataset) and we evaluated a Naïve Bayes classification algorithm on both the original (correspond to train first model using original dataset) and the synthetic datasets (correspond to train second model using synthetic dataset)”
Maria discloses (e) generating a statistical correlation score based on the at least one synthetic dataset and the at least one original dataset using a computer arrangement, generating an evaluation score by evaluating the at least one synthetic dataset based on the training of the least one first model and the training of the at least one second model, wherein the evaluation score includes the statistical correlation score; 
Maria: (page 530 section 4.2.1) “With the original dataset, Weka obtained the following performance metrics: MAE = 0.0342, precision = 0.96, recall = 0.96, and F-measure = 0.96; with the synthetic dataset, it obtained:  MAE = 0.0177, precision = 0.987, recall = 0.987, and F-measure = 0.987. [correspond to generating a statistical correlation score based on the at least one synthetic dataset and the at least one original dataset] As it can be observed, the performance over the two datasets is similar. Moreover, we also evaluated the performance of the classification algorithm when the entire synthetic dataset is used for training and the entire real dataset is used for testing, obtaining the following performance metrics, which are also similar: MAE = 0.0519, precision = 0.949, recall = 0.94, and F–measure = 0.94. [correspond to generating an evaluation score by evaluating the at least one synthetic dataset based on the training of the least one model] We also compared the attribute values themselves (average values, standard deviation, percentage of unique values, and number of items of each profile), and the results obtained are also similar. [correspond to the evaluation score includes the statistical correlation score]”

Maria does not appear to explicitly disclose the following limitations:
determining a region for the at least one synthetic dataset based on the evaluation score, wherein the region includes one of (i) a normal region where the at least one synthetic dataset is unlikely to contain synthetic data that is similar to original data within the at least one original dataset, (ii) a warning region where the at least one synthetic dataset at least one of (a) potentially contains the synthetic data that is similar to the original data or (b) the synthetic data does not substantially match a schema of the at least one original dataset, or (iii) a red flag region where the at least one synthetic dataset is likely to contain the synthetic data that is similar to the original data; and 
generating a suggestion based on the evaluation score and the determined region, wherein the suggestion includes one of (i) indicating that the at least one synthetic dataset is adequate or (ii) warning that the at least one synthetic dataset potentially contains information similar to the at least one original dataset.  

However, Veeramachaneni discloses determining a region for the at least one synthetic dataset based on the evaluation score, wherein the region includes one of (i) a normal region where the at least one synthetic dataset is unlikely to contain synthetic data that is similar to original data within the at least one original dataset, (ii) a warning region where the at least one synthetic dataset at least one of (a) potentially contains the synthetic data that is similar to the original data or (b) the synthetic data does not substantially match a schema of the at least one original dataset, or (iii) a red flag region where the at least one synthetic dataset is likely to contain the synthetic data that is similar to the original data; on [0301-0308] To test for this, following steps were performed for each subject's submitted work: [1) Let c be the original control dataset. Let v be the version of the dataset that this subject was given. 2) If c≠v, split v into a train set and validation 
[0285-0289] “For each dataset, the SDV created four versions of data, each a condition for a within-subjects experiment with hired data scientists. These conditions were: 1) Control: The subject is presented with the original version of the dataset. 2) No Noise (Synthesized): The subject is presented with the synthesized output from the SDV's algorithm. [correspond to potentially contains the synthetic data that is similar to the original data or one synthetic dataset is likely to contain the synthetic data that is similar to the original data] 3) Table Noise (Synthesized): The subject is presented with synthesized noised output from the SDV's algorithm. The noise is introduced by taking every covariance value, σ.sub.ij, i≠j and halving it, effectively reducing the strength of the covariance. [correspond to one synthetic dataset is unlikely to contain synthetic data that is similar to original data within the at least one original dataset]  4) Key Noise (Synthesized): The subject is presented with synthesized noised output from the SDV's 
The Examiner considers synthetic datasets with different noise to correspond to different regions. The synthetic score is used to determine the type of region. The performance metrics of Maria are used in combination with the synthetic score to determine the type of region.
However, Veeramachaneni discloses generating a suggestion based on the evaluation score and the determined region, wherein the suggestion includes one of (i) indicating that the at least one synthetic dataset is adequate on [0332-0339] “4. Demonstrate that the SDV meets its goals for usability and generalizability by using it to model 6 different datasets from a combination of sources: major software consulting firm, the relational database repository, and Kaggle™ 5. Evaluate the SDV's ability to synthesize data for sample databases by working a real-world complex relational database from our sponsor. Demonstrate that the SDV synthesizes data that be used for testing. 6. Formulate metrics to quantify how much synthesized data affects the ability to solve a prediction problem. 7. Perform experiment using Feature Factory, and analyze submitted features to demonstrate that synthetic output from SDV: [correspond to generating a suggestion based on the evaluation score and the determined region, wherein the suggestion includes one of (i) indicating that the at least one synthetic dataset is adequate] gives effective feedback regarding its application to real data does not interfere with the data scientists' ability to make accurate predictions does not produce confusing data that impedes the data scientists' progress. Thus, the SDV successfully builds generative models for relational databases, and is a viable solution for synthesizing data.”

[0327] “The synthetic output from SDV can replace original data for the purposes of data science. The results indicate that data scientists were able to work as effectively with the synthetic output as they were with the original data. In particular, a regression between the cross validation and test score showed that the synthetic data gave the correct feedback to data scientists when validating their models (p<0.001). A comparison in overall accuracies between the original and synthetic data showed no statistically significant effects between the type of data and the data scientist's ultimate performance on the test set.”
Maria and Veeramachaneni are analogous art because they are from the same field of endeavor, synthetic data analysis.
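Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art, having the teachings of Maria and Veeramachaneni before him or her, to modify the method of Maria to include the validation and accuracy test feature of Veeramachaneni because this combination provides synthetic data that can replace the original data for the purposes of data science while maintaining accuracy and performance.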
The suggestion/motivation for doing so would have been Veeramachaneni [0325] “The SDV was successful for each of the goals for generalizability, usability, and accuracy. The SDV can be applied generally to a variety of relational datasets. During the experimentation phase, the SDV was applied to Biodegradability, Mutagenesis, Airbnb, Rossmann, Telstra, and industrial datasets. The SDV was able to model the relational data automatically for each of these datasets, with no changes to the code.”
Therefore, it would have been obvious to combine Maria and Veeramachaneni to obtain the invention as specified in the instant claim(s).

Claim 36. (New) The method of claim 35, Maria discloses further comprising generating at least one further synthetic dataset based on the evaluation score and the at least one synthetic dataset.  

[Image: media_image2.png, greyscale]
Maria: (page 526) “As a summary, Fig. 24 shows some basic tasks that can be performed to evaluate synthetic datasets; it should be noted that, for simplicity, the performance metrics shown in the figure are related to the prediction accuracy of the recommendation algorithms, but any other relevant evaluation metric could be considered.”
The Examiner considers that, when the result of the comparison is not satisfactory, the process returns to generating a (e.g., new) synthetic dataset based on the result of the comparison of the original and synthetic datasets and on the previous synthetic dataset.

Claim 37. (New) The method of claim 36, Maria discloses further comprising training the at least one second model based on the at least one further synthetic dataset.  


[Image: media_image2.png, greyscale]

The Examiner considers that the training process is repeated for the new synthetic dataset.

Claim 38. (New) The method of claim 37, Maria discloses further comprising evaluating the at least one further synthetic dataset based on the training of the at least one second model on the at least one further synthetic dataset.  

[Image: media_image2.png, greyscale]

The Examiner considers that the evaluation or comparison process is repeated for the new synthetic dataset.

Claim 39. (New) A system, comprising: a computer hardware arrangement configured to: 
Maria discloses (a) receive at least one original dataset; (b) receive at least one synthetic dataset related to the at least one original dataset; 

Maria discloses (c) train at least one first model using the at least one original dataset; 
Maria: (page 530 section 4.2.1) “First, we analyzed the main features of the Iris dataset and we observed that there is a strong correlation between some of its attributes. This implies that some attributes have to be generated at the same time to ensure consistent values. For that purpose, we defined an item profile for each of the three classes considered in the Iris dataset. The attributes considered relevant for the item profile are in this case ‘‘petallength’’ and ‘‘petalwidth’’, and for each of the three classes we determined the appropriate range of values for those attributes, based on the real dataset available. Then, we generated 150 instances (like in the original dataset) and we evaluated a Naïve Bayes classification algorithm on both the original (correspond to train one model using original dataset) and the synthetic datasets”
Maria discloses (d) train at least one second model using the at least one synthetic dataset; 
Maria: (page 530 section 4.2.1) “First, we analyzed the main features of the Iris dataset and we observed that there is a strong correlation between some of its attributes. This implies that some attributes have to be generated at the same time to ensure consistent values. For that purpose, we defined an item profile for each of the three classes considered in the Iris dataset. The attributes considered relevant for the item profile are 
Maria discloses (e) generate a statistical correlation score based on the at least one synthetic dataset and the at least one original dataset; (f) generate an evaluation score by comparing first results from the training of the first model to second results from the training of the second model, wherein the evaluation score includes the statistical correlation score; 
Maria: (page 530 section 4.2.1) “With the original dataset, Weka obtained the following performance metrics: MAE = 0.0342, precision = 0.96, recall = 0.96, and F-measure = 0.96; with the synthetic dataset, it obtained:  MAE = 0.0177, precision = 0.987, recall = 0.987, and F-measure = 0.987. [correspond to generating a statistical correlation score based on the at least one synthetic dataset and the at least one original dataset] As it can be observed, the performance over the two datasets is similar. Moreover, we also evaluated the performance of the classification algorithm when the entire synthetic dataset is used for training and the entire real dataset is used for testing, obtaining the following performance metrics, which are also similar: MAE = 0.0519, precision = 0.949, recall = 0.94, and F–measure = 0.94. [correspond to generating an evaluation score by evaluating the at least one synthetic dataset based on the training of the least one model] We also compared the attribute values themselves (average values, standard deviation, percentage of unique values, and number of items of each profile), and the 
(i) modify the at least one synthetic dataset based on the comparison; and 
(j) repeat procedures (d)-(f) until the comparison of the first results to the second results is less than a particular threshold.  
Maria: (page 534 section 4.4)

[Image: media_image1.png, greyscale]

Maria: (page 527 section 4) “In particular, we compare an approach of Contextual Modeling (CM) [11,12] as a classification algorithm based on Naïve Bayes [13] with a traditional user–user collaborative filtering algorithm based on SVD (Singular Value Decomposition) [14,15]; in both cases, the class of items to recommend contains the items whose predicted rating (from one to five) is above a threshold of three. (correspond to particular threshold)”

A person of ordinary skill in the art would know that the threshold comparison can be set as above or below the threshold, depending on the configuration of the model.
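
For illustration only, the modify-and-repeat loop of limitations (i) and (j) can be sketched as follows. The sketch is not taken from Maria or from the instant application: the names train_and_score and refine_synthetic, the use of a dataset mean as the "training result", the halving adjustment, and the 0.05 threshold are all hypothetical stand-ins chosen so that the example runs on its own.

```python
from statistics import mean


def train_and_score(dataset):
    """Hypothetical stand-in for training a model; the 'result' is just the mean."""
    return mean(dataset)


def refine_synthetic(original, synthetic, threshold=0.05, max_rounds=100):
    """Modify the synthetic data until the result gap falls below the threshold.

    Returns the (possibly modified) synthetic data, the last measured gap,
    and the number of modification rounds that were applied.
    """
    first_result = train_and_score(original)        # result on the original data
    gap = float("inf")
    for rounds in range(max_rounds):
        second_result = train_and_score(synthetic)  # result on the synthetic data
        gap = abs(first_result - second_result)     # comparison of the two results
        if gap < threshold:                         # stop once below the threshold
            return synthetic, gap, rounds
        # Assumed modification rule: shift every value by half of the gap.
        shift = (first_result - second_result) / 2
        synthetic = [x + shift for x in synthetic]
    return synthetic, gap, max_rounds


if __name__ == "__main__":
    data, final_gap, used = refine_synthetic([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
    print(data, final_gap, used)
```

Whether the stopping test is written as "less than" or "greater than" depends only on how the compared quantity is defined (a gap to be minimized versus a similarity to be maximized), which is the configuration choice noted above.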

Maria does not appear to explicitly disclose the following limitations:
(g) determine a region for the at least one synthetic dataset based on the evaluation score, wherein the region includes one of (i) a normal region where the at least one synthetic dataset is unlikely to contain synthetic data that is similar to original data within the at least one original dataset, (ii) a warning region where the at least one synthetic dataset at least one of (a) potentially contains the synthetic data that is similar to the original data or (b) the synthetic data does not substantially match a schema of the at least one original dataset, or (iii) a red flag region where the at least one synthetic dataset is likely to contain the synthetic data that is similar to the original data; and 
(h) generate a suggestion based on the evaluation score and the determined region, wherein the suggestion includes one of (i) indicating that the at least one synthetic dataset is adequate or (ii) warning that the at least one synthetic dataset potentially contains information similar to the at least one original dataset; 

However, Veeramachaneni discloses determining a region for the at least one synthetic dataset based on the evaluation score, wherein the region includes one of (i) a normal region where the at least one synthetic dataset is unlikely to contain synthetic data that is similar to original data within the at least one original dataset, (ii) a warning region where the at least one synthetic dataset at least one of (a) potentially contains the synthetic data that is similar to the original data or (b) the synthetic data does not substantially match a schema of the at least one original dataset, or (iii) a red flag region where the at least one synthetic dataset is likely to contain the synthetic data that is similar to the original data; on [0301-0308] “To test for this, following steps were performed for each subject's submitted work: 1) Let c be the original control dataset. Let v be the version of the dataset that this subject was given. 2) If c≠v, split v into a train set and validation set. 3) Use the train split to create a model using the submitted features, f. 4) Record the accuracy of f on the validation split. This is the synthetic score, A.sub.s(f). 5) Now use f to predict values in the original dataset, c. Record the accuracy as the real score, A.sub.r(f). Thus, for every subject who was not in the control group, a synthetic score can be calculated, A.sub.S(f), and a corresponding real score can be calculated, A.sub.r(f) for their features. The synthetic score simulates the data scientist's estimate of how accurate their work is. The real score is the actual accuracy. Hypothesis: There is a strong correlation between the synthetic score and the real score for each subject's work. A generally positive correlation means that the synthesized datasets give feedback that reasonably estimates the correct feedback. This implies that the synthesized data can be used successfully for data science.”
[0285-0289] “For each dataset, the SDV created four versions of data, each a condition for a within-subjects experiment with hired data scientists. These conditions were: 1) 
The Examiner considers synthetic datasets with different noise to correspond to different regions. The synthetic score is used to determine the type of region. The performance metrics of Maria are used in combination with the synthetic score to determine the type of region.
However, Veeramachaneni discloses generating a suggestion based on the evaluation score and the determined region, wherein the suggestion includes one of (i) indicating that the at least one synthetic dataset is adequate on [0332-0339] “4. Demonstrate that the SDV meets its goals for usability and generalizability by using it to model 6 different datasets from a combination of sources: major software consulting firm, the relational database repository, and Kaggle™ 5. Evaluate the SDV's ability to synthesize data for sample databases by working a real-world complex relational database from our sponsor. Demonstrate that the SDV synthesizes data that be used for testing. 6. Formulate metrics to quantify how much synthesized data affects the ability to solve a prediction problem. 7. Perform experiment using Feature Factory, and analyze submitted features to demonstrate that synthetic output from SDV: gives effective feedback regarding its application to real data does not interfere with the data scientists' ability to make accurate predictions does not produce confusing data that impedes the data scientists' progress. Thus, the SDV successfully builds generative models for relational databases, and is a viable solution for synthesizing data.”
[0308-0310] “Hypothesis: There is a strong correlation between the synthetic score and the real score for each subject's work. A generally positive correlation means that the synthesized datasets give feedback that reasonably estimates the correct feedback. This implies that the synthesized data can be used successfully for data science. … Afterwards, a 2-sample paired t-test was performed on each submission's synthetic and accuracy score. The result showed that there was no significant difference between the two scores (t=0.812, p=0.427). This enables to conclude that A.sub.r(f)≈A.sub.s(f), a tighter constraint than we had initially set out to prove. It supports the belief that synthetic data provides adequate feedback to the data scientist. Hence, the data scientist can use the synthetic data to reasonably gage the usefulness of their work.”
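
The comparison described in the passage quoted above, a correlation between the synthetic score A.sub.s(f) and the real score A.sub.r(f) followed by a paired t-test, can be illustrated with a minimal sketch. The score lists are invented for illustration and are not data from Veeramachaneni, and the function names paired_t_statistic and pearson_r are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(synthetic_scores, real_scores):
    """t statistic of a paired t-test on per-submission score differences."""
    diffs = [s - r for s, r in zip(synthetic_scores, real_scores)]
    n = len(diffs)
    # Mean difference divided by its standard error (sample standard deviation).
    return mean(diffs) / (stdev(diffs) / sqrt(n))


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical accuracies for five submissions (illustration only).
    synthetic = [0.80, 0.75, 0.90, 0.60, 0.85]  # plays the role of A_s(f)
    real = [0.78, 0.77, 0.88, 0.63, 0.84]       # plays the role of A_r(f)
    print("t =", paired_t_statistic(synthetic, real))
    print("r =", pearson_r(synthetic, real))
```

A t statistic close to zero together with a correlation close to one is the pattern the quoted passage describes as supporting A.sub.r(f)≈A.sub.s(f).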
[0327] “The synthetic output from SDV can replace original data for the purposes of data science. The results indicate that data scientists were able to work as effectively with the synthetic output as they were with the original data. In particular, a regression 
Maria and Veeramachaneni are analogous art because they are from the same field of endeavor, synthetic data analysis.
Before the effective filing date of the claimed invention, it would have been obvious to one of ordinary skill in the art, having the teachings of Maria and Veeramachaneni before him or her, to modify the method of Maria to include the validation and accuracy test feature of Veeramachaneni because this combination provides synthetic data that can replace the original data for the purposes of data science while maintaining accuracy and performance.
The suggestion/motivation for doing so would have been Veeramachaneni [0325] “The SDV was successful for each of the goals for generalizability, usability, and accuracy. The SDV can be applied generally to a variety of relational datasets. During the experimentation phase, the SDV was applied to Biodegradability, Mutagenesis, Airbnb, Rossmann, Telstra, and industrial datasets. The SDV was able to model the relational data automatically for each of these datasets, with no changes to the code.”
Therefore, it would have been obvious to combine Maria and Veeramachaneni to obtain the invention as specified in the instant claim(s).

Claim 40. (New) The system of claim 39, wherein the computer arrangement is configured to compare the first results with the second results using an analysis of variance procedure.
Maria: (page 530 section 4.2.1) “With the original dataset, Weka obtained the following performance metrics: MAE = 0.0342, precision = 0.96, recall = 0.96, and F-measure = 0.96; with the synthetic dataset, it obtained: MAE = 0.0177, precision = 0.987, recall =0.987, and F-measure =0.987. As it can be observed, the performance over the two datasets is similar. Moreover, we also evaluated the performance of the classification algorithm (correspond to evaluate the synthetic dataset by comparing first results to second result) when the entire synthetic dataset (correspond to second result from the training of the second model) is used for training and the entire real dataset (correspond to first result from the training of the first model) is used for testing, obtaining the following performance metrics (correspond to analysis of variance procedure), which are also similar: MAE = 0.0519, precision =0.949, recall =0.94, and F-measure =0.94. We also compared the attribute values themselves (average values, standard deviation, percentage of unique values, and number of items of each profile) (correspond to analysis of variance procedure), and the results obtained are also similar.”
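
For reference, a one-way analysis of variance reduces to an F statistic, the ratio of the between-group mean square to the within-group mean square. The sketch below computes it for two groups of results. The function name one_way_anova_f and the per-fold accuracy values are hypothetical; they are not taken from Maria or from the instant application.

```python
from statistics import mean


def one_way_anova_f(*groups):
    """F statistic of a one-way analysis of variance over the given groups."""
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)
    k = len(groups)       # number of groups
    n = len(all_values)   # total number of observations
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


if __name__ == "__main__":
    # Hypothetical per-fold accuracies of two models (illustration only).
    first_results = [0.95, 0.96, 0.97, 0.96, 0.96]   # model trained on original data
    second_results = [0.96, 0.97, 0.96, 0.95, 0.97]  # model trained on synthetic data
    print("F =", one_way_anova_f(first_results, second_results))
```

A small F value relative to the critical value for (k - 1, n - k) degrees of freedom would indicate no significant difference between the first results and the second results.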

Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA  as explained in MPEP § 2159.  See MPEP §§ 706.02(l)(1) - 706.02(l)(3) for applications not subject to examination under the first inventor to file provisions of the AIA . A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b). 
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA /25, or PTO/AIA /26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets 
Claims 21-40 are rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1-20 of US Patent No. US 10635939 B2. Although the claims at issue are not identical, they are not patentably distinct from each other because they claim an obvious variant of the instant claims. See the comparison below.
This is a non-provisional nonstatutory double patenting rejection because the patentably indistinct claims have been patented.
Instant Application (16825040)

Related (US 10635939 B2)
21. A non-transitory computer-accessible medium having stored thereon computer-executable instructions for evaluating at least one synthetic dataset, wherein, when a computer arrangement executes the instructions, the computer arrangement is configured to perform procedures comprising: 
receiving at least one original dataset; receiving the at least one synthetic dataset; 
training at least one model using the at least one original dataset and the at least one synthetic dataset; 
generating a statistical correlation score based on the at least one synthetic dataset and the at least one original dataset; 

generating an evaluation score by evaluating the at least one synthetic dataset based on the training of the least one model, wherein the evaluation score includes the statistical correlation score; 



determining a region for the at least one synthetic dataset based on the evaluation score, wherein the region includes one of 
(i) a normal region where the at least one synthetic dataset is unlikely to contain synthetic data that is similar to original data within the at least one original dataset, 
(ii) a warning region where the at least one synthetic dataset at least one of 
(a) potentially contains the synthetic data that is similar to the original data or 
(b) the synthetic data does not substantially match a schema of the at least one original dataset, or 
(iii) a red flag region where the at least one synthetic dataset is likely to contain the synthetic data that is similar to the original data; and 
generating a suggestion based on the evaluation score and the determined region, wherein the suggestion includes one of 
(i) indicating that the at least one synthetic dataset is adequate or
(ii) warning that the at least one synthetic dataset potentially contains information similar to the at least one original dataset.  

Similar to claims 35 and 39.


receiving at least one original dataset; 
receiving the at least one synthetic dataset; 
training at least one model using the at least one original dataset and the at least one synthetic dataset; 
generating a statistical correlation score based on the at least one synthetic dataset and the at least one original dataset; 

generating an evaluation score by evaluating the at least one synthetic dataset based on the training of the least one model, wherein the evaluation score includes (i) a statistical correlation score, 

determining a region for the at least one synthetic dataset based on the evaluation score, wherein the region includes one of 
(i) a normal region where the at least one synthetic dataset is unlikely to contain synthetic data that is similar to original data within the at least one original dataset, 
(ii) a warning region where the at least one synthetic dataset potentially contains the synthetic data that is similar to the original data, or 



(iii) a red flag region where the at least one synthetic dataset is likely to contain the synthetic data that is similar to the original data; and 
generating a suggestion based on the evaluation score and the determined region, wherein the suggestion includes one of 
(i) indicating that the at least one synthetic dataset is adequate or 
(ii) warning that the at least one synthetic dataset potentially contains information similar to the at least one original dataset.


Similar to claims 15 and 19.


2. The computer-accessible medium of claim 1, wherein the at least one model includes a first model and a second model, and wherein the computer arrangement is further configured to: train the first model using the at least one original dataset; and train the second model using the at least one synthetic dataset.


3. The computer-accessible medium of claim 2, wherein the computer arrangement is configured to evaluate the at least one synthetic dataset by comparing first results from the training of the first model to second results from the training of the second model.
24. The computer-accessible medium of claim 23, wherein the computer arrangement is configured to compare the first results to the second results using an analysis of variance procedure.  

Similar to claim 40.

4. The computer-accessible medium of claim 3, wherein the computer arrangement is configured to compare the first results to the second results using an analysis of variance procedure.

Similar to claim 20.
25. The computer-accessible medium of claim 22, wherein the computer arrangement is configured to compare the first results to the second results using a threshold procedure.  

5. The computer-accessible medium of claim 2, wherein the computer arrangement is configured to compare the first results to the second results using a threshold procedure.
26. The computer-accessible medium of claim 25, wherein the threshold procedure includes: summing first errors from the first results; summing second errors from the second results; and comparing the summed first errors to the summed second errors.  

6. The computer-accessible medium of claim 5, wherein the threshold procedure includes: summing first errors from the first results; summing second errors from the second results; and comparing the summed first errors to the summed second errors.
27. The computer-accessible medium of claim 26, wherein the computer arrangement is configured to compare the summed first errors to the summed second errors using a threshold criterion.  

7. The computer-accessible medium of claim 6, wherein the computer arrangement is configured to compare the summed first errors to the summed second errors using a threshold criterion.
28. The computer-accessible medium of claim 25, wherein the threshold procedure includes determining a further statistical correlation based on a plurality of covariance matrices.  

8. The computer-accessible medium of claim 5, wherein the threshold procedure includes determining a further statistical correlation based on a plurality of covariance matrices.
29. The computer-accessible medium of claim 22, wherein the first model is equivalent to the second model.  

9. The computer-accessible medium of claim 2, wherein the first model is equivalent to the second model.


10. The computer-accessible medium of claim 1, wherein the at least one model is a classification model.
31. The computer-accessible medium of claim 21, wherein the computer arrangement is further configured to generate the at least one synthetic dataset.  

11. The computer-accessible medium of claim 1, wherein the computer arrangement is further configured to generate the at least one synthetic dataset.
32. The computer-accessible medium of claim 31, wherein the computer arrangement is configured to generate the at least one synthetic dataset based on the at least one original dataset.  

12. The computer-accessible medium of claim 11, wherein the computer arrangement is configured to generate the at least one synthetic dataset based on the at least one original dataset.
33. The computer-accessible medium of claim 21, wherein the computer arrangement is further configured to generate at least one further synthetic dataset based on (i) the at least one synthetic dataset and (ii) the evaluation of the at least one synthetic dataset.  

13. The computer-accessible medium of claim 1, wherein the computer arrangement is further configured to generate at least one further synthetic dataset based on (i) the at least one synthetic dataset and (ii) the evaluation of the at least one synthetic dataset.
34. The computer-accessible medium of claim 21, wherein the at least one original dataset and the at least one synthetic dataset include at least one of (i) biographical information regarding a plurality of customers or (ii) financial information regarding the plurality of customers.  

14. The computer-accessible medium of claim 1, wherein the at least one original dataset and the at least one synthetic dataset include at least one of (i) biographical information regarding a plurality of customers or (ii) financial information regarding the plurality of customers.
36. The method of claim 35, further comprising generating at least one further synthetic dataset based on the evaluation score and the at least one synthetic dataset.  

16. The method of claim 15, further comprising generating at least one further synthetic dataset based on the evaluation score and the at least one synthetic dataset.
37. The method of claim 36, further comprising training the at least one second model based on the at least one further synthetic dataset.  

17. The method of claim 16, further comprising training the at least one second model based on the at least one further synthetic dataset.




18. The method of claim 17, further comprising evaluating the at least one further synthetic dataset based on the training of the at least one second model on the at least one further synthetic dataset.


As shown in the table above, limitations that are very similar to each other are paired. The differences between them are not significant. The scope and content of the limitations are similar. 

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to CHUEN-MEEI GAN whose telephone number is (469)295-9127.  The examiner can normally be reached on Monday-Friday 9:00 am to 4:00 pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Rehana Perveen can be reached on (571) 272-3676.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for 






/CHUEN-MEEI GAN/Primary Examiner, Art Unit 2148