DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
The present application was filed on May 17, 2019. 
This office action is in response to amendments and/or remarks filed on January 12, 2022. In the current Amendment, claims 1-6, 9-13, and 16-20 are amended. No claims are cancelled. Claims 1-20 are pending. 
Claim Interpretation
The recitation of “computer-readable storage medium” in claim 16 is interpreted as being non-transitory as per the paragraph below: 
Specification [0084]: A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 2, 7 – 9, and 14 – 16 are rejected under 35 U.S.C. 103 as being unpatentable over Fan et al. (“Communication Efficient Coreset Sampling for Distributed Learning”) in view of Zhang et al. (“Approximate Clustering on Distributed Data Streams”), further in view of Travizano et al. (“Wibson: A Decentralized Data Marketplace”), further in view of Jia et al. (“Towards Efficient Data Valuation Based on the Shapley Value”), further in view of Li et al. (“A Theory of Pricing Private Data”).
Regarding Claim 1, 
Fan teaches 
A method comprising:
receiving, at a data broker from a holder of a first corpus configured for use in training a machine learning application, a coreset for the first corpus, the coreset sharing a dimensionality with the first corpus; (Page 1, Section I: “For a large scale data set, there may be only a small subset of data that is informative to the learning due to redundancy. We construct a coreset [7], namely a small and weighted subset of the data, to approximate the full dataset.” teaches that a coreset shares dimensionality with the large data set (first corpus); Page 3, Section IV: “Therefore, the master node could construct the global coreset by collecting and merging the local coresets generated by distributed nodes.” teaches receiving a coreset for the data (corpus of documents) at a master node. The master node is a data broker because it receives data from the distributed nodes; Page 1, Section I: “The coreset construction in this paper is the generalized framework that bridges the coreset with supervised learning. It shows that, by mining the structure of data using unsupervised learning, the efficiency of supervised learning can be significantly improved regarding the sample complexity.” teaches that the coreset (and the associated corpus of data) is used for supervised learning (machine learning); Page 3: “In this section, we evaluated the empirical performance of the proposed sampling algorithm on 2 middle size datasets of varying type, Web, CovType1 and one large dataset Yahoo!, as summarized in Table 1.” teaches that the corpus of data is a collection of documents or text based data)

Fan does not appear to explicitly teach: 
transmitting, from the data broker to a set of data providers, the coreset; 
receiving, at the data broker from a first data provider of the set of data providers, an incremental value of a second corpus with respect to the first corpus, wherein: 
the incremental value is calculated based at least in part on the coreset; and 
the incremental value indicates an expected performance of a machine learning model trained using both the first and second corpora, as compared to a machine learning model trained using only the first corpus; 
transmitting, from the data broker to the holder of the first corpus, the incremental value;
receiving, at the data broker from the holder of the first corpus, a request to receive the second corpus;
receiving, at the data broker from the first data provider, the second corpus;
validating, by the data broker, the incremental value of the second corpus; and
transmitting, from the data broker to the holder of the first corpus, the second corpus.

However, Zhang teaches: 
transmitting, from the data broker to a set of data providers, the coreset; (Page 1134: “At each site, we create an EH-summary which is a multilevel structure of coresets over the stream. EH-summary is constructed in the similar fashion to that of building an online exponential histogram. Assume that the stream arriving at a site is P. Whenever a block of B data points comes, we perform a coreset computation and append the points in the coreset to the stream one level above… For all levels I > 1, whenever two coresets Cj, Cj+1 come in, we merge them and compute another coreset C121 on top of the merged set and send to level I + 1.” teaches transmitting the coreset to the data stream (set of data providers))
Fan and Zhang are analogous art because they are directed to clustering algorithms to create coresets. 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Zhang’s method for Approximate clustering on distributed data streams into Fan’s method for coreset sampling for distributed learning with a motivation to “[compute] efficient algorithms to compute bounded-error approximate k-median” (Zhang, Page 1138).

The combination of Fan and Zhang does not appear to explicitly teach: 
receiving, at the data broker from a first data provider of the set of data providers, an incremental value of a second corpus with respect to the first corpus, wherein: 
the incremental value is calculated based at least in part on the coreset; and 
the incremental value indicates an expected performance of a machine learning model trained using both the first and second corpora, as compared to a machine learning model trained using only the first corpus; 
transmitting, from the data broker to the holder of the first corpus, the incremental value;
receiving, at the data broker from the holder of the first corpus, a request to receive the second corpus;
receiving, at the data broker from the first data provider, the second corpus;
validating, by the data broker, the incremental value of the second corpus; and
transmitting, from the data broker to the holder of the first corpus, the second corpus.

However, Travizano teaches: 
receiving, at the data broker from the holder of the first corpus, a request to receive the second corpus; (Page 4, Section 3.2: “Data Buyer B chooses the set of Data Sellers T from whom he wants to buy and adds them to the Data Order contract” teaches that the Data buyer requests to receive the data (second corpus) from the data sellers; Fig. 1 teaches that the notary (data broker) manages the smart contract)


receiving, at the data broker from the first data provider, the second corpus; (Page 4, Section 3.2: “Data Sellers Si ∈ T who have their offer approved upload the data file (encrypted with the public key of the Data Buyer PKB) to the requested address UB.” teaches receiving the data files (second corpus) from the data sellers; Fig. 1 teaches that the notary (data broker) manages the smart contract and can receive the encrypted data)

transmitting, from the data broker to the holder of the first corpus, the second corpus. (Page 5, Section 3.2: “With the Notary’s signed certificate, the Data Buyer can close the given Data Response (a.k.a. Data Transaction). The contract will verify this certificate and transfer the money to the Data Seller in scenarios (a) and (b), or to the Data Buyer in scenario (c).” teaches that the data buyer can close the data transaction (indicating that they have received the valid data from the data seller))

Fan, Zhang, and Travizano are analogous art because they are directed to distributed data for use with machine learning. 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Travizano’s decentralized data marketplace into Fan’s method for coreset sampling for distributed learning as modified by Zhang, with the motivation that doing so “provides individuals a way to securely and anonymously sell information in a trusted environment” (Travizano, Page 1).

The combination of Fan, Zhang, and Travizano does not appear to explicitly teach: 
receiving, at the data broker from a first data provider of the set of data providers, an incremental value of a second corpus with respect to the first corpus, wherein: 
the incremental value is calculated based at least in part on the coreset; and 
the incremental value indicates an expected performance of a machine learning model trained using both the first and second corpora, as compared to a machine learning model trained using only the first corpus; 
transmitting, from the data broker to the holder of the first corpus, the incremental value;
validating, by the data broker, the incremental value of the second corpus; and

However, Jia teaches: 
receiving… an incremental value of a second corpus with respect to the first corpus, wherein: 
the incremental value is calculated based at least in part on the coreset; and the incremental value indicates an expected performance of a machine learning model trained using both the first and second corpora, as compared to a machine learning model trained using only the first corpus; (Page 3: “Consider a dataset $D = \{z_i\}_{i=1}^{N}$ containing data from $N$ users. Let $U(S)$ be the utility function, representing the value calculated by the additive aggregation of $\{z_i\}_{i \in S}$ and $S \subseteq I = \{1, \dots, N\}$. Without loss of generality, we assume throughout that $U(\emptyset) = 0$. Our goal is to partition $U_{tot} \triangleq U(I)$, the utility of the entire dataset, to the individual users; more formally, we want to find a function that assigns to user $i$ a number $s(U, i)$ for a given utility function $U$. We suppress the dependency on $U$ when the utility is self-evident and use $s_i$ to represent the value allocated to user $i$. The SV [27] is a classic concept in cooperative game theory to attribute the total gains generated by the coalition of all players. Given a utility function $U(\cdot)$, the SV for user $i$ is defined as the average marginal contribution of $z_i$ to all possible subsets of $D = \{z_i\}_{i \in I}$ formed by other users” teaches using a Shapley value (incremental value) to estimate the utility (expected performance) of a user’s data (second corpus) for training a machine learning model)
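For illustration only, the Shapley value definition quoted above (the average marginal contribution of a user's data over all subsets formed by the other users) can be sketched in a few lines of Python. This is a minimal sketch and not code from Jia or from the application: the names `shapley_values` and `toy_utility` are hypothetical, the toy utility merely stands in for a model-performance measure, and exact enumeration is only practical for a handful of contributors.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, utility):
    """Exact Shapley value of each player for a utility defined on frozensets.

    Hypothetical helper: enumerates every subset of the other players, so it
    is exponential in the number of players.
    """
    n = len(players)
    values = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for size in range(n):
            # Weight |S|! (n - |S| - 1)! / n! averages the marginal
            # contribution over all orderings of the players.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (utility(s | {i}) - utility(s))
        values[i] = total
    return values


def toy_utility(subset):
    """Assumed stand-in for model performance; satisfies U(empty set) = 0."""
    return len(subset) ** 2


values = shapley_values(["a", "b", "c"], toy_utility)
```

Because the toy utility treats all three contributors symmetrically, each receives an equal share of the total utility, and the shares sum to the utility of the full set.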

Fan, Zhang, Travizano, and Jia are analogous art because they are directed to machine learning. 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Jia’s Shapley value into Fan’s method for coreset sampling for distributed learning as modified by Zhang/Travizano with a motivation to “fairly allocate the revenue generated by a ML model to the data contributors” (Jia, Page 1).

The combination of Fan, Zhang, Travizano, and Jia does not appear to explicitly teach: 
transmitting, from the data broker to the holder of the first corpus, the incremental value; 
validating, by the data broker, the incremental value of the second corpus; and

However, Li teaches: 
transmitting, from the data broker to the holder of the first corpus, the incremental value; (Fig. 1 teaches transmitting the price of the data from the market maker (data broker) to the data buyer (holder of the first corpus))
validating, by the data broker, the incremental value of the second corpus; and (Page 4: “The market maker is trusted by the buyer and by each of the data owners. He collects data from the owners and sells it in the form of queries. When a buyer decides to purchase a query, the market maker collects payment, computes the answer to the query, adds noise as appropriate, returns the result to the buyer, and finally distributes individual payments to the data owners. The market maker may retain a fraction of the price as profit.” and Page 5: “The market maker enters two contracts: (1) It promises to answer the buyer’s queries according to an agreed price π, (Section 3) and (2) it promises to compensate the data owners with a micropayments μi(ε) whenever they suffer a privacy loss ε in response to a buyer’s query (Section 5).” teaches that the market maker (data broker) validates the price of the data for a loss of privacy and compensates the data owners with micropayments for this loss)
Fan, Zhang, Travizano, Jia, and Li are analogous art because they are directed to data for machine learning models. 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Li’s validation of privacy loss into Fan’s method for coreset sampling for distributed learning as modified by Zhang/Travizano/Jia with a motivation to compensate the end user for private data (Li, Page 1).

Regarding Claim 2, 
The combination of Fan, Zhang, Travizano, Jia, and Li teaches the method of claim 1.
Fan further teaches:
wherein validating the incremental value of the second corpus comprises: determining a first cost of clustering the second corpus using the coreset; (Page 3, Section IV: “The cost for robustness is $r - 1$ extra small coresets. We will prove in Theorem 3 that the extra $r - 1$ coresets are small… Theorem 3: Suppose the base function $h(x)$ is $\lambda$-Lipschitz and $\eta$-bounded. The distribution for each sample is $1/|\mathcal{D}|$. If $|\mathcal{D}|$ is large enough such that $|\mathcal{D}| > e^{\max(2\lambda, 4\eta^2)}$, then $m_n$ is upper bounded by $O(e^{\max(2\lambda, 4\eta^2)})$.” teaches determining a cost of clustering the data using the coreset)

determining a second cost of clustering the second corpus using a doubled modification of the coreset; and (Page 3, Section IV: “Proof: From Theorem 1, we have $m_n = \left[ \frac{1}{\frac{1}{|\mathcal{D}|} + \sum_{k=1}^{K} \left( W_k^{w,n} d_{X_n} + W_k^{w,n-} e^{-4\eta^2} \right)} \right]$ where $d_{X_n} = e^{-\lambda \| \bar{X}_{G,k}^{w,n} - X_n \|^2}$. For normalized vector $X_n$, we have $\| \bar{X}_{G,i} - X_n \| \le 2$. As $\sum_{k=1}^{K} \left( W_k^{w,n} + W_k^{w,n-} \right) = 1 - \frac{1}{|\mathcal{D}|}$, $m_n$ is further upper bounded by $m_n \le \frac{1}{\frac{1}{|\mathcal{D}|} + e^{-\max(4\lambda, 4\eta^2)}}$. Given the assumption $|\mathcal{D}| > e^{\max(4\lambda, 4\eta^2)}$, we have the desired upper bound for $m_n$ as $O(e^{\max(4\lambda, 4\eta^2)})$.” teaches determining a second cost of clustering the corpus; the second cost uses a doubled modification of the coreset because the initial cost was bounded by $O(e^{\max(2\lambda, 4\eta^2)})$, while the second cost was bounded by $O(e^{\max(4\lambda, 4\eta^2)})$, which doubles the Lipschitz constant ($\lambda$) of the coreset)

determining a difference between the first cost and the second cost. (Page 3, Section IV: “The upper bound for the coreset size depends on ℎ(𝑥)’s lipschitz constant 𝜆 and maximum value 𝜂 (If the data is not the normalized vector, then it is also related to the dimension 𝑑 for 𝑋). Generally speaking, if ℎ(𝑥) has broader range and sharper derivative, which capture the complexity of ℎ(𝑥), then larger sample size ∣𝑀∣ is required. If the base function is too complicated that exceeds the descriptive capacity of 𝒟, the coreset will approach 𝒟 itself.” teaches comparing the difference between the first and second cost and determining that the difference is dependent on ℎ(𝑥)’s lipschitz constant 𝜆 and maximum value 𝜂)
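The three steps recited in claim 2 (a first cost, a second cost, and their difference) can be sketched as follows. This is an illustration under stated assumptions rather than the claimed or cited implementation: "cost of clustering" is taken here to be the k-means objective (the sum of squared distances from each point to its nearest center), which neither the claim language nor the Fan excerpts above confirm; `clustering_cost`, `cost_difference`, and the sample data are hypothetical; and how the modified coreset is derived from the original is deliberately left unspecified.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def clustering_cost(points, centers):
    """Assumed cost: sum of squared distances to the nearest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


def cost_difference(points, coreset, modified_coreset):
    """Difference between the cost under the coreset and under its modification."""
    first_cost = clustering_cost(points, coreset)
    second_cost = clustering_cost(points, modified_coreset)
    return first_cost - second_cost


# Hypothetical second corpus and two candidate sets of centers.
second_corpus = [(0.0, 0.0), (2.0, 0.0)]
coreset = [(0.0, 0.0)]
modified_coreset = [(0.0, 0.0), (2.0, 0.0)]

difference = cost_difference(second_corpus, coreset, modified_coreset)
```

In this toy case the first cost is 4.0 and the second is 0.0, so the difference is 4.0; a small difference would indicate that the modification adds little descriptive power for the second corpus.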

Regarding Claim 7, 
The combination of Fan, Zhang, Travizano, Jia, and Li teaches the method of claim 1.
Fan further teaches: 
further comprising generating the coreset by performing k-means clustering on the first corpus, wherein the coreset is of smaller size than the first corpus. (Page 1, Section II: “The whole dataset 𝒟 is clustered into 𝐾 clusters based on feature 𝑋 using k-means clustering.” and Page 3, Section III: “The proposed coreset construction algorithm is computationally efficient. The clustering can be obtained efficiently via the k-means++ algorithm in 𝑂(∣𝒟∣𝐾) time [13].” teaches that the coreset is generated by performing k-means clustering on the dataset (first corpus); Page 1, Section I: “For a large scale data set, there may be only a small subset of data that is informative to the learning due to redundancy. We construct a coreset [7], namely a small and weighted subset of the data, to approximate the full dataset.” teaches that the coreset is smaller than the dataset)
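As a rough illustration of how k-means clustering can yield a small, weighted summary of a larger dataset, the sketch below clusters the points and returns each centroid weighted by its cluster size. It is not Fan's sampling procedure, which the excerpts above do not fully specify: the function `kmeans_weighted_summary`, the first-k-points initialization, and the sample data are assumptions made for this example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centers):
    """Group points by the index of their nearest center."""
    clusters = [[] for _ in centers]
    for p in points:
        nearest = min(range(len(centers)),
                      key=lambda j: squared_distance(p, centers[j]))
        clusters[nearest].append(p)
    return clusters


def kmeans_weighted_summary(points, k, iterations=20):
    """Return [(centroid, weight), ...], where weight is the cluster size.

    Assumption: centers are initialized to the first k points, which keeps
    the sketch deterministic; k-means++ would be the usual choice.
    """
    centers = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = assign(points, centers)
        new_centers = []
        for j, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centers.append(tuple(
                    sum(m[d] for m in members) / len(members)
                    for d in range(dim)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    # Recompute the assignment so the weights match the final centers.
    weights = [len(members) for members in assign(points, centers)]
    return list(zip(centers, weights))


# Hypothetical two-dimensional first corpus.
first_corpus = [(0.0, 0.0), (10.0, 0.0), (0.0, 1.0), (10.0, 1.0), (1.0, 0.0)]
summary = kmeans_weighted_summary(first_corpus, k=2)
```

The summary has two weighted entries for five input points, and each entry has the same number of coordinates as the original points, which mirrors the "smaller size" and "sharing a dimensionality" language discussed above.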

Regarding Claim 8, 
The combination of Fan, Zhang, Travizano, Jia, and Li teaches the method of claim 1.
Travizano further teaches: 
further comprising: based on transmitting the second corpus to the holder of the first corpus, establishing an exchange has occurred between the holder of the first corpus and the first data provider; and  (Page 5, Section 3.2: “With the Notary’s signed certificate, the Data Buyer can close the given Data Response (a.k.a. Data Transaction). The contract will verify this certificate and transfer the money to the Data Seller in scenarios (a) and (b), or to the Data Buyer in scenario (c).” teaches that the data buyer can close the data transaction (indicating that they have received the valid data (second corpus) from the data seller); Page 3, Section 3.2: “The Data Buyer B creates a Data Order query DO = <A, R, PKB, UB, ma,tc>… The DO includes: (i) intended audience A (filter of potential sellers), (ii) data requested R, (iii) the Data Buyer’s public key PKB, (iv) public URL to upload Data Seller’s responses and encrypted data via HTTPS post UB, (v) minimum audit budget ma, (vi) terms and conditions of data use tc.” teaches establishing a data order; Page 5, Section 3.2: “Once the Data Buyer receives the personal information, the next step is to close the transaction and transfer the tokens accordingly” teaches that the Data buyer closes the transaction over the data order, signifying that the exchange of data has occurred between the data buyer (holder of first corpus) and data seller (data provider))

recording the exchange in a blockchain. (Page 1: “Our aim is for Wibson to be a blockchain-based, decentralized data marketplace that provides individuals a way to securely and anonymously sell information in a trusted environment.” teaches that the exchange is recorded in a blockchain)

Fan, Zhang, and Travizano are analogous art because they are directed to distributed data for use with machine learning. 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Travizano’s decentralized data marketplace into Fan’s method for coreset sampling for distributed learning as modified by Zhang, with the motivation that doing so “provides individuals a way to securely and anonymously sell information in a trusted environment” (Travizano, Page 1).

Regarding Claim 9, 
This claim recites a system that performs a plurality of operations as recited by the method of claim 1 and has limitations similar to those of claim 1; it is therefore rejected under the same rationale applied against claim 1.
Fan further teaches: 
A system, comprising: a processor; and a memory coupled to the processor, wherein the processor is configured to: (Page 3: “In this section, we evaluated the empirical performance of the proposed sampling algorithm on 2 middle size datasets of varying type, Web, CovType1 and one large dataset Yahoo!, as summarized in Table 1” suggests a computer-based implementation because these datasets are obtained from the internet and algorithms such as AdaBoost, SmoothBoost, and AgnBoost are performed on these datasets (Page 4, Section 5B))

Regarding Claim 14, 
This claim recites the system of claim 9, which performs a plurality of operations as recited by the method of claim 7 and has limitations similar to those of claim 7; it is therefore rejected under the same rationale applied against claim 7.

Regarding Claim 15, 
This claim recites the system of claim 9, which performs a plurality of operations as recited by the method of claim 8 and has limitations similar to those of claim 8; it is therefore rejected under the same rationale applied against claim 8.

Regarding Claim 16, 
This claim recites a computer program product that performs a plurality of operations as recited by the method of claim 1 and has limitations similar to those of claim 1; it is therefore rejected under the same rationale applied against claim 1.
Fan further teaches: 
A computer program product for exchanging corpora via a data broker, the computer program product comprising: a computer-readable storage medium having computer-readable program code embodied therewith, the computer-readable program code executable by one or more computer processors to: (Page 3: “In this section, we evaluated the empirical performance of the proposed sampling algorithm on 2 middle size datasets of varying type, Web, CovType1 and one large dataset Yahoo!, as summarized in Table 1” suggests a computer-based implementation because these datasets are obtained from the internet and algorithms such as AdaBoost, SmoothBoost, and AgnBoost are performed on these datasets (Page 4, Section 5B))

Response to Arguments
Regarding 35 U.S.C. 103: 
Applicant’s argument: Page 2 of Remarks
Response: 
Applicant’s arguments have been fully considered but are not persuasive. The combination of Fan, Zhang, Travizano, Jia, and Li teaches these amended limitations. Please see pages 3-10 of this office action for a detailed analysis of independent claim 1. 

Allowable Subject Matter
Claims 3 – 6, 10 – 13, and 17 – 20 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHOUN ABRAHAM whose telephone number is (571)272-8144. The examiner can normally be reached Mon - Fri 08:00-16:30.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kamran Afshar, can be reached at (571) 272-7796. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.


/S.J.A./Examiner, Art Unit 2125
/BRIAN M SMITH/Primary Examiner, Art Unit 2122