DETAILED ACTION
This is the response to applicant’s amendment regarding application number 16/671,302, filed November 1, 2019.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendments
The amendment filed January 5, 2022 and corresponding supplemental amendment January 19, 2022 have been entered. The submission filed on January 19, 2022 is an incremental update to the earlier amendment filed January 5, 2022 (i.e., fix a typographical error in Claim 9), with all earlier annotations and amended claim language presented in the January 5, 2022 version reset to “Previously Presented” status. Applicant’s arguments in the submission filed on January 19, 2022 also reference back to the corresponding arguments filed on January 5, 2022. Hence, for purposes of examination, Applicant’s submission filed January 5, 2022 containing Remarks and the annotations and amendments provided in the Amendments to the Claims will be referenced hereafter (with the incorporation of the fix for the typographical error in Claim 9 from the January 19, 2022 submission). Examiner acknowledges receipt of Amendments to Application 16/671,302, which include: Amendments to the Claims, and Remarks containing Applicant’s amendments. 
Regarding Applicant’s Remarks, Examiner has acknowledged Claims 1-3 and 6-9 have been amended. Examiner has acknowledged original Claims 4-5 have been canceled, and new Claims 10-15 have been added. Claims 1-3 and 6-15 remain pending in the application. 
Regarding Applicant’s Remarks for Claims 1-9 under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, Examiner acknowledges Applicant has amended the claims to remove the respective 112(f) claim interpretation for “… plurality of learning units configured to …” identified in Claims 1-2 and 8 (which rendered the respective claims and their corresponding dependent claims as being indefinite under §112(b)), and therefore the respective §112(b) indefiniteness rejections previously set forth in the Non-Final Office Action mailed October 5, 2021 for those limitations are withdrawn. However, Examiner notes that Applicant has not addressed the indefiniteness issue identified in Claim 2 (“wherein the plurality of learners are configured to cause the plurality of model memories to store the same data of the learned decision tree”), and therefore the respective §112(b) rejection previously set forth in the Non-Final Office Action mailed October 5, 2021 for Claim 2 is still maintained. Examiner further notes that while Applicant has cancelled Claims 4 and 5 (thus resolving the respective §112(b) indefiniteness issues in those claims), certain aspects of now-cancelled Claim 4 are now present in amended Claim 1 (including the earlier indefiniteness issue identified in the recited claim limitation: “wherein the discriminating unit is configured to update a sample weight by obtaining a sum total of leaf weights of leaves to which the learning data stored in the data memory branch in the decision tree stored in the model memory”), and as such, the respective issue that is now present in amended Claim 1 is still maintained, and will be identified in the relevant section indicated below. 

Response to Arguments
Examiner acknowledges receipt of Arguments to Application 16/671,302, which include: Remarks containing Applicant’s arguments. 
Regarding Applicant’s Remarks for Claim Interpretation under 35 U.S.C. 112(f), Examiner notes Applicant has amended the claims to remove the 112(f) claim interpretation for the terms “… plurality of learning units configured to …”, “… discriminating unit configured to …”, “… performance calculator configured to …” identified in Claims 1-8, and hence the invocation of 112(f) claim interpretation for those terms will be withdrawn. 
While Examiner acknowledges that MPEP 2181(I)(A) does not explicitly state the term “managers” as being a generic placeholder term, MPEP 2181(I)(A) also states “Note that there is no fixed list of generic placeholders that always result in 35 U.S.C. 112(f) interpretation, and likewise there is no fixed list of words that always avoid 35 U.S.C. 112(f) interpretation.” Additionally, Examiner points out that MPEP 2181(II)(B) further provides an example of the term “manager” representing a generic placeholder term (in the context of “access control manager”), such that the term results in a “means for assigning” claim interpretation. Given that the MPEP does provide an example where the term “manager” is considered a generic placeholder term, Applicant’s argument that the term “managers” is not a generic placeholder term simply because MPEP 2181(I)(A) does not explicitly recite it as a generic placeholder (and hence should not be subject to 112(f) claim interpretation) is not persuasive, and hence the existing 112(f) claim interpretation for this particular term will be maintained.
Regarding Applicant’s Remarks for Claims 1-3 and 8-9 under 35 U.S.C. §103 as being unpatentable over Chen et al., XGBoost: A Scalable Tree Boosting System, June 10 2016 [hereafter referred as Chen] in view of Owaida et al., Scalable Inference of Decision Tree Ensembles: Flexible Design for CPU-FPGA Platforms, 2017 [hereafter referred as Owaida]; for Claim 4 under 35 U.S.C. 103 as being unpatentable over Chen in view of Owaida as applied to Claim 3; in further view of Nishiyama et al., U.S. PGPUB 2011/0178976, published 7/21/2011 [hereafter referred as Nishiyama]; for Claim 5 under 35 U.S.C. 103 as being unpatentable over Chen in view of Owaida, in further view of Nishiyama as applied to Claim 4; in even further view of Ke et al., LightGBM: A Highly Efficient Gradient Boosting Decision Tree, (NIPS 2017) [hereafter referred as Ke]; and for Claims 6 and 7 under 35 U.S.C. 103 as being unpatentable over Chen in view of Owaida, in further view of Nishiyama as applied to Claim 4; in even further view of Kamiya et al., WO2020090413, priority to JP2018-025795 filed 10/31/2018 [hereafter referred as Kamiya], Examiner acknowledges Applicant’s arguments, has considered them, and has found them to be not persuasive. Hence the existing 35 U.S.C. §103 rejections are still maintained, and the updated claim mappings according to the applicant’s amended claims are provided in the sections indicated below.
Regarding applicant’s Remarks on pp.7-8:
“In response, Applicant has amended claim 1 to include features from claim 5 and other features. Applicant respectfully submits that amended independent claims 1 and 9 recite novel features not taught or rendered obvious by the applied references. 
By way of background, independent claim 1 is directed to a learning device configured to perform learning of a decision tree by gradient boosting, the learning device including, inter alia: 
a plurality of learners configured to perform learning of the decision tree using learning data divided to be stored in a plurality of data memories; and 
a plurality of model memories each configured to store data of the decision tree learned by a corresponding one of the plurality of learners; a discriminator configured to read out each feature amount of the learning data from the data memory, and based on a branch condition for a node of the decision tree, the branch condition being derived based on the feature amount, discriminate a lower node to which the learning data read out from the data memory is to branch from the node; wherein the discriminator is configured to update a sample weight by obtaining a sum total of leaf weights of leaves to which the learning data stored in the data memory branch in the decision tree stored in the model memory; and 
wherein the discriminator is configured to update gradient information of the learning data based on a sum total of leaf weight.
Independent claim 9, although varying in claim scope and statutory class, recites substantially similar features as claim 1. Thus, the arguments presented below with respect to claim 1 are also applicable to independent claim 9. 
Turning now to the applied references, Applicant respectfully submits that the cited references fail to teach or suggest that "the discriminator is configured to update gradient information of the learning data based on a sum total of leaf weight," as recited in Applicant's claim 1.
Page 30 of the Office Action, in the rejection of former claim 5, acknowledges that Chen, Owaida, and Nishiyama fail to teach the above features. In an attempt to cure the above-noted deficiency, the Office Action cites Ke. Page 2, second paragraph of Ke merely states: 
While there is no native weight for data instance in GBDT, we notice that data instances with different gradients play different roles in the computation of information gain. In particular, according to the definition of information gain, those instances with larger gradients (i.e., under-trained instances) will contribute more to the information gain. 
Therefore, when down sampling the data instances, in order to retain the accuracy of information gain estimation, we should better keep those instances with large gradients (e.g., larger than a pre-defined threshold, or among the top percentiles), and only randomly drop those instances with small gradients. We prove that such a treatment can lead to a more accurate gain estimation than uniformly random sampling, with the same target sampling rate, especially when the value of information gain has a large range. 
This paragraph discusses large gradients, but fails to describe updating gradient information of the learning data based on a sum total of leaf weight, as recited in Applicant's claim 1.” 
Examiner has considered this argument, and finds the argument to be not persuasive. Examiner notes that the bulk of the Applicant’s arguments are directed to the newly added claim limitations not previously presented that are now recited in the respective independent claims, where these new claim limitations necessitate further examination and re-evaluation of the amended and related original claims. Furthermore, Examiner notes that Applicant’s argument is based on their own application of prior art from a limitation in the now-cancelled Claim 5 to a newly-added limitation in amended independent Claim 1 that was not previously presented. According to the Non-Final Office Action mailed October 5, 2021, the original claim limitation in Claim 5 (“wherein the discriminating unit is configured to update gradient information of the learning data corresponding to the updated sample weight based on the sample weight”) was identified as rendering the claim as being indefinite, as it is unclear whether the term “the updated sample weight based on the sample weight” was referencing the sum total of leaf weights, the updated gradient information of the learning data, or a different updated value, and hence for purposes of examination, it was interpreted according to the context of the prior art, in which the Ke reference was used to teach an algorithm that random-sampled instances with small gradients in order to amplify and re-calculate those gradients in order to determine an optimal information gain for node data splits, with the interpretation of the gradients as being “the updated sample weight based on the sample weight”. 
Examiner has noted that Applicant’s newly-added limitation in amended independent Claim 1 has clarified the indefiniteness issue by replacing the phrase “… corresponding to the updated sample weight based on the sample weight” with “… based on a sum total of leaf weight” (i.e., “wherein the discriminator is configured to update gradient information of the learning data based on a sum total of leaf weight”), thus making the claim limitation different from the alleged similar claim limitation in now-cancelled Claim 5, such that this newly-added claim limitation necessitates further examination and re-evaluation of the amended and related original claims from the Examiner.
Regarding applicant’s Remarks on p.8:
“Further, Chen describes parallelization of learning and increasing efficiency of data access using the cash memory, but relates to subject matter regarding algorithms or software programing using a CPU and does not describe increasing the speed of learning using hardware as in Applicant's disclosure. See Chen at p.8, col.1, 6th paragraph (Section 6.2 Dataset and Setup) and p.3, col.2, Section 3.1 Basic Exact Greedy Algorithm, 1st paragraph. 
Specifically, Chen does not disclose a method that divides learning data to be stored in different memories of a plurality of mechanisms, and calculates in parallel and integrates gradient information for each, to determine a node or a leaf that is an element of a decision tree model. 
Further, regarding the parallelization of the learning, Chen describes parallelization for each feature dimension, but Applicant's claimed learning device divides the data in a single feature dimension for parallelization, which is distinct from Chen. 
Owaida describes speedup of prediction processing of a gradient boosting decision tree using hardware. Owaida describes the feature that divides data for prediction processing to be stored in a plurality of memories. 
However, the claimed learning device performs not only prediction processing, but also learning processing of a gradient boosting decision tree. The learning processing (decision of a node, calculation of a leaf weight, update of gradient information of learning data) present in the learning device of claim 1 is not described in Owaida. 
Thus, Applicant respectfully submits that independent claims 1 and 9 (and all claims depending thereon) patentably distinguish over Chen, Owaida, Nishiyama, and Ke. Further, Applicant respectfully submits that Kamiya fails to cure the above-noted deficiencies of Chen, Owaida, Nishiyama, and Ke. 
Accordingly, Applicant respectfully requests that the rejections under 35 U.S.C. § 103(a) be withdrawn.”
Examiner has considered this argument, and finds the argument to be not persuasive. Examiner notes that Applicant acknowledges that the Chen reference describes parallelization for decision tree learning. Under its broadest reasonable interpretation, Applicant’s original independent claim limitations of “a plurality of learning units configured to perform learning of the decision tree using learning data divided to be stored in a plurality of data memories; and a plurality of model memories each configured to store data of the decision tree learned by a corresponding one of the plurality of learning units” broadly recite dividing learning data among a plurality of data memories, with a plurality of model memories storing the data of the learned decision tree. The “learning units” identified in the original independent claim limitation are interpreted as broadly reciting general processing elements that receive input data for decision tree learning (which is divided to be stored in a plurality of data memories) as well as perform processing of decision tree classification (where the data of the decision tree learned (i.e., learned classification data) is stored in a plurality of model memories), using a common hardware architecture to support both decision tree learning and classification functions. Examiner notes that Applicant’s latest amendments to the independent claims are also directed to this same concept, where the claim limitations reciting both “a plurality of learners” and “a discriminator” are directed to components that perform aspects of decision tree learning and classification processing. Examiner also notes that Applicant further acknowledges that the Owaida reference describes speedup of prediction processing (i.e., classification processing) using hardware. 
As indicated in the Non-Final Office Action mailed October 5, 2021, both the Chen and Owaida references teach hardware using gradient boosting decision trees, where the combination of the Chen and Owaida references teaches decision tree learning and decision tree classification steps on hardware containing a plurality of processing elements and a plurality of memories. The motivation to combine is taught in Owaida, since a hybrid CPU-FPGA architecture allows the flexibility to support large and deep ensemble decision trees, as well as providing the necessary datapath support to accelerate and parallelize classification processing, which in turn provides improved performance and speedup over pure-CPU approaches (Owaida col.2 2nd-6th paragraphs (Section I. Introduction)). Hence, based on the above reasons and the provided motivation to combine, Applicant’s prior art argument is not persuasive, and the prior art §103 rejection is maintained. 

Claim Objections
Claims 1, 14 and 15 are objected to because of the following informalities: 
Claim 1: The limitation “wherein the discriminator is configured to update a sample weight by obtaining a sum total of leaf weights of leaves to which the learning data stored in the data memory branch in the decision tree stored in the model memory” contains the phrase “the data memory branch in the decision tree”, which was originally identified in the Non-Final Office Action mailed October 5, 2021 as being unclear when it was originally present in now-cancelled Claim 4, since that phrase (“the data memory branch in the decision tree”) is not a term of art, and the applicant’s specification fails to disclose any description for this term. The Applicant also neither addressed nor resolved this issue in their January 5, 2022 and January 19, 2022 amendment submissions, when this same phrase was originally present in now-cancelled Claim 4 (and now reflected in amended independent Claim 1). Hence, Examiner proposes that this limitation should be corrected as “wherein the discriminator is configured to update a sample weight by obtaining a sum total of leaf weights of leaves to which the learning data stored indicating a relationship between the learning data and the node/leaf data that is stored and branched for the decision tree. Appropriate correction is required.
Claim 14: A typographical error in the following claim limitation, where the term “each of the manager” should be corrected as “each of the manager[s]” as follows: “the learning device further comprises a plurality of managers each corresponding to one of the plurality of learners, each of the manager[s] calculating a third address related to a storage destination of the learning data corresponding to a second node as a next node of the first node using the first address and the second address output from the learner.” Appropriate correction is required.
Claim 15: The limitation “wherein the data memory is configured to store the training data divided within the features” should be corrected as “wherein the data memory is configured to store the training data divided within a plurality of features” to resolve the antecedent issue. Appropriate correction is required.

Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.  The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 
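Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. 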
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier.  Such claim limitations are:
Claim 8: “a plurality of managers … each of the managers being configured to calculate …”
Because these claim limitations are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, they are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may:  (1) amend the claim limitation(s) to avoid them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 2 and 10 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Regarding amended Claim 2 and new Claim 10, 
Both claims recite the limitation "wherein the plurality of learners are configured to cause the plurality of model memories to store the same data of the learned decision tree", which renders the claims indefinite, as it is unclear whether the term “the same data of the learned decision tree” refers to the existing learned data of the decision tree stored in the plurality of model memories recited in the respective independent claims, or whether it refers to the storing of new data (that is being learned) into the model memories. For the purposes of examination, this claim limitation will be interpreted as broadly reciting that the processing of each decision tree is duplicated and performed in a pipeline-parallel fashion (i.e., each model memory stores the learned data that results from processing a particular (same) tree, as opposed to processing different trees).

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Claims 1-3 and 8 are rejected under 35 U.S.C. 103 as being unpatentable over Chen et al., XGBoost: A Scalable Tree Boosting System, June 10 2016 [hereafter referred as Chen] in view of Owaida et al., Scalable Inference of Decision Tree Ensembles: Flexible Design for CPU-FPGA Platforms, 2017 [hereafter referred as Owaida], in further view of Nishiyama et al., U.S. PGPUB 2011/0178976, published 7/21/2011 [hereafter referred as Nishiyama].
Regarding amended Claim 1, 
Chen teaches
(Currently Amended) A learning device configured to perform learning of a decision tree by gradient boosting, the learning device comprising: 
a plurality of learners configured to perform learning of the decision tree using learning data divided to be stored in a plurality of … memories (Examiner’s note: Under its broadest reasonable interpretation, the term “a plurality of learners” broadly recites a plurality of components performing steps related to decision tree learning, and the term “learning data” is interpreted as encompassing aspects of the input data instances (containing feature information) as well as data stored in the decision tree nodes. Chen teaches performing learning of a gradient boosting decision tree using a plurality of processor cores, with the input data stored in memory blocks for performing a greedy algorithm for finding split point values (Chen p.3 Algorithm 1), where this algorithm corresponds to decision tree learning as it finds the best split points for the decision tree (Chen p.3 col.1 4th paragraph (Section 2.2 Gradient Tree Boosting) and p.3 col.2 Section 3.1 Basic Exact Greedy Algorithm 1st paragraph). Chen teaches the input data is a memory block that can be further sub-sampled into subsets of columns in a block, with each column representing a different piece of memory (where this input data represents an aspect of “learning data” being divided into subsets of columns representing “learning data divided …”). 
Chen further teaches the gradient statistics associated with decision tree nodes (another aspect of “learning data”) must be pre-fetched into an internal buffer structure associated with each thread (associated with each CPU processor core) that fits the cache size to prevent unnecessary CPU cache misses, where this internal buffer for each thread (associated with each CPU processor core) corresponds to “learning data divided to be stored in a plurality of … memories” (Chen p.8 col.1 6th paragraph (Section 6.2 Dataset and Setup): “…All the single machine experiments are conducted on a Dell PowerEdge R420 with two eight-core Intel Xeon (E5-2470) (2.3GHz) and 64 GB of memory. … all the experiments are run using all the available cores in the machine.”; p.5 col.2 Section 4.1 Column Block for Parallel Learning: “The most time consuming part of tree learning is to get the data into sorted order. In order to reduce the cost of sorting, we propose to store the data in in-memory units, which we called block. Data in each block is stored in the compressed column (CSC) format, with each column sorted by the corresponding feature value. … we store the entire dataset in a single block and run the split search algorithm by linearly scanning over the pre-sorted entries. …. Collecting statistics for each column can be parallelized, giving us a parallel algorithm for split finding. … the column block structure also support column subsampling, as it is easy to select a subset of columns in a block.”; and p.6 col.1 Section 4.2 Cache-aware Access 1st-2nd paragraphs: “… the new algorithm requires indirect fetches of gradient statistics by row index, since these values are accessed in order of feature. This is a non-continuous memory access. … This slows down split finding when the gradient statistics do not fit into a CPU cache and cache misses occur. … we can alleviate the problem by a cache-aware prefetching algorithm. 
Specifically, we allocate an internal buffer in each thread, fetch the gradient statistics into it, and then perform accumulation in a mini-batch manner.”).); and 
a plurality of … memories each configured to store data of the decision tree learned by a corresponding one of the plurality of learners (Examiner’s note: Under its broadest reasonable interpretation, the term “data of the decision tree learned” is interpreted as referring to the data being learned at each node, which includes the gradient values and the split point value for each node. Chen teaches computing scores for each leaf node using the greedy algorithm (Chen p.3 Algorithm 1), which involves computing intermediate gradient values $g_j$, $h_j$, and accumulated gradients $G_L$, $G_R$, $H_L$, $H_R$ (representing gradient statistics) to determine the best split point values, with these gradient values used to compute scores for the leaf nodes (Chen p.3 Figure 2 and p.3 Eq. 6 and 7), where these gradient statistics and associated scores of the leaf nodes and split point values represent “data of the decision tree learned by corresponding one of the plurality of learning units”. As indicated earlier, Chen teaches storing these gradients and split point values in an internal buffer structure associated with each thread, with the storing of this gradient value and the split point value for each node in an internal buffer structure associated with each thread (CPU processor core) representing “a plurality of … memories each configured to store data of the decision tree learned by corresponding one of the plurality of learning units” (Chen p.6 col.1 Section 4.2 Cache-aware Access 1st-2nd paragraphs).) …  
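For illustration only, the use of the accumulated gradients in the greedy split search cited above (Chen p.3 Algorithm 1) may be sketched for a single feature as follows. This is a minimal sketch under stated assumptions: the function name `best_split`, the default value of `lam` (standing in for the regularization term λ), and the choice of the midpoint between adjacent feature values as the reported threshold are introduced here and are not taken from Chen.

```python
def best_split(values, g, h, lam=1.0):
    """Scan one feature in sorted order; return (best_score, best_threshold).

    values: feature value per instance.
    g, h: first- and second-order gradient statistics per instance.
    lam: regularization term (lambda); the default is an assumption.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    G = sum(g)
    H = sum(h)
    G_L = 0.0
    H_L = 0.0
    best_score, best_threshold = 0.0, None
    for pos, i in enumerate(order[:-1]):
        # Accumulate the statistics of the instances sent to the left child.
        G_L += g[i]
        H_L += h[i]
        nxt = order[pos + 1]
        if values[i] == values[nxt]:
            continue  # no split point between equal feature values
        # The right-child statistics follow by subtraction from the totals.
        G_R = G - G_L
        H_R = H - H_L
        score = (G_L ** 2 / (H_L + lam)
                 + G_R ** 2 / (H_R + lam)
                 - G ** 2 / (H + lam))
        if score > best_score:
            best_score = score
            # Midpoint threshold is an illustrative assumption.
            best_threshold = (values[i] + values[nxt]) / 2
    return best_score, best_threshold
```

Only the running sums for the left child need to be stored during the scan, which is why the accumulated gradients are the natural quantities to buffer.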
a discriminator … wherein the discriminator is configured to update a sample weight by obtaining a sum total of leaf weights of leaves (Chen p.2 Figure 1: Examiner’s note: Under its broadest reasonable interpretation, the term “a discriminator” broadly recites a component that performs a set of steps related to decision tree processing (classification). Chen teaches operations performed for each of the n examples in a data set D that involve maintaining a continuous score on each leaf (i.e., its weight) for each tree in the tree ensemble, and performing a final prediction by summing up the scores in the corresponding leaves, where each example is classified into the leaves using the decision rules in the trees and the final prediction is obtained through the summation of the scores in the corresponding leaves given by the leaf weights $w_i$ (such that this summation of the scores of the corresponding leaves represents a process to “update a sample weight by obtaining a sum total of leaf weights of leaves …”). This summation represents a mathematical concept that can be performed in software, through traversal of the nodes present in the tree (Chen p.2 col.1 last paragraph-col.2 1st paragraph (Section 2.1 Regularized Learning Objective): “ … a tree ensemble model (shown in Fig.1) uses K additive functions to predict the output. … Here q represents the structure of each tree that maps an example to the corresponding leaf index. T is the number of leaves in the tree. Each $f_k$ corresponds to an independent tree structure q and leaf weights w. Unlike decision trees, each regression tree contains a continuous score on each of the leaf, we use $w_i$ to represent score on i-th leaf. For a given example, we will use the decision rules in the trees (given by q) to classify it into the leaves and calculate the final prediction by summing up the score in the corresponding leaves (given by w).”).) …
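For illustration only, the summation of leaf weights described in the quoted passage of Chen Section 2.1 may be sketched as follows. This is a minimal sketch: the nested-dictionary tree layout, the function names `leaf_weight` and `predict`, and the toy trees and values are hypothetical and do not appear in Chen.

```python
def leaf_weight(tree, x):
    """Follow the decision rules of one tree and return the reached leaf's weight."""
    node = tree
    while "weight" not in node:
        # Internal node: branch on one feature against a threshold.
        if x[node["feature"]] < node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["weight"]


def predict(trees, x):
    """Final prediction: the sum of the reached leaf weights over all trees."""
    return sum(leaf_weight(tree, x) for tree in trees)


# Two hypothetical trees over a two-feature example x = [x0, x1].
tree_a = {"feature": 0, "threshold": 15.0,
          "left": {"weight": 2.0}, "right": {"weight": -1.0}}
tree_b = {"feature": 1, "threshold": 0.5,
          "left": {"weight": -0.5}, "right": {"weight": 0.5}}

print(predict([tree_a, tree_b], [10.0, 1.0]))  # 2.0 + 0.5 = 2.5
print(predict([tree_a, tree_b], [40.0, 0.0]))  # -1.0 + -0.5 = -1.5
```

Each tree contributes exactly one leaf weight per example, so the prediction is a sum of K terms for an ensemble of K trees.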
… wherein the discriminator is configured to update gradient information of the learning data based on a sum total of leaf weight (Examiner’s note: Under its broadest reasonable interpretation, the term “gradient information of the learning data based on a sum total of leaf weight” is directed towards performing an XGBoost greedy algorithm determination. Chen teaches that the decision tree ensemble model defined in Chen p.2 Eq.(2) needs to be trained in an additive manner in order to minimize the objective function, which is done by greedily adding instances associated with tree nodes, resulting in the objective function of finding an optimal split point value calculated through the calculation of gradients and weights shown in Chen p.3 Eq.(7). These summed gradients shown in Chen Figure 2 represent parameters for each leaf node in the decision tree, and the relationship between gradients and weight values is shown in Chen p.3 Eq.(5), with the gradients $g_i$ and $h_i$ shown in Chen Figure 2 and Chen p.3 Eqs.(5) and (7) representing “gradient information of the learning data”. A person having ordinary skill in the art can take Chen p.3 Eq.(5) and re-write its equivalents such that the gradients are computed in terms of a weight at a j-th node, thereby corresponding to “update gradient information of the learning data … based on a sum total of leaf weight”, where the sum total of leaf weight represents the continuous score on an i-th leaf (Chen p.2 col.1 last paragraph-col.2 1st paragraph (Section 2.1 Regularized Learning Objective); p.2 col.2 Section 2.2 Gradient Tree Boosting: “The tree ensemble model in Eq.(2) includes functions as parameters and cannot be optimized using traditional optimization methods … Instead, the model is trained in an additive manner. … Formally, let $\hat{y}_i^{(t)}$ be the prediction of the i-th instance at the t-th iteration, we will need to add $f_t$ to minimize the following objective. … This means we greedily add the $f_t$ that most improves our model … Define $I_j = \{i \mid q(x_i) = j\}$ as the instance set of leaf j. … For a fixed structure q(x), we can compute the optimal weight $w_j^* = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda}$ (5), and calculate the corresponding optimal value by $\tilde{L}^{(t)}(q) = -\frac{1}{2} \sum_{j=1}^{T} \frac{\left(\sum_{i \in I_j} g_i\right)^2}{\sum_{i \in I_j} h_i + \lambda} + \gamma T$ (6). … A greedy algorithm that starts from a single leaf and iteratively adds branches to the tree is used instead. Assume that $I_L$ and $I_R$ are the instance sets of left and right nodes after the split. Letting $I = I_L \cup I_R$, then the loss reduction after the split is given by
                            
                                
                                    L
                                
                                
                                    s
                                    p
                                    l
                                    i
                                    t
                                
                            
                            =
                            
                                
                                    1
                                
                                
                                    2
                                
                            
                            
                                
                                    
                                        
                                            
                                                
                                                    
                                                        
                                                            (
                                                            
                                                                
                                                                    ∑
                                                                    
                                                                        i
                                                                        ∈
                                                                        
                                                                            
                                                                                I
                                                                            
                                                                            
                                                                                L
                                                                            
                                                                        
                                                                    
                                                                
                                                                
                                                                    
                                                                        
                                                                            g
                                                                        
                                                                        
                                                                            i
                                                                        
                                                                    
                                                                
                                                            
                                                        
                                                        
                                                             
                                                        
                                                    
                                                    )
                                                
                                                
                                                    2
                                                
                                            
                                        
                                        
                                            
                                                
                                                    ∑
                                                    
                                                        i
                                                        ∈
                                                        
                                                            
                                                                I
                                                            
                                                            
                                                                L
                                                            
                                                        
                                                    
                                                
                                                
                                                    
                                                        
                                                            h
                                                        
                                                        
                                                            i
                                                        
                                                    
                                                
                                            
                                            +
                                            λ
                                        
                                    
                                    +
                                    
                                        
                                            
                                                
                                                    
                                                        
                                                            (
                                                            
                                                                
                                                                    ∑
                                                                    
                                                                        i
                                                                        ∈
                                                                        
                                                                            
                                                                                I
                                                                            
                                                                            
                                                                                R
                                                                            
                                                                        
                                                                    
                                                                
                                                                
                                                                    
                                                                        
                                                                            g
                                                                        
                                                                        
                                                                            i
                                                                        
                                                                    
                                                                
                                                            
)^2 / (\sum_{i \in I_R} h_i + \lambda) - (\sum_{i \in I} g_i)^2 / (\sum_{i \in I} h_i + \lambda) ] - \gamma   (7).”).).
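For illustration only, the split-gain expression of Chen Equation (7) quoted above can be sketched in a few lines. The function name `split_gain`, its argument names, and the default values for λ (`lam`) and γ (`gamma`) are assumptions made for this sketch and do not appear in Chen.

```python
def split_gain(g_left, h_left, g_right, h_right, lam=1.0, gamma=0.0):
    """Loss reduction of a candidate split, following the form of Chen Equation (7).

    g_* and h_* are the first- and second-order gradient statistics of the
    instances falling into the left (I_L) and right (I_R) child. The parent
    instance set I is taken to be the union of I_L and I_R.
    """
    def score(g, h):
        # (sum of g_i)^2 / (sum of h_i + lambda)
        return sum(g) ** 2 / (sum(h) + lam)

    g_all = list(g_left) + list(g_right)
    h_all = list(h_left) + list(h_right)
    return 0.5 * (score(g_left, h_left)
                  + score(g_right, h_right)
                  - score(g_all, h_all)) - gamma


# Hypothetical gradient statistics for three instances.
gain = split_gain([1.0, 1.0], [1.0, 1.0], [-2.0], [1.0])
```

A positive value indicates that the candidate split lowers the regularized loss relative to leaving the node unsplit.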
While Chen teaches a plurality of learners performing learning in a tree ensemble, where each node in the tree contains feature information used to determine split points and the associated gradients and continuous scores (weights) for each leaf, Chen does not explicitly teach
… learning data divided to be stored in a plurality of data memories; …
… a plurality of model memories … to store data of the decision tree learned …  
… a discriminator configured to read out each feature amount of the learning data from the data memory, and 
based on a branch condition for a node of the decision tree, the branch condition being derived based on the feature amount, discriminate a lower node to which the learning data read out from the data memory is to branch from the node …
Owaida teaches
… learning data divided to be stored in a plurality of data memories (Examiner’s note: Owaida teaches a hybrid CPU-FPGA architecture for classifying gradient boosted decision trees, where the FPGA architecture contains a plurality of compute-units (each compute-unit represented by a processor core, Owaida p.1 col.2 3rd paragraph: “… we introduce a hybrid classification engine for XGBoost on a CPU+FPGA shared memory platform (Intel’s Xeon+FPGA platform [14]).”), with the compute-units associated with a plurality of decision-tree processing elements (DT-PEs). The classification processing is managed by a software driver running on the compute-units (Owaida p.2 col.2 2nd paragraph: “The classifier’s FPGA architecture is accompanied with a software driver on the CPU side.”). Owaida teaches each DT-PE contains a data memory which stores incoming input examples (corresponding to one aspect of “learning data”) with a fixed capacity (as each DT-PE contains a read and write port to allow pre-fetching of next data examples while the current data examples are processed). Owaida further teaches the fixed capacity of the data memory indicates that a large number of data examples cannot be … learning data divided to be stored in a plurality of data memories” (Owaida p.3 Figure 2 (Left and Middle); p.2 col.2 Section III. Classifier Engine Overview 5th paragraph: “The classifier FPGA architecture consists of 8 Compute Units, each includes 8 decision tree processing elements (DT-PE) as Figure 2 shows. A DT-PE unit is programmed dynamically to evaluate one or more decision trees per data example. A DT-PE unit outputs a leaf value for each tree evaluated. … Multiple Compute Units are combined to process larger tree ensembles. The actual number is determined by the software driver using the tree ensemble’s size. 
… The Scheduler executes the software driver’s decision on how to distribute the ensemble trees to the Compute Units and how to parallelize the processing of different data examples.”; and p.3 col.2 2nd paragraph (Section IV.A. DT-PE Memory Layout): “The data memory stores incoming data examples and has a capacity of 4096 features (floating point). The data memory has one write and one read port, allowing prefetching the next data examples while available data examples are being processed.”).); …
… a plurality of model memories … to store data of the decision tree learned (Examiner’s note: Owaida teaches each DT-PE contains data memory and tree memory, where the tree memory stores the node information in the decision and leaf nodes (representing “data of the decision tree learned…”) for a single tree or multiple trees. Owaida further teaches associated shared memory containing request and the response result for the decision tree, where this shared memory is controlled and monitored by the I/O unit that manages the data transfers between compute-units and DT-PE units. The tree memory, data memory, and the shared memory correspond to “model memory” for the decision tree, and hence the plurality of DT-PEs containing a plurality of tree and data memories (along with this shared memory) represent “a plurality of model memories … to store data of the decision tree learned …” (Owaida p.2 col.2 Section III. Classifier Engine Overview 4th-6th paragraphs: “… the driver composes a classification request and writes it to a designated shared memory location monitored by the classifier I/O Unit on the FPGA. … The classifier FPGA architecture consists of 8 Compute Units, each includes 8 decision tree processing elements (DT-PE) … A DT-PE unit outputs a leaf value for each tree evaluated. The Reducer unit is a tree of floating point adders. It sums up the leaf values from all 8 DT-PE units … The Combiner consists of a single floating point adder and iterative accumulates partial results generated by the Compute Units to produce the final result per data example. … the I/O unit loads the tree ensemble and stores it in the Compute Units’ local memory. Then, it reads the data examples and writes back the classification results.”; p.3 Figure 2, and p.3 col.1 3rd paragraph – col.2 2nd paragraph (Section IV.A. DT-PE Memory Layout): “The architecture of the DT-PE unit is depicted in Figure 2. 
The DT-PE unit consists of two types of components: local memories store the tree ensemble and the data example’s features, and a datapath evaluates a tree node for input data examples. … There are two types of local memories in the DT-PE unit: tree memory, and data memory. The tree memory either stores one big tree up to 8192 nodes (decision and leaf nodes), or multiple trees sharing equally overall memory capacity. The tree nodes are stored as a one dimensional array (Figure 2). The storage scheme assumes a full binary tree with no missing nodes and every node stored at a dedicated location. … The data memory stores the incoming data examples and has a capacity of 4096 features (floating point).”).) …
… a discriminator configured to read out each feature amount of the learning data from the data memory (Examiner’s note: Under its broadest reasonable interpretation, the term “a discriminator” broadly recites a component that performs a set of steps related to decision tree processing (classification). Owaida teaches a DT-PE that reads a feature from an example stored in the data memory (as shown in the block labeled ‘Read Feature’ in Owaida p.3 Figure 2 (Middle), thus representing a component to “read out each feature amount of the learning data from the data memory”). The scheduler for the software driver executing on the compute-units coordinates the task of parallelizing the processing of different examples (Owaida p.2 col.2 Section III. Classifier Engine Overview 5th paragraph), such that each compute-unit/DT-PE pair performs the functionality of a discriminator (Owaida p.3 Figure 2 (Middle); and p.3 col.2 Section IV.B. DT-PE Datapath: “The DT-PE’s datapath pipeline consists of four operations: … reading the corresponding data example feature from the data memory …”).), and 
based on a branch condition for a node of the decision tree, the branch condition being derived based on the feature amount, discriminate a lower node to which the learning data read out from the data memory is to branch from the node (Examiner’s note: As indicated earlier, Owaida teaches each compute-unit/DT-PE pair performs the functionality of a discriminator through the reading out of feature information from a data memory (Owaida p.3 Figure 2 (Middle); and p.2 col.2 Section III. Classifier Engine Overview 5th paragraph). Owaida teaches the tree node representing a decision node in a decision tree and containing criteria for choosing a left or right child node in the next level (as shown in the block labeled ‘Read Tree Node’ in Owaida p.3 Figure 2 (Middle)), where the criteria stored in the tree node represent thresholds (Owaida p.3 col.1 Section IV.A. DT-PE Memory Layout, 1st paragraph). Owaida teaches a comparison is done with the feature read from the data memory, in order to choose the proper child node address in the next level or to read the value of the node if it is a leaf node (as shown in the blocks labeled ‘Evaluate’ and ‘Read Leaf/Compute next node pointer’ in Owaida p.3 Figure 2 (Middle)). This comparison done with the threshold and the feature to select the child node represents a branch condition, and the result of the comparison (to choose either a left or right child node for a non-leaf node) represents the act of performing a classification to choose the proper child node based on the comparison result (Owaida p.3 Figure 2 (Middle); p.3 col.2 Section IV.B. DT-PE Datapath: “The DT-PE’s datapath pipeline consists of four operations: reading a tree node from the tree memory; … comparing the tree node threshold to the feature values; and either computing the next decision node pointer or reading the leaf node.”; p.2 col.1 Section II.B. 
Decision Tree 1st paragraph: “Each non-leaf node is called a decision node and each leaf node is called an end node. Each decision node contains criteria for choosing either the left or right node in the next level, and each end node contains the classification or regression result (i.e., label). During inference, an example traverses from the root to an end node according to the criteria of decision nodes.”; and p.3 col.1 Section IV.A. DT-PE Memory Layout, 1st paragraph: “… Such a storage layout allows the calculation of the child node pointer using the parent pointer as follows: child_pointer = (parent_pointer << 1)+1+GO_RIGHT where GO_RIGHT either equals 1 or 0, based on the comparison result of the parent node threshold and the corresponding feature values.”).) …  
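For illustration only, the four datapath operations and the child-pointer formula quoted from Owaida can be sketched as follows. The `Node` type, the function name `evaluate_tree`, and the convention that GO_RIGHT equals 1 when the feature value exceeds the threshold are assumptions made for this sketch; Owaida states only that GO_RIGHT depends on the comparison result.

```python
from typing import NamedTuple, Optional, Sequence


class Node(NamedTuple):
    """One slot of the one-dimensional tree array (hypothetical layout)."""
    is_leaf: bool
    feature: int = 0        # index into the data example (decision nodes)
    threshold: float = 0.0  # comparison threshold (decision nodes)
    value: float = 0.0      # leaf value (leaf nodes)


def evaluate_tree(tree: Sequence[Optional[Node]], example: Sequence[float]) -> float:
    """Walk a full binary tree stored as an array, one level per iteration."""
    pointer = 0
    while True:
        node = tree[pointer]                      # 1. read tree node
        if node.is_leaf:
            return node.value                     # 4b. read the leaf node
        x = example[node.feature]                 # 2. read the example feature
        go_right = 1 if x > node.threshold else 0 # 3. compare (assumed direction)
        pointer = (pointer << 1) + 1 + go_right   # 4a. next decision node pointer


# Slots 5 and 6 stay empty because their parent (slot 2) is a leaf.
tree = [
    Node(False, feature=0, threshold=0.5),
    Node(False, feature=1, threshold=1.0),
    Node(True, value=3.0),
    Node(True, value=1.0),
    Node(True, value=2.0),
    None,
    None,
]
result = evaluate_tree(tree, [0.2, 1.5])
```

The computed pointer is fed back into the first step on each iteration, which mirrors the feedback of the next decision node pointer described in the quoted passage.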
Both Chen and Owaida are analogous art since they teach hardware using gradient boosting decision trees.
(Owaida p.1 col.2 2nd-6th paragraphs (Section I. Introduction): “… Hybrid processing of tree ensembles using both the CPU and FPGA is a potential approach to scale to very large ensembles and deep trees. It also provides flexibility to adapt to changing structure of the ensembles. … Based on this, in this paper we introduce a hybrid classification engine for XGBoost on a CPU+FPGA shared memory platform (Intel’s Xeon+FPGA platform [14]). … In developing the hybrid classification engine we have three objectives: 1) scalability to large tree ensembles with deep trees through hybrid execution on CPU and FPGA; 2) Programmability at run time with tree ensembles of different sizes; 3) Efficient FPGA resources management to achieve maximum performance. … Experimental evaluation demonstrates that the designed classifier delivers up to 20x speedup over 10-threaded CPU only implementation when processing the complete tree ensemble on the FPGA. For tree ensembles that do not fit in the FPGA’s on-chip memory, the hybrid CPU+FPGA processing delivers an order of magnitude speedup over 10-threaded CPU only implementation.”).
While Chen in view of Owaida teaches storage of decision and leaf nodes containing information such as gradients and split point values, Chen in view of Owaida does not explicitly teach
… to which the learning data stored 
Nishiyama teaches
… to which the learning data stored (Examiner’s note: Nishiyama teaches an example of a leaf node structure (Nishiyama Figure 7) containing a flag byte indicating that the node is a leaf node and storage locations for leaf values, with one 4-byte area containing a score (corresponding to a “leaf weight”) for a positive class indication and the other 4-byte area containing a score for a negative class indication. Nishiyama further teaches that the stored score for a leaf is based on a probability of movement to its child nodes, which is interpreted as determining splits for the decision tree (Nishiyama [0054]-[0056]). Hence, this data structure for the leaf node for storing the score and leaf/not leaf indication represents a relationship “… to which the learning data stored and branched in the decision tree” (Nishiyama Figure 7 and [0031]: “FIG. 7 shows a view illustrating a data structure of the end node (the leaf node). In FIG.7, the end node includes a flag f which represents the node is the end node, S1 indicating a probability score that the inputted data that has reached the end node includes a vehicle (for this example), and s2 indicating a probability score that the inputted data that has reached the end node does not include a vehicle. The data of the end node needs 9 bytes, of the total 1 byte is the flag f, 4 bytes is s1, and 4 bytes is s2. The probability s1 and s2 each needs 4 bytes so as to represent the number of decimal places.”).).  
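For illustration only, the 9-byte end-node layout quoted from Nishiyama [0031] (a 1-byte flag f followed by two 4-byte scores s1 and s2) can be sketched with Python's standard `struct` module. The little-endian byte order, the field order, the use of 32-bit floats for the scores, and the names `LEAF_FORMAT`, `pack_leaf`, and `unpack_leaf` are assumptions made for this sketch.

```python
import struct

# Assumed layout: "<" disables padding, B = 1-byte flag, f = 4-byte float each.
LEAF_FORMAT = "<Bff"


def pack_leaf(s1: float, s2: float, flag: int = 1) -> bytes:
    """Serialize an end node as flag f, positive score s1, negative score s2."""
    return struct.pack(LEAF_FORMAT, flag, s1, s2)


def unpack_leaf(buf: bytes):
    """Recover (flag, s1, s2) from a 9-byte end-node record."""
    return struct.unpack(LEAF_FORMAT, buf)


record = pack_leaf(0.75, 0.25)
```

With this format the record length is 1 + 4 + 4 = 9 bytes, matching the size stated in the quoted paragraph.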
Both Chen in view of Owaida and Nishiyama are analogous art since they teach performing decision tree classification using data stored in the decision tree nodes.
It would have been obvious to a person having ordinary skill in the art before the effective filing date to take the leaf scores (generated by the gradients) taught in Chen in view of Owaida and store them in the leaf node data structure taught in Nishiyama as a way to provide localized access to the score information for the leaf nodes. The motivation to combine is taught in Nishiyama, as a way to store values in local memory associated with the discriminating unit (implemented by a processing element) without constantly accessing global memory, thereby reducing the number of memory accesses and improving the overall run-time of the system (Nishiyama [0005]: “In a discriminating process using the decision tree, transitions between nodes occur frequently. Thereby it becomes necessary to access an address where each node is held in the memory, which tends to cause accessing addresses being away from each other. Such memory accesses lead to a decrease in a cache hit ratio and cause slowdowns of processes.”).
Regarding amended Claim 2, 
Chen in view of Owaida, in further view of Nishiyama teaches
The learning device according to claim 1, wherein the plurality of learners are configured to 
cause the plurality of model memories to store the same data of the learned decision tree (Examiner’s note: Under its broadest reasonable interpretation, this claim limitation exhibits a 112(b) indefiniteness issue, and hence for purposes of examination, this limitation is interpreted as broadly reciting that each decision tree processing is duplicated and processed in a pipeline parallelism fashion (i.e., each model memory stores the learned data that is from the processing of a particular (same) tree vs. processing different trees). Owaida teaches each DT-PE stores the node information for the decision and leaf nodes. The decision and leaf nodes contain information such as the gradients and split point values taught in Chen p.3 Algorithm 1 and p.6 col.2 2nd paragraph (Section 4.2 Cache-aware Access), and the split point values represent the branch condition criteria for performing classification (Owaida p.2 Figure 1 and p.2 col.1 Section II.B. Decision Tree 1st paragraph: “Each non-leaf node is called a decision node and each leaf node is called an end node. Each decision node contains criteria for choosing either the left or right node in the next level, and each end node contains the classification or regression result (i.e., label).”). Collectively, the information stored in the nodes corresponds to “data of the decision tree learned…” held in tree memory for a single tree or multiple trees, where the tree memory is identified earlier as being one part of “a model memory”, and a plurality of DT-PEs will correspond to “a plurality of model memories”. 
Owaida teaches mapping of the decision trees such that different trees are not spread across all compute-units/DT-PEs, and instead are cloned across the number of compute-units, resulting in the plurality of tree memories containing the same decision tree information during classification processing, thus representing “wherein the plurality of learning units are configured to cause the plurality of model memories to store the same data of the learned decision tree” (Owaida p.3 col.1 Section IV. Decision Tree Processing Element (DT-PE), 1st-2nd paragraphs: “The architecture of the DT-PE unit is depicted in Figure 2. The DT-PE unit consists of two types of components: local memories store the tree ensemble and the data example’s features, and a datapath evaluates a tree node for input data examples. … There are two types of local memories in the DT-PE unit: tree memory, and data memory. The tree memory either stores one big tree up to 8192 nodes (decision and leaf nodes), or multiple trees sharing equally overall memory capacity. The tree nodes are stored as a one dimensional array (Figure 2). The storage scheme assumes a full binary tree with no missing nodes and every node stored at a dedicated location.”; and p.4 col.1 Section V.A. Mapping Trees on FPGA Memory: “The classifier software driver is designed to maximize the processing throughput in data examples per second. … Since the Combiner does not parallelize the aggregation of partial results, the driver avoids spreading the tree ensemble across all Compute Units and exploits pipeline parallelism in the datapath. We try to pack and fit the whole tree ensemble in 1, 2, 4, or 8 Compute Units. We select these numbers so we can have multiple clones of the tree ensemble, each occupying the same number of Compute Units. Multiple clones of the tree ensemble can then be used to parallelize the processing of different data examples.”).).  
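For illustration only, the cloning scheme quoted from Owaida Section V.A (packing the whole ensemble into 1, 2, 4, or 8 Compute Units and replicating it across the remaining units) can be sketched as follows. The figures of 8 Compute Units, 8 DT-PEs per unit, and 8192 node slots per tree memory come from the quoted passages; the function name `plan_clones` and the simplification that capacity is measured purely in total node slots are assumptions made for this sketch.

```python
TOTAL_COMPUTE_UNITS = 8   # per the quoted Owaida Section III
DT_PES_PER_UNIT = 8       # per the quoted Owaida Section III
NODES_PER_DT_PE = 8192    # tree memory capacity per the quoted Section IV.A


def plan_clones(ensemble_nodes: int):
    """Return (units_per_clone, number_of_clones), or None if nothing fits.

    Simplifying assumption: an ensemble fits whenever its total node-slot
    count does not exceed the combined tree memory of the chosen units.
    """
    nodes_per_unit = DT_PES_PER_UNIT * NODES_PER_DT_PE
    for units in (1, 2, 4, 8):
        if ensemble_nodes <= units * nodes_per_unit:
            return units, TOTAL_COMPUTE_UNITS // units
    return None  # ensemble exceeds on-chip memory in this simplified model


plan = plan_clones(100_000)
```

Under these assumptions, every clone holds the same tree data, and each clone can process a different data example in parallel.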
Regarding amended Claim 3, 
Chen in view of Owaida, in further view of Nishiyama teaches
The learning device according to claim 1, wherein each of the plurality of learners comprises: 
a data memory of the plurality of model memories configured to store the learning data (Examiner’s note: Owaida teaches each DT-PE contains data memory and tree memory, where the tree memory stores the node information (decision and leaf nodes, thus corresponding to “data of the decision tree learned…”) for a single tree or multiple trees, and the data memory stores the incoming input examples. Both the tree memory and data memory correspond to “model memory” for the decision tree, and both memories store different aspects of “learning data”. Hence, the arrangement of the plurality of tree and data memories (corresponding to “a plurality of model memories”) within the corresponding DT-PEs is such that the plurality of data memories are a subset of the plurality of model memories, thus corresponding to “a data memory of the plurality of model memories configured to store the learning data”, which is a re-phrasing of a similar claim limitation found in Claim 1 (Owaida p.3 col.1 3rd paragraph – col.2 2nd paragraph (Section IV.A. DT-PE Memory Layout): “The architecture of the DT-PE unit is depicted in Figure 2. The DT-PE unit consists of two types of components: local memories store the tree ensemble and the data example’s features, and a datapath evaluates a tree node for input data examples. … There are two types of local memories in the DT-PE unit: tree memory, and data memory. The tree memory either stores one big tree up to 8192 nodes (decision and leaf nodes), or multiple trees sharing equally overall memory capacity. The tree nodes are stored as a one dimensional array (Figure 2). The storage scheme assumes a full binary tree with no missing nodes and every node stored at a dedicated location. … The data memory stores the incoming data examples and has a capacity of 4096 features (floating point).”).).
Regarding amended Claim 8, 
Chen in view of Owaida, in further view of Nishiyama teaches
(Currently Amended) The learning device according to claim 1, wherein each of the plurality of learners is configured to 
perform learning of a first node using the learning data acquired using a first address related to a storage destination of learning data corresponding to the first node of the decision tree in corresponding one of the plurality of data memories (Examiner’s note: Under its broadest interpretation, this claim limitation encompasses two aspects: “... using learning data acquired using a first address related to a storage destination of learning data corresponding to the first node of the decision tree” is directed towards accessing the tree memory information, while the term “… in corresponding one of the plurality of data memories” is directed towards accessing the corresponding data example for the decision tree node. Owaida teaches each DT-PE contains tree memory containing decision and leaf nodes (where the decision and leaf nodes in the tree memory correspond to “a first node”, “a second node”, “a next node”). The datapath pipeline for each DT-PE involves reading a tree node and reading a data example feature. Reading/writing tree nodes and data examples from/to memory involves accessing a pointer to the memory location (with a pointer corresponding to a memory address). Hence, the software driver executing on a compute-unit will perform two operations: read a parent node from the stored decision tree (corresponding to “… a first node using learning data acquired using a first address related to a storage destination of learning data corresponding to the first node of the decision tree”), and read a corresponding data example feature from the data memory (with the reading of the data example feature corresponding to “… in corresponding one of the plurality of data memories”). 
Collectively, both operations correspond to “perform learning of a first node using the learning data acquired using a first address related to a storage destination of learning data corresponding to the first node of the decision tree in corresponding one of the plurality of data memories” (Owaida p.3 col.2 Section IV.B. DT-PE Datapath 1st paragraph: “The DT-PE’s datapath pipeline consists of four operations: reading a tree node from the tree memory; reading the corresponding data example feature from the data memory …”).), and 
output a second address related to a storage destination of the learning data that branches from the first node (Examiner’s note: Owaida teaches each DT-PE contains tree memory containing decision and leaf nodes (where the decision and leaf nodes in the tree memory correspond to “a first node”, “a second node”, “a next node”). The datapath pipeline for each DT-PE involves reading a tree node and reading a data example feature (Owaida p.3 col.2 Section IV.B. DT-PE Datapath 1st paragraph). Owaida uses a formula for computing the pointer (corresponding to a memory address, “a second address”) for a child node relative to the parent node pointer (also corresponding to a memory address, “a first address”, with the relationship between the child node and a parent node being one in which the child node branches from the parent node). Hence this child node pointer corresponds to “output a second address related to a storage destination of the learning data that branches from the first node” (Owaida p.3 col.1 Section IV.A. DT-PE Memory Layout, 1st paragraph: “…The tree memory either stores one big tree up to 8192 nodes (decision and leaf nodes), or multiple trees sharing equally overall memory capacity. The tree nodes are stored as a one-dimensional array (Figure 2). The storage scheme assumes a full binary tree with no missing nodes and every node stored at a dedicated location. Each tree consumes a memory footprint equaling                         
2^{MAX_TREE_DEPTH}. … If a tree node is pruned, its dedicated memory location stays empty and is not used by another tree node. Such a storage layout allows the calculation of the child node pointer using the parent pointer as follows: child_pointer = (parent_pointer << 1)+1+GO_RIGHT where GO_RIGHT either equals 1 or 0, based on the comparison result of the parent node threshold and the corresponding feature values.”).), and 
the learning device further comprises a plurality of managers each corresponding to one of the plurality of learners, each of the managers being configured to calculate a third address related to a storage destination of the learning data corresponding to a second node as a next node of the first node using the first address and the second address output from the learner (Examiner’s note: Under its broadest interpretation, “... a third address related to a storage destination of the learning data corresponding to the second node as a next node of the first node using the first address and the second address output from the learning unit” is directed towards accessing the tree memory information. Furthermore, “a plurality of managers” is being interpreted under 35 U.S.C. 112(f) as a plurality of address managers implemented in a control unit of a CPU (specified in applicant’s specification p.104: “The control module 15 is an arithmetic module that controls learning by the GBDT … includes the CPU 10 and the address manager 12 (manager). The CPU 10 includes the control unit 11.”), or its equivalent (such as executable instructions running on a processor core or processing element). Owaida teaches that each DT-PE contains tree memory containing decision and leaf nodes (where the decision and leaf nodes in the tree memory correspond to “a first node”, “a second node”, “a next node”). The datapath pipeline for each DT-PE involves reading a tree node and reading a data example feature (Owaida p.3 col.2 Section IV.B. DT-PE Datapath 1st paragraph). Owaida uses a formula for computing the pointer (corresponding to a memory address, “a second address”) for a child node relative to the parent node pointer (also corresponding to a memory address, “a first address”, with the relationship between the child node and a parent node being one in which the child node branches from the parent node). 
After computing the child node pointer (corresponding to a “second address”) based on the formula taught in Owaida p.3 col.1 Section IV.A. DT-PE Memory Layout, 1st paragraph, this child node pointer is fed back into the first operation (i.e., reading a tree node from the tree memory) as the next decision node pointer for the next tree level. Hence, this next decision node pointer represents “a third address related to a storage destination of the learning data corresponding to a second node as a next node of the first node …”. The determination of this next pointer requires using the formula taught in Owaida p.3 col.1 Section IV.A. DT-PE Memory Layout, 1st paragraph, which is based on using a parent node pointer (“first address”) to produce the child node pointer (“second address output from learning unit”). Hence collectively, this iterative operation of producing next pointers for the next tree level based on the parent and child pointers from the previous tree level represents a process to “… calculate a third address related to a storage destination of the learning data corresponding to a second node as a next node of the first node using the first address and the second address output from the learning unit”. The scheduler for the software driver executing on the compute-units coordinates the task of parallelizing the processing of different examples (Owaida p.2 col.2 Section III. Classifier Engine Overview 5th paragraph), with the software driver calculating node pointers based on the tree data structure (Owaida p.2 Section III. Classifier Engine Overview 3rd paragraph: “The user passes a pointer to the test data, and a data structure describing the tree ensemble model …”), hence each of the compute-units/DT-PEs also functions as an address manager, thus corresponding to “a plurality of managers each corresponding to one of the plurality of learning units, each of the managers configured to calculate a third address …” (Owaida p.3 col.2 Section IV.B. 
DT-PE Datapath: “The DT-PE’s datapath pipeline consists of four operations: reading a tree node from the tree memory; reading the corresponding data example feature from the data memory; comparing the tree node threshold to the feature value; and either computing the next decision node pointer or reading the leaf node. These are the operations required to evaluate one tree level for the input data example. To evaluate all the tree levels, the next decision node pointer is fed back to the first operation to continue processing subsequent levels. These four operations are iterated until a leaf node is reached …”).).  
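For illustration only, the four-operation datapath quoted above can be sketched in software as follows. This is a minimal sketch under stated assumptions, not Owaida's implementation: the TreeNode fields, the heap-style pointer rule (children of pointer p at 2p+1 and 2p+2), and the convention that a feature greater than the threshold branches right are all hypothetical, since the cited child-pointer formula is not reproduced in this action.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TreeNode:
    """One tree-memory entry (hypothetical layout, not Owaida's encoding)."""
    is_leaf: bool
    feature_index: int = 0   # which feature of the data example to read
    threshold: float = 0.0   # branch criterion stored in a decision node
    leaf_value: float = 0.0  # value returned when a leaf node is reached


def child_pointer(parent_ptr: int, go_right: bool) -> int:
    # ASSUMPTION: implicit heap layout, children of p stored at 2p+1 and 2p+2.
    return 2 * parent_ptr + (2 if go_right else 1)


def evaluate_tree(tree_memory: List[TreeNode], example: List[float]) -> float:
    ptr = 0  # pointer to the root node
    while True:
        node = tree_memory[ptr]                # read a tree node from tree memory
        if node.is_leaf:
            return node.leaf_value             # read the leaf node and stop
        feature = example[node.feature_index]  # read the data example feature
        go_right = feature > node.threshold    # compare threshold to feature (assumed convention)
        ptr = child_pointer(ptr, go_right)     # next pointer is fed back to the first step


example_tree = [
    TreeNode(is_leaf=False, feature_index=0, threshold=0.5),  # ptr 0: root
    TreeNode(is_leaf=True, leaf_value=1.0),                   # ptr 1: left child of root
    TreeNode(is_leaf=False, feature_index=1, threshold=2.0),  # ptr 2: right child of root
    TreeNode(is_leaf=True),                                   # ptr 3: unused padding
    TreeNode(is_leaf=True),                                   # ptr 4: unused padding
    TreeNode(is_leaf=True, leaf_value=2.0),                   # ptr 5: left child of ptr 2
    TreeNode(is_leaf=True, leaf_value=3.0),                   # ptr 6: right child of ptr 2
]
```

In this sketch the pointer held in ptr at the start of an iteration plays the role of the parent node pointer, the value returned by child_pointer plays the role of the child node pointer, and the same value reused on the next iteration plays the role of the next decision node pointer.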
Claims 6-7 are rejected under 35 U.S.C. 103 as being unpatentable over 
Chen et al., XGBoost: A Scalable Tree Boosting System, June 10 2016 [hereafter referred as Chen] in view of Owaida et al., Scalable Inference of Decision Tree Ensembles: Flexible Design for CPU-FPGA Platforms, 2017 [hereafter referred as Owaida], in further view of Nishiyama et al., U.S. PGPUB 2011/0178976, published 7/21/2011 [hereafter referred as Nishiyama] as applied to Claim 1; in even further view of Kamiya et al., WO2020090413, priority to JP2018-025795 filed 10/31/2018 [hereafter referred as Kamiya].
Regarding amended Claim 6, 
Chen in view of Owaida, in further view of Nishiyama as applied to Claim 1 teaches
(Currently Amended) The learning device according to claim 1.
However, Chen in view of Owaida, in further view of Nishiyama does not teach
… wherein the discriminator comprises performance calculating circuitry configured to calculate an index value of recognition performance of the learned decision tree based on the sample weight corresponding to the learning data stored in the corresponding data memory.  
Kamiya teaches
… wherein the discriminator comprises performance calculating circuitry configured to calculate an index value of recognition performance of the learned decision tree based on the sample weight corresponding to the learning data stored in the corresponding data memory (Examiner’s note: Under its broadest reasonable interpretation, the term “a performance calculating circuitry” broadly recites any circuitry (with associated set of steps or instructions) that performs performance calculations. Kamiya teaches a classification device containing a learning device that retrieves feature data to perform a binary classification, and uses an optimization unit that receives this feature data (“learning data stored in the corresponding data memory”) and a corresponding score from a scoring calculation unit (“the sample weight”) to maximize an objective function by performing a calculation that approximates a portion of the AUC used for classifying the data into either a positive example or a negative example. Hence, the CPU implementing an optimization unit (using the scores from the scoring calculation unit) functions as performance calculating circuitry, with the AUC value being defined as a value that measures the correctness (or accuracy) of the positive example or negative example classification that is used as an index of classification performance. Kamiya further teaches the optimization unit is designed such that different types of objective functions can be applied, including probability gradient descent, or can be rewritten to fit the needs of the problem to be solved (Kamiya [0050]), and hence can be adapted to other binary classification problems. 
Hence, using this optimization unit in the context of a decision tree classification (where decision tree classification is a form of a binary classification performed between a parent node and its two child nodes) allows this optimization unit to represent “… a performance calculating circuitry configured to calculate an index value of recognition performance of the learned decision tree based on the sample weight corresponding to the learning data stored in the corresponding data memory” (Kamiya Figure 4 and [0030]-[0039]: “… The control unit 15 is implemented using a central processing unit (CPU) or the like, and executes a processing program stored in the memory. … the control unit 15 functions as … an optimization unit 15d … the classification device 10 may be separated into a learning device having a learning data acquisition unit 15a, a feature extraction unit 15b, a score calculation unit 15c, and an optimization unit 15d … The learning data acquisition unit 15a acquires learning data used for classification processing …  The feature extraction unit 15b extracts a feature amount of the acquired learning data as a preparation for use in processing of the optimization unit 15d … The weight W is output as a result of a classification process to be described later. An arbitrary value may be set to the initial value of the weight. … The optimization unit 15d is a learning unit. That is, the optimization unit 15d learns the weight W so as to maximize the approximated objective function by approximating the portion of the nonlinear function of the objective function representing the PAUC to the AUC of a partial section of the ROC curve for the classifier for classifying the data into either a positive example or a negative example by the calculated score. 
… Specifically, the optimization unit 15d determines a weight W for maximizing the PAUC at an arbitrary partial section [alpha, beta] of the ROC curve represented by formula (2)”; and [0006]: “… the AUC is a value in which the correctness of both the positive example and the negative example is taken into consideration.. For this reason, AUC is effective as an index of classification performance as compared with a correct answer rate or the like that is calculated as 99% … in a binary classification problem …”).).
Both Chen in view of Owaida, in further view of Nishiyama and Kamiya are analogous art since they teach using learning devices to perform binary classification.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the decision tree learning and classification algorithm performed in the learning units taught in Chen in view of Owaida, in further view of Nishiyama and enhance it to include an AUC calculator taught in Kamiya as a way to determine a predictive accuracy of a binary classification task. The motivation to combine is taught in Kamiya, as in general, calculating the area under the curve is useful to determine the predictive accuracy of a system running a binary classification task. Performing these calculations within the system allows the model to determine an error detection rate, which is a useful metric used in conjunction with the prediction to determine whether or not the model is correctly predicting an expected result (true positive) as a result of the binary classification, thereby providing an indication of the reliability of the classification to a user and possible insights for reducing the erroneous detection rate of the classification (Kamiya paragraphs [0003]-[0007]: “In a binary classification problem in which the feature amount of data to be classified is classified into a positive example and a negative example, the performance of the classifier is defined using true positive (Tp, True positive), false positive (fp, false positive), false negative (fn, false negative), and true negative (tn, true negative). The true positive means that the positive example is correctly classified as a positive example, and the false positive means that the negative example is erroneously classified as a positive example. Further, the false negative means that the positive example is erroneously classified as a negative example, and the true negative means that the negative example is correctly classified as a negative example. …. 
in a task in practical use, the detection rate (Tpr) in the region where the erroneous detection rate (Fpr) is low is important. For example, when it is determined whether or not cancer is cancer, if the erroneous detection rate is high, it is determined that cancer is erroneously determined to a large number of normal persons. Therefore, it is desirable to optimize the detection rate while suppressing the erroneous detection rate.”).
Regarding amended Claim 7, 
Chen in view of Owaida, in further view of Nishiyama, in even further view of Kamiya teaches
(Currently Amended) The learning device according to claim 6, wherein the performance calculating circuitry is configured to calculate an Area Under the Curve (AUC) as the index value (Examiner’s note: As indicated earlier, Kamiya teaches a classification device containing a learning device which retrieves feature data to perform a binary classification, and uses an optimization unit to learn the weight by performing a calculation that approximates a portion of the AUC curve used for classifying the data into either a positive example or a negative example based on scores determined by a scoring calculation unit. Hence, the optimization unit (using the scores from the scoring calculation unit) functions as a performance calculator, with the AUC value being defined as a value that measures the correctness (or accuracy) of the positive example or negative example classification that is used as an index of classification performance (Kamiya paragraph [0006]: “… the AUC is a value in which the correctness of both the positive example and the negative example is taken into consideration.. For this reason, AUC is effective as an index of classification performance…”).).  
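For illustration only, an AUC used as an index of binary classification performance can be computed from scores and labels as sketched below. This is a generic pairwise-ranking formulation and is an assumption made here for clarity; it is not the partial-AUC (PAUC) optimization described in Kamiya, and the function name auc and its parameters are hypothetical.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0   # positive example scored above negative example
            elif p == n:
                correct += 0.5   # tie contributes half a correct pair
    return correct / (len(pos) * len(neg))
```

A value of 1.0 means every positive example is scored above every negative example, and a value of 0.5 corresponds to a random ranking, which is why the measure takes both the positive and the negative examples into consideration.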
Claims 9-11 and 14-15 are rejected under 35 U.S.C. 103 as being unpatentable over 
Chen et al., XGBoost: A Scalable Tree Boosting System, June 10 2016 [hereafter referred as Chen] in view of Owaida et al., Scalable Inference of Decision Tree Ensembles: Flexible Design for CPU-FPGA Platforms, 2017 [hereafter referred as Owaida].
Regarding amended Claim 9, 
Chen teaches
A learning method for a learning device configured to perform learning of a decision tree by gradient boosting, the learning method comprising: 
learning the decision tree using learning data divided to be stored in a plurality of … memories by a plurality of learners (Examiner’s note: Under its broadest reasonable interpretation, the term “a plurality of learners” broadly recites a plurality of components performing steps related to decision tree learning, and the term “learning data” is interpreted as encompassing aspects of the input data instances (containing feature information) as well as data stored in the decision tree nodes. Chen teaches performing learning of a gradient boosting decision tree using a plurality of processor cores, with the input data stored in memory blocks for performing a greedy algorithm for finding split point values (Chen p.3 Algorithm 1), where this algorithm corresponds to decision tree learning as it finds the best split points for the decision tree (Chen p.3 col.1 4th paragraph (Section 2.2 Gradient Tree Boosting) and p.3 col.2 Section 3.1 Basic Exact Greedy Algorithm 1st paragraph). Chen teaches the input data is a memory block that can be further sub-sampled into subsets of columns in a block, with each column representing a different piece of memory (where this input data represents an aspect of “learning data” being divided into subsets of columns representing “learning data divided …”). Chen further teaches the gradient statistics associated with decision tree nodes (another aspect of “learning data”) must be pre-fetched into an internal buffer structure associated with each thread (associated with each CPU processor core) that fits the cache size to prevent unnecessary CPU cache misses, where this internal buffer for each thread (associated with each CPU processor core) corresponds to “learning data divided to be stored in a plurality of … memories” (Chen p.8 col.1 6th paragraph (Section 6.2 Dataset and Setup); p.5 col.2 Section 4.1 Column Block for Parallel Learning; and p.6 col.1 Section 4.2 Cache-aware Access 1st-2nd paragraphs).); and 
storing each piece of data of the learned decision tree learned by a corresponding one of the plurality of … memories (Examiner’s note: Under its broadest reasonable interpretation, the term “data of the decision tree learned” is interpreted as referring to the data being learned at each node, which includes the gradient values and the split point value for each node. As indicated earlier, Chen teaches computing scores for each leaf nodes using the greedy algorithm (Chen p.3 Algorithm 1), which involves computing intermediate gradient values $g_j$, $h_j$, and accumulated gradients $G_L$, $G_R$, $H_L$, $H_R$ (representing Chen p.3 Figure 2 and p.3 Eq. 6 and 7), where these gradient statistics and associated scores of the leaf nodes and split point values represent “data of the decision tree learned by corresponding one of the plurality of learning units”. As indicated earlier, Chen teaches storing these gradients and split point values in an internal buffer structure associated with each thread, with the storing of this gradient value and the split point value for each node in an internal buffer structure associated with each thread (CPU processor core) representing “a plurality of … memories each configured to store data of the decision tree learned by corresponding one of the plurality of learning units” (Chen p.6 col.1 Section 4.2 Cache-aware Access 1st-2nd paragraphs).) …  
… updating gradient information by obtaining a sum total of leaf weights of leaves in the decision tree (Examiner’s note: Under its broadest reasonable interpretation, the term “gradient information of the learning data based on a sum total of leaf weight” is directed towards performing an XGBoost greedy algorithm determination. Chen teaches that the decision tree ensemble model defined in Chen p.2 Eq.(2) needs to be trained in an additive manner in order to maximize the objective function, which is done by greedily adding instances associated with tree nodes, resulting in the objective function of finding an optimal split point value calculated through the calculation of gradients and weights shown in Chen p.3 Eq.(7). These summed gradients shown in Chen Figure 2 represent parameters for each leaf node in the decision tree, and the relationship between gradients and weight values is shown in Chen p.3 Eq.(5), with the gradients $g_i$ and $h_i$ shown in Chen Figure 2 and Chen p.3 Eqs.(5) and (7) representing “gradient information of the learning data”. A person having ordinary skill in the art can take Chen p.3 Eq.(5) and re-write its equivalents such that the gradients are computed in terms of a weight at a j-th node, thereby corresponding to “update gradient information of the learning data … based on a sum total of leaf weight”, where the sum total of leaf weight represents the continuous score on an i-th leaf (Chen p.2 col.1 last paragraph-col.2 1st paragraph (Section 2.1 Regularized Learning Objective); p.2 col.2 Section 2.2 Gradient Tree Boosting).) …
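For illustration only, the relationship between the gradient statistics and the leaf weight (Chen p.3 Eq.(5)) and the split score built from the accumulated gradients (Chen p.3 Eq.(7)) can be sketched as follows. The function names and the default values chosen for the regularization terms lam and gamma are assumptions made here; the sketch is not Chen's implementation.

```python
from typing import Sequence


def leaf_weight(g: Sequence[float], h: Sequence[float], lam: float = 1.0) -> float:
    """Optimal leaf weight: -(sum of g_i) / (sum of h_i + lambda)."""
    return -sum(g) / (sum(h) + lam)


def split_gain(
    g_left: Sequence[float],
    h_left: Sequence[float],
    g_right: Sequence[float],
    h_right: Sequence[float],
    lam: float = 1.0,
    gamma: float = 0.0,
) -> float:
    """Loss reduction of a candidate split from accumulated G_L, H_L, G_R, H_R."""
    G_L, H_L = sum(g_left), sum(h_left)
    G_R, H_R = sum(g_right), sum(h_right)

    def score(G: float, H: float) -> float:
        # Structure score of one node: G^2 / (H + lambda)
        return G * G / (H + lam)

    # Gain = 1/2 * (left score + right score - unsplit parent score) - gamma
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R) - score(G_L + G_R, H_L + H_R)) - gamma
```

A split that separates instances with opposite-signed first-order gradients yields a positive gain, whereas a split that leaves the sums unchanged in proportion yields none, which is how the greedy algorithm ranks candidate split points.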
While Chen teaches a plurality of learners performing learning in a tree ensemble, where each node in the tree contains feature information used to determine split points and the associated gradients and continuous scores (weights) for each leaf, Chen does not explicitly teach
… learning data divided to be stored in a plurality of data memories …
… storing each piece of data … one of a plurality of model memories …  
… reading out each feature amount of the learning data from the data memory, and 
based on a branch condition for a node of the decision tree, the branch condition being derived based on the feature amount, discriminating a lower node to which the learning data read out from the data memory is to branch from the node …
… a sum total of leaf weights of leaves … stored in the model memory.
Owaida teaches
… learning data divided to be stored in a plurality of data memories (Examiner’s note: As indicated earlier, Owaida teaches a hybrid CPU-FPGA architecture for classifying gradient boosted decision trees, where the FPGA architecture contains a plurality of compute-units (each compute-unit represented by a processor core, Owaida p.1 col.2 3rd paragraph), with the compute-units associated with a plurality of decision-tree processing elements (DT-PEs). The classification processing is managed by a software driver running on the compute-units (Owaida p.2 col.2 2nd paragraph). Owaida teaches each DT-PE contains a data memory which stores incoming input examples (corresponding to one aspect of “learning data”) with a fixed capacity (as each DT-PE contains a read and write port to allow pre-fetching of next data examples while the current data examples are processed). Owaida further teaches the fixed capacity of the data memory indicates that a large number of data examples cannot be stored and read all at once, and hence must be split up to read in a number of data examples at a time, resulting in this data memory within each compute-unit/DT-PE pair corresponding to “… learning data divided to be stored in a plurality of data memories” (Owaida p.3 Figure 2 (Left and Middle); p.2 col.2 Section III. Classifier Engine Overview 5th paragraph; and p.3 col.2 2nd paragraph (Section IV.A. DT-PE Memory Layout)).); …
… storing each piece of data … one of a plurality of model memories (Examiner’s note: Owaida teaches each DT-PE in a compute-unit contains data memory and tree memory, where the tree memory stores the node information in the decision and leaf nodes (representing “data of the decision tree learned…”) for a single tree or multiple trees. Owaida further teaches an associated shared memory containing the request and the response result for the decision tree, where this shared memory is controlled by an I/O unit. The tree memory, data memory, and shared memory collectively represent “model memory” for the decision tree, and hence the plurality of DT-PEs containing a plurality of tree and data memories (along with this shared memory) represent “a plurality of model memories … to store data of the decision tree learned …” (Owaida p.2 col.2 4th-6th paragraphs; p.3 Figure 2, and p.3 col.1 3rd paragraph – col.2 2nd paragraph (Section IV.A. DT-PE Memory Layout)).) …
… reading out each feature amount of the learning data from the data memory (Examiner’s note: Under its broadest reasonable interpretation, the term “a discriminator” broadly recites a component that performs a set of steps related to decision tree processing (classification). Owaida teaches a DT-PE that reads a feature from an example stored in the data memory (as shown in the block labeled ‘Read Feature’ in Owaida p.3 Figure 2 (Middle), thus representing a component to “read out each feature amount of the learning data from the data memory”). The scheduler for the software driver executing on the compute-units coordinates the task of parallelizing the processing of different examples (Owaida p.2 col.2 Section III. Classifier Engine Overview 5th paragraph), such that each compute-unit/DT-PE pair performs the functionality of a discriminator (Owaida p.3 Figure 2 (Middle); and p.3 col.2 Section IV.B. DT-PE Datapath: “The DT-PE’s datapath pipeline consists of four operations: … reading the corresponding data example feature from the data memory …”).), and 
based on a branch condition for a node of the decision tree, the branch condition being derived based on the feature amount, discriminating a lower node to which the learning data read out from the data memory is to branch from the node (Examiner’s note: As indicated earlier, Owaida teaches each compute-unit/DT-PE pair performs the functionality of a discriminator through the reading out of feature information from a data memory (Owaida p.3 Figure 2 (Middle); and p.2 col.2 Section III. Classifier Engine Overview 5th paragraph). Owaida teaches the tree node representing a decision node in a decision tree and containing criteria for choosing a left or right child node in the next level (as shown in the block labeled ‘Read Tree Node’ in Owaida p.3 Figure 2 (Middle)), where the criteria stored in the tree node represents thresholds (Owaida p.3 col.1 Section IV.A. DT-PE Memory Layout, 1st paragraph). Owaida teaches a comparison is done with the feature read from the data memory, in order to choose the proper child node address in the next level (Owaida p.3 Figure 2 (Middle)). This comparison done with the threshold and the feature to select the child node represents a branch condition, and the result of the comparison (to choose either a left or right child node for a non-leaf node) represents the act of performing a classification to choose the proper child node based on the comparison result (Owaida p.3 Figure 2 (Middle); p.3 col.2 Section IV.B. DT-PE Datapath; p.2 col.1 Section II.B. Decision Tree 1st paragraph; and p.3 col.1 Section IV.A. DT-PE Memory Layout, 1st paragraph).) …  
… a sum total of leaf weights of leaves … stored in the model memory (Examiner’s note: As indicated earlier, Owaida teaches a hybrid CPU-FPGA architecture containing tree memory storing tree data for decision and leaf nodes, where the architecture contains multiple compute-units containing decision tree processing elements that process one or more decision trees, and a reducer unit that sums up the leaf values from all processing elements in the same compute-unit (representing a sum total of leaf weights of leaves) and writes it to a designated shared memory location monitored by the classifier I/O Unit on the FPGA. As indicated earlier, Owaida further teaches that the data transfer between the reducer and the tree memory is controlled by an I/O unit and a combiner that accumulates partial results and writes back the classification results to the shared memory (where the data memory, tree memory, and shared memory were earlier identified collectively representing “model memory”), thus representing “a sum total of leaf weights of leaves … stored in the model memory” (Owaida p.2 col.2 Section III. Classifier Engine Overview 4th-6th paragraphs; p.3 Figure 2 Tree memory layout (Right)).).
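For illustration only, the summing of leaf values by a reducer and the accumulation of partial results by a combiner can be sketched as below. The function names, the example leaf values, and the assumption that each partial result comes from a different compute-unit are hypothetical and are introduced here only to make the data flow concrete; the sketch does not reproduce Owaida's hardware design.

```python
from typing import List, Sequence


def reducer(pe_leaf_values: Sequence[float]) -> float:
    """Sum the leaf values produced by the processing elements of one compute-unit."""
    return sum(pe_leaf_values)


def combiner(partial_sums: Sequence[float]) -> float:
    """Accumulate partial results into the value that would be written back."""
    total = 0.0
    for partial in partial_sums:
        total += partial
    return total


# Hypothetical leaf values: two compute-units, each with three processing elements.
compute_units: List[List[float]] = [[0.25, -0.5, 1.0], [0.5, 0.25, -0.25]]
partials = [reducer(values) for values in compute_units]
ensemble_score = combiner(partials)
```

Here each entry of partials is a sum total of leaf values for one compute-unit, and ensemble_score is the accumulated result.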
Both Chen and Owaida are analogous art since they teach hardware using gradient boosting decision trees.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the plurality of learning units and plurality of memories taught in Chen and apply them to the hybrid CPU-FPGA architecture taught in Owaida as a way to speed up the classification processing of large ensemble decision trees with large input datasets. The motivation to combine is taught in Owaida, as provided in the prior art claim mapping of Claim 1 recited above.
Regarding new Claim 10,
Chen in view of Owaida teaches
The learning method of claim 9, wherein the plurality of learners are configured to cause the plurality of model memories to store the same data of the learned decision tree (Examiner’s note: Under its broadest reasonable interpretation, this claim limitation exhibits a 112(b) indefiniteness issue, and hence for purposes of examination, this limitation is interpreted as broadly reciting that each decision tree processing is duplicated and processed in a pipeline parallelism fashion (i.e., each model memory stores the learned data that is from the processing of a particular (same) tree vs. processing different trees). As indicated earlier, Owaida teaches each DT-PE stores the node information for the decision and leaf nodes. The decision and leaf nodes contain information such as the gradients and split point values taught in Chen p.3 Algorithm 1 and p.6 col.2 2nd paragraph (Section 4.2 Cache-aware Access), and the split point values represent the branch condition criteria for performing classification (Owaida p.2 Figure 1 and p.2 col.1 Section II.B. Decision Tree 1st paragraph). Collectively, the information stored in the nodes taught in Chen and Owaida corresponds to “data of the decision tree learned…” held in tree memory for a single tree or multiple trees, where the tree memory is identified earlier as being one part of “a model memory”, and a plurality of DT-PEs will correspond to “a plurality of model memories”. Owaida teaches mapping of the decision trees such that different trees are not spread across all compute-units/DT-PEs, and instead are cloned across the number of compute-units, resulting in the plurality of tree memories containing the same decision tree information during classification processing, thus representing “wherein the plurality of learning units are configured to cause the plurality of model memories to store the same data of the learned decision tree” (Owaida p.3 col.1 Section IV. Decision Tree Processing Element (DT-PE), 1st-2nd paragraphs; and p.4 col.1 Section V.A. Mapping Trees on FPGA Memory).).  
Regarding new Claim 11, 
Chen in view of Owaida teaches
(New) The learning method of claim 9, wherein each of the plurality of learners comprises a data memory of the plurality of model memories configured to store the learning data (Examiner’s note: As indicated earlier, Owaida teaches each DT-PE contains data memory and tree memory, where the tree memory stores the node information (decision and leaf nodes, thus corresponding to “data of the decision tree learned…”) for a single tree or multiple trees, and the data memory stores the incoming input examples. Both the tree memory and data memory correspond to “model memory” for the decision tree, and the incoming input examples stored in the data memory correspond to “learning data”. Hence, the arrangement of the plurality of tree and data memories (“a plurality of model memories”) within the corresponding DT-PEs is such that the plurality of data memories are a subset of the plurality of model memories, thus representing “a data memory of the plurality of model memories configured to store the learning data”, which is a re-phrasing of a similar claim limitation found in Claim 1 (Owaida p.3 col.1 3rd paragraph – col.2 2nd paragraph (Section IV.A. DT-PE Memory Layout)).).
Regarding new Claim 14, 
Chen in view of Owaida teaches
(New) The learning method of claim 9, wherein each of the plurality of learners is configured to 
perform learning of a first node using the learning data acquired using a first address related to a storage destination of learning data corresponding to the first node of the decision tree in corresponding one of the plurality of data memories (Examiner’s note: Under its broadest interpretation, this claim limitation encompasses two aspects: “… a first node using the learning data acquired using a first address related to a storage destination of learning data corresponding to the first node of the decision tree” is directed towards accessing the tree memory information, while the term “… in corresponding one of the plurality of data memories” is directed towards accessing the corresponding data example for the decision tree node. As indicated earlier, Owaida teaches each DT-PE contains tree memory containing decision and leaf nodes (where the decision and leaf nodes in the tree memory correspond to “a first node”, “a second node”, “a next node”). The datapath pipeline for each DT-PE involves reading a tree node and reading a data example feature. Reading/writing tree nodes and data examples from/to memory involves accessing a pointer to the memory location (with a pointer corresponding to a memory address). Hence, the software driver executing on a compute-unit will perform two operations: read a parent node from the stored decision tree (corresponding to “… a first node using learning data acquired using a first address related to a storage destination of learning data corresponding to the first node of the decision tree”), and read a corresponding data example feature from the data memory (with the reading of the data example feature corresponding to “… in corresponding one of the plurality of data memories”). 
Collectively, both operations correspond to “perform learning of a first node using the learning data acquired using a first address related to a storage destination of learning data corresponding to the first node of the decision tree in corresponding one of the plurality of data memories” (Owaida p.3 col.2 Section IV.B. DT-PE Datapath 1st paragraph: “The DT-PE’s datapath pipeline consists of four operations: reading a tree node from the tree memory; reading the corresponding data example feature from the data memory …”).), and 
outputs a second address related to a storage destination of the learning data that branches from the first node (Examiner’s note: As indicated earlier, Owaida teaches each DT-PE contains tree memory containing decision and leaf nodes (where the decision and leaf nodes in the tree memory correspond to “a first node”, “a second node”, “a next node”). The datapath pipeline for each DT-PE involves reading a tree node and reading a data example feature (Owaida p.3 col.2 Section IV.B. DT-PE Datapath 1st paragraph). Owaida uses a formula for computing the pointer (corresponding to a memory address, “a second address”) for a child node relative to the parent node pointer (also corresponding to a memory address, “a first address”, with the relationship between the child node and a parent node being one in which the child node branches from the parent node). Hence this child node pointer corresponds to “output a second address related to a storage destination of the learning data that branches from the first node” (Owaida p.3 col.1 Section IV.A. DT-PE Memory Layout, 1st paragraph).), and 
the learning device further comprises a plurality of managers each corresponding to one of the plurality of learners, each of the manager[s] calculating a third address related to a storage destination of the learning data corresponding to a second node as a next node of the first node using the first address and the second address output from the learner (Examiner’s note: Under its broadest interpretation, “... a third address related to a storage destination of the learning data corresponding to the second node as a next node of the first node using the first address and the second address output from the learning unit” is directed towards accessing the tree memory information. As indicated earlier, Owaida teaches that each DT-PE contains tree memory containing decision and leaf nodes (where the decision and leaf nodes in the tree memory correspond to “a first node”, “a second node”, “a next node”). The datapath pipeline for each DT-PE involves reading a tree node and reading a data example feature (Owaida p.3 col.2 Section IV.B. DT-PE Datapath 1st paragraph). Owaida uses a formula for computing the pointer (corresponding to a memory address, “a second address”) for a child node relative to the parent node pointer (also corresponding to a memory address, “a first address”, with the relationship between the child node and a parent node being one in which the child node branches from the parent node). After Owaida p.3 col.1 Section IV.A. DT-PE Memory Layout, 1st paragraph, this child node pointer is fed back into the first operation (i.e., reading a tree node from the tree memory) as the next decision node pointer for the next tree level. Hence, this next decision node pointer represents “a third address related to a storage destination of the learning data corresponding to a second node as a next node of the first node …”. The determination of this next pointer requires using the formula taught in Owaida p.3 col.1 Section IV.A. DT-PE Memory Layout, 1st paragraph, which is based on using a parent node pointer (“first address”) to produce the child node pointer (“second address output from learning unit”). Hence collectively, this iterative operation of producing next pointers for the next tree level based on the parent and child pointers from the previous tree level represents a process to “… calculate a third address related to a storage destination of the learning data corresponding to a second node as a next node of the first node using the first address and the second address output from the learning unit”. The scheduler for the software driver executing on the compute-units coordinates the task of parallelizing the processing of different examples (Owaida p.2 col.2 Section III. Classifier Engine Overview 5th paragraph), with the software driver calculating node pointers based on the tree data structure (Owaida p.2 Section III. Classifier Engine Overview 3rd paragraph: “The user passes a pointer to the test data, and a data structure describing the tree ensemble model …”), hence each of the compute-units/DT-PEs also functions as an address manager, thus corresponding to “a plurality of managers each corresponding to one of the plurality of learning units, each of the managers calculating a third address …” (Owaida p.3 col.2 Section IV.B. DT-PE Datapath).).
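For illustration only, the parent-to-child pointer relationship discussed in the note above can be sketched as follows. The formula in Owaida Section IV.A is not reproduced in this action, so the sketch assumes a conventional level-order array layout (root at index 0, children of index i at 2i+1 and 2i+2). The names `child_pointers`, `traverse`, and the dictionary-based node format are hypothetical and are not taken from Owaida or from the application.

```python
def child_pointers(parent_ptr):
    """Return (left, right) child pointers for a node pointer.

    Assumed layout (hypothetical, not taken from Owaida): the tree is stored
    level by level in one array, root at index 0, children of index i at
    2*i + 1 and 2*i + 2.
    """
    return 2 * parent_ptr + 1, 2 * parent_ptr + 2


def traverse(tree, example):
    """Walk one data example from the root to a leaf.

    Returns the leaf value and the list of node pointers that were read.
    """
    ptr = 0  # pointer to the root node (the "first address" of the first level)
    visited = [ptr]
    while True:
        node = tree[ptr]  # read a tree node from the tree memory
        if node["leaf"]:
            return node["value"], visited
        left, right = child_pointers(ptr)  # candidate child pointers
        # Read the example's feature and compare it with the node threshold;
        # the selected child pointer is fed back as the next node pointer.
        ptr = left if example[node["feature"]] <= node["threshold"] else right
        visited.append(ptr)


# Small hypothetical tree; slots 3 and 4 are unused because node 1 is a leaf.
tree = [
    {"leaf": False, "feature": 0, "threshold": 0.5},
    {"leaf": True, "value": -1.0},
    {"leaf": False, "feature": 1, "threshold": 2.0},
    None,
    None,
    {"leaf": True, "value": 0.0},
    {"leaf": True, "value": 1.0},
]
```

In this sketch each loop iteration consumes the current pointer and produces the pointer for the next tree level, which is the feedback step the note above relies on.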
Regarding new Claim 15, 
Chen in view of Owaida teaches
(New) The learning method of claim 11, wherein the data memory is configured to store the training data divided within (Examiner’s note: Under its broadest reasonable interpretation, the phrase “to store the training data divided with a plurality of features” is interpreted to broadly recite a scenario involving reading in training data containing a large amount of feature information, where the training data is divided such that it requires multiple reads into a data memory. As indicated earlier, Chen teaches the input data is a memory block that can be further sub-sampled into subsets of columns in a block, with each column representing a different piece of memory (Chen p.8 col.1 6th paragraph (Section 6.2 Dataset and Setup); p.5 col.2 Section 4.1 Column Block for Parallel Learning: “The most time consuming part of tree learning is to get the data into sorted order. In order to reduce the cost of sorting, we propose to store the data in in-memory units, which we called block. Data in each block is stored in the compressed column (CSC) format, with each column sorted by the corresponding feature value. … we store the entire dataset in a single block and run the split search algorithm by linearly scanning over the pre-sorted entries. …. Collecting statistics for each column can be parallelized, giving us a parallel algorithm for split finding. … the column block structure also support column subsampling, as it is easy to select a subset of columns in a block.”).
Owaida further teaches a dedicated data memory with a single read port (256 bits width) that requires stitching together 7 Block RAM units in order to store a maximum of 4096 features, with additional capability of performing pre-fetching next data examples while available data examples are being processed, such that this set of functionality taught in Owaida is interpreted as providing support for multiple reads of training data to read in a training example containing a large amount of feature information (Owaida p.3 col.2 2nd paragraph: “The data memory stores incoming data examples and has a capacity of 4096 features (floating point). The data memory has one write and one read port, allowing pre-fetching the next data examples while available data examples are being processed. … the data memory has to deliver 32 Bytes per cycle to saturate the QPI bandwidth. Hence the data memory has a data line width of 256 bits … which requires stitching together 7 Block RAMs (BRAMS)… to be equal to 4096 features.”).).
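For illustration only, the arithmetic behind the "multiple reads" interpretation above can be sketched as follows. The 256-bit line width and the 4096-feature capacity are the figures quoted from Owaida; the 32-bit feature size is an assumption (single-precision floating point), and the name `line_reads` is hypothetical.

```python
LINE_WIDTH_BITS = 256     # data line width quoted from Owaida (32 bytes per cycle)
CAPACITY_FEATURES = 4096  # data memory capacity quoted from Owaida
FEATURE_BITS = 32         # assumption: single-precision floating-point features


def line_reads(num_features):
    """Number of 256-bit line reads needed to fetch one data example."""
    if not 0 <= num_features <= CAPACITY_FEATURES:
        raise ValueError("example does not fit in the data memory")
    features_per_line = LINE_WIDTH_BITS // FEATURE_BITS  # 8 under the assumption
    # Ceiling division using integers only.
    return (num_features + features_per_line - 1) // features_per_line
```

Under these assumptions, any example with more than 8 features spans more than one memory line, and an example that fills the data memory spans 512 lines.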
Claims 12-13 are rejected under 35 U.S.C. 103 as being unpatentable over 
Chen et al., XGBoost: A Scalable Tree Boosting System, June 10 2016 [hereafter referred as Chen] in view of Owaida et al., Scalable Inference of Decision Tree Ensembles: Flexible Design for CPU-FPGA Platforms, 2017 [hereafter referred as Owaida] as applied to Claim 9; in even further view of Kamiya et al., WO2020090413, priority to JP2018-025795 filed 10/31/2018 [hereafter referred as Kamiya].
Regarding new Claim 12, 
Chen in view of Owaida as applied to Claim 9 teaches
(New) The learning method of claim 9.
However, Chen in view of Owaida does not teach
… calculating an index value of recognition performance of the learned decision tree based on the sample weight corresponding to the learning data stored in the corresponding data memory.  
Kamiya teaches
… calculating an index value of recognition performance of the learned decision tree based on the sample weight corresponding to the learning data stored in the corresponding data memory (Examiner’s note: As indicated earlier, Kamiya teaches a classification device containing a learning device that retrieves feature data to perform a binary classification, and uses an optimization unit that receives this feature data (“learning data stored in the corresponding data memory”) and a corresponding score from a scoring calculation unit (“the sample weight”) to maximize an objective function by performing a calculation that approximates a portion of the AUC used for classifying the data into either a positive example or a negative example. Hence, the CPU implementing an optimization unit (using the scores from the scoring calculation unit) functions as performance calculating circuitry, with the AUC value being defined as a value that measures the correctness (or accuracy) of the positive example or negative example classification that is used as an index of classification performance. Kamiya further teaches the optimization unit is designed such that different types of objective functions can be applied, including probability gradient descent, or can be rewritten to fit the needs of the problem to be solved (Kamiya [0050]), and hence can be adapted to other binary classification problems. Hence, using this optimization unit in the context of a decision tree classification (where decision tree classification is a form of a binary classification performed between a parent node and its two child nodes) allows this optimization unit to perform “… calculating an index value of recognition performance of the learned decision tree based on the sample weight corresponding to the learning data stored in the corresponding data memory” (Kamiya Figure 4 and [0030]-[0039]; and [0006]).).
Both Chen in view of Owaida and Kamiya are analogous art since they teach using learning devices to perform binary classification.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the decision tree learning and classification algorithm performed in the learning units taught in Chen in view of Owaida and enhance it to include an AUC calculator taught in Kamiya as a 
Regarding new Claim 13, 
Chen in view of Owaida, in further view of Kamiya teaches
(New) The learning method of claim 12, further comprising: calculating an Area Under the Curve (AUC) as the index value (Examiner’s note: As indicated earlier, Kamiya teaches a classification device containing a learning device which retrieves feature data to perform a binary classification, and uses an optimization unit to learn the weight by performing a calculation that approximates a portion of the AUC used for classifying the data into either a positive example or a negative example based on scores determined by a scoring calculation unit. Hence, the optimization unit (using the scores from the scoring calculation unit) functions as a performance calculator, with the AUC value being defined as a value that measures the correctness (or accuracy) of the positive example or negative example classification that is used as an index of classification performance (Kamiya paragraph [0006]).).
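For illustration only, an AUC used as an index of binary classification performance can be sketched as follows. This is the standard pairwise definition of the full AUC (the fraction of positive/negative pairs in which the positive example receives the higher score, with ties counted as one half). It is not the partial-AUC approximation attributed to Kamiya above, which is not reproduced in this action, and the function name `auc` is hypothetical.

```python
def auc(scores, labels):
    """Area Under the ROC Curve from per-example scores and 0/1 labels."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive example ranked above negative example
            elif p == n:
                wins += 0.5   # a tie counts as half a correctly ordered pair
    return wins / (len(pos) * len(neg))
```

A value of 1.0 means every positive example is scored above every negative example, and 0.5 corresponds to scores that carry no ranking information.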

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure.
Mitchell et al., Accelerating the XGBoost algorithm using GPU computing, PeerJ Comput. Sci. 3:e127; DOI 10.7717/peerj-cs.127, Department of Computer Science, University of Waikato, Hamilton, New Zealand, published July 24 2017, 2017 Mitchell and Frank, 37 pages, where Mitchell teaches using a GPU-based hardware implementation to implement a parallelized version of the XGBoost/gradient boosting decision tree algorithm (pp.6-8 and pp.22-30), where the decision trees are learned on a GPU hardware architecture containing global and shared memory (pp.11-12), with the resulting GPU-based hardware implementation providing parallelization and improved speedup of the XGBoost algorithm (pp.20-22).
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  

Any inquiry concerning this communication or earlier communications from the examiner should be directed to WILLIAM WAI YIN KWAN whose telephone number is 303-297-4332. The examiner can normally be reached Monday-Friday 8:00am - 4:30pm PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Li B Zhen can be reached on 571-272-3768. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/WILLIAM WAI YIN KWAN/Examiner, Art Unit 2121
/Li B. Zhen/Supervisory Patent Examiner, Art Unit 2121