DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
The present application was filed on July 29, 2019. 
This office action is in response to Amendments and/or Remarks filed on March 15, 2022. In the current amendment, claims 1, 21, and 40 are amended. Claims 19-20, 22-39, and 41 were previously cancelled. Claims 1-18, 21, and 40 are pending. 
In response to the Amendments and/or Remarks filed on March 15, 2022, the 35 U.S.C. 101 rejection of claims 1-18, 21, and 40 made in the previous office action has been withdrawn. 

Specification
The abstract of the disclosure is objected to because it consists of two pages of WO 2018/144534 A1 rather than a separate abstract.  Correction is required.  See MPEP § 608.01(b).


Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-11, 17, 18, 21, and 40 are rejected under 35 U.S.C. 103 as being unpatentable over Kim et al. “A HIGHLY SCALABLE RESTRICTED BOLTZMANN MACHINE FPGA IMPLEMENTATION” in view of Dahl et al. “Training Restricted Boltzmann Machines on Word Observations”.
Regarding Claim 1, 
Kim teaches: 
A system, comprising: at least one processor; and at least one memory including program code which when executed by the at least one processor provides operations comprising: (Page 368, Section 4.1: “The system consists of a soft processor, a DDR2 SDRAM controller, and an RBM module. The soft processor used in our design is the Altera Nios II operating at a clock frequency of 100 MHz. The processor functions as the interface between the user and the RBM module via JTAGUART2 . The CPU also initializes the weights, reads in the visible neurons to SDRAM, initiates the algorithm, and returns the results to the user” teaches a computer based implementation, including a processor and memory)
partitioning, into a first batch of data and a second batch of data, an input data received at a hardware accelerator implementing a machine learning model, (Fig. 2 and Page 368, Section 3; “Fig. 2 shows the pseudo-code for the RBM training algorithm. The sampling between the hidden and visible layers is followed by a slight modification in the parameters (controlled by the learning rate) and repeated for each data batch in the training set, and for as many epochs as is necessary to reach convergence” teaches partitioning the data from the training set into batches; Page 369, Section 4.1: “Weights and neuron data are distributed across the groups. Each group processes a different portion of the network. Nearly all computations take place in these groups.” teaches that weights are also partitioned into groups, therefore the input data (training data and weights) are partitioned into batches; Page 367, Section 1: “We describe an FPGA-based system that accelerates the training of DBN networks. Our implementation uses a single FPGA, but we have produced an architecture that we believe will be able to scale to many FPGAs, and thus will allow the training of larger DBNs.” teaches an accelerator that uses FPGAs for deep belief networks (machine learning model)) 
the input data comprising a continuous stream of data samples, (Page 370, Section 4.2: “However, if the network size scales to a point that the weight matrix no longer fits on-chip, then the weight matrix has to stream in from off-chip memory.” suggests that weights (input data) can be streamed into the accelerator)
and the input data being partitioned based at least on a resource constraint of the hardware accelerator; (Page 372, Section 6: “Thus, weights will need to be streamed in from external storage such as DRAM. To tackle bandwidth issues, a batch size of at least 16 will be used. This enables weights to be reused for multiple data vectors within the batch to reduce bandwidth, at the cost of slightly increased number of iterations to converge. Our calculations show that for a batch size of 16, only 256 bits of weight data are needed every cycle, which is feasible with a DDR2 interface.” teaches that the batches of data are partitioned based on constraints such as bandwidth and memory size)
training the machine learning model by at least performing a real time update of [parameters of a Restricted Boltzmann machine] associated with the machine learning model, the [parameters of a Restricted Boltzmann machine] being updated by at least processing, by the hardware accelerator, the first batch of data before the second batch of data; and (Fig. 2: 

[Image: media_image1.png (Kim, Fig. 2)]

teaches training the restricted Boltzmann machine (machine learning model) by updating the parameters and sampling data from batches, and further teaches that the system samples data from hid_batch_0 (first batch) before sampling data from vis_batch_1 (second batch))

applying the machine learning model in parallel with the training of the machine learning model, (Page 369, Section 4.1: “As shown in Fig. 4, the RBM module is segmented into several groups, each consisting of an array of multipliers, adders, embedded RAM, and logic components. Weights and neuron data are distributed across the groups. Each group processes a different portion of the network. Nearly all computations take place in these groups. The rationale for such partitioning is that wire delay increases as semiconductor technology scales, so the wire delay becomes the performance bottleneck if the placement and routing is not performed efficiently. Localization of communication is an efficient way, and possibly the only way, to fully exploit all the parallelism in modern FPGAs. Signals that must communicate with other groups are appropriately buffered.” teaches that the Restricted Boltzmann machine module is segmented into a plurality of groups, that each group processes a different portion of the network, and that partitioned weights are used to update the parameters of the RBM to exploit the parallelism in FPGAs; Page 370, Section 4.2: “This suggests that each row of W^T and each row of H should be placed in separate on-chip RAMs so that all of these elements can be read simultaneously, as shown in Fig. 5b.” teaches that the weights of each group can be accessed simultaneously, therefore updating parameters (training) using a group of weights is performed simultaneously with generating an output (applying the machine learning model))

[Image: media_image2.png]


Kim does not appear to explicitly teach: 
that the updated parameters are a probability density function
the machine learning model being applied to generate, based at least on the updated probability density function, an output comprising a probability of encountering a data value.

However, Dahl teaches: 
updating a probability density function as part of updating a Restricted Boltzmann machine (Page 2, Section 2: 

[Image: media_image3.png (excerpt of Dahl, Page 2, Section 2)]

teaches parametrizing the energy of the RBM into a probability density function with bias vectors and weights, therefore a change (update) to the weights of the RBM will result in an update to the probability density function)
the machine learning model being applied to generate, based at least on the updated probability density function, an output comprising a probability of encountering a data value. (Page 2, Section 2: “An RBM defines a distribution over a binary visible vector v of dimensionality V and a layer h of H binary hidden units through an energy… This yields simple conditional distributions:”

[Image: media_image4.png (conditional distributions, Dahl, Page 2, Section 2)]

teaches that restricted Boltzmann machines yield conditional probability distributions, therefore an output is generated comprising the probability of a data value)
Kim and Dahl are analogous art because they are directed to Restricted Boltzmann machines. 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to update the Restricted Boltzmann machine of Kim using the update to the probability density function of Dahl with a motivation to allow for efficient Gibbs sampling (a Markov Chain Monte Carlo algorithm) of each layer (Dahl, Page 2, Section 2). 
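
For illustration only, the relationship relied upon above (a change to the weights changes the probabilities the RBM produces) can be sketched as follows. The sketch assumes the standard binary RBM conditionals, p(h_j = 1 | v) = sigmoid(c_j + sum_i W[j][i] v_i) and p(v_i = 1 | h) = sigmoid(b_i + sum_j W[j][i] h_j); the function names and the index layout of W are hypothetical and are not taken from Kim or Dahl.

```python
import math


def sigmoid(x):
    """Logistic function used by the standard binary RBM conditionals."""
    return 1.0 / (1.0 + math.exp(-x))


def hidden_given_visible(v, W, c):
    """Return p(h_j = 1 | v) for every hidden unit j.

    W[j][i] is the (assumed) coupling between hidden unit j and visible
    unit i, and c[j] is the hidden bias.
    """
    return [
        sigmoid(c[j] + sum(W[j][i] * v[i] for i in range(len(v))))
        for j in range(len(c))
    ]


def visible_given_hidden(h, W, b):
    """Return p(v_i = 1 | h) for every visible unit i; b[i] is the visible bias."""
    return [
        sigmoid(b[i] + sum(W[j][i] * h[j] for j in range(len(h))))
        for i in range(len(b))
    ]


# With zero weights the model is indifferent; after a weight change the
# probability of the same hidden unit being on is different.
before = hidden_given_visible([1, 0], [[0.0, 0.0]], [0.0])
after = hidden_given_visible([1, 0], [[1.0, 0.0]], [0.0])
print(before, after)
```

Under these assumptions, each returned value is a probability of encountering a particular unit state, and it moves whenever the weights are updated.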


Regarding Claim 2, 
The combination of Kim and Dahl teaches The system of claim 1, 
Kim further teaches: 
wherein the [parameter of the RBM] is updated in real time such that the updating of the [parameter of the RBM] is performed at a same time and/or substantially at the same time as the generation of the output... (Page 369, Section 4.1: “As shown in Fig. 4, the RBM module is segmented into several groups, each consisting of an array of multipliers, adders, embedded RAM, and logic components. Weights and neuron data are distributed across the groups. Each group processes a different portion of the network. Nearly all computations take place in these groups. The rationale for such partitioning is that wire delay increases as semiconductor technology scales, so the wire delay becomes the performance bottleneck if the placement and routing is not performed efficiently. Localization of communication is an efficient way, and possibly the only way, to fully exploit all the parallelism in modern FPGAs. Signals that must communicate with other groups are appropriately buffered.” teaches partitioning weights that are used to update the parameters of the RBM into multiple groups to exploit the parallelism in FPGAs; Page 370, Section 4.2: “This suggests that each row of W^T and each row of H should be placed in separate on-chip RAMs so that all of these elements can be read simultaneously, as shown in Fig. 5b.” teaches that the weights of each group can be accessed simultaneously, therefore updating parameters using a group of weights can be performed simultaneously with generating an output)

[Image: media_image2.png]


Dahl further teaches: 
updating a probability density function as part of updating a Restricted Boltzmann machine (Page 2, Section 2: 

[Image: media_image3.png (excerpt of Dahl, Page 2, Section 2)]

teaches parametrizing the energy of the RBM into a probability density function with bias vectors and weights, therefore a change (update) to the weights of the RBM will result in an update to the probability density function)

generating, based at least on the updated probability density function, an output comprising a probability of encountering a data value. (Page 2, Section 2: “An RBM defines a distribution over a binary visible vector v of dimensionality V and a layer h of H binary hidden units through an energy… This yields simple conditional distributions:”

[Image: media_image4.png (conditional distributions, Dahl, Page 2, Section 2)]

teaches that restricted Boltzmann machines yield conditional probability distributions, therefore an output is generated comprising the probability of a data value)

The combination of claim 1 has already incorporated the probability density function and generating of a probability as an output, therefore already incorporating the details of the probability density function required by Claim 2.

Regarding Claim 3, 
The combination of Kim and Dahl teaches The system of claim 1,
Kim further teaches: 
wherein each data sample comprises a plurality of data values corresponding to a plurality of features, and wherein the first batch of data and the second batch of data each comprise some but not all of the plurality of features (Page 368, Section 3: “RBMs, introduced in [1], are probabilistic generative models that are able to automatically extract features of their input data using a completely unsupervised learning algorithm. RBMs consist of a layer of hidden and a layer of visible neurons with connection strengths between hidden and visible neurons represented by an array of weights (see Fig. 1). To train an RBM, samples from a training set are used as input to the RBM through the visible neurons, and then the network alternatively samples back and forth between the visible and hidden neurons” teaches that RBMs extract features from input data; Page 368, Section 3: “Fig. 2 shows the pseudo-code for the RBM training algorithm. The sampling between the hidden and visible layers is followed by a slight modification in the parameters (controlled by the learning rate) and repeated for each data batch in the training set, and for as many epochs as is necessary to reach convergence” teaches partitioning the data from the training set into batches, therefore each batch of input data is used by the RBM to generate a portion of the features)
 
Regarding Claim 4, 
The combination of Kim and Dahl teaches The system of claim 1,
Kim further teaches:
wherein the first batch of data and the second batch of data each comprise some but not all of the data samples included in the input data. (Page 368, Section 3: “Fig. 2 shows the pseudo-code for the RBM training algorithm. The sampling between the hidden and visible layers is followed by a slight modification in the parameters (controlled by the learning rate) and repeated for each data batch in the training set, and for as many epochs as is necessary to reach convergence” teaches partitioning the data from the training set into batches, therefore each batch of input data contains a portion of the training data)
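
As a rough sketch of the batch-by-batch training described in the quoted passage (sample hidden units, sample visible units, adjust the parameters by a learning rate, repeat per batch and per epoch), the following assumes a one-step contrastive divergence (CD-1) update with biases omitted. The names `cd1_update` and `train`, the learning rate, and the choice of CD-1 are assumptions made for illustration; Fig. 2 of Kim is not reproduced here.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample(probs, rng):
    """Draw a binary state for each unit from its activation probability."""
    return [1 if rng.random() < p else 0 for p in probs]


def hidden_probs(v, W):
    return [sigmoid(sum(W[j][i] * v[i] for i in range(len(v)))) for j in range(len(W))]


def visible_probs(h, W):
    return [sigmoid(sum(W[j][i] * h[j] for j in range(len(h)))) for i in range(len(W[0]))]


def cd1_update(batch, W, rng, lr=0.1):
    """Apply one assumed CD-1 weight update computed from a single batch."""
    n_hid, n_vis = len(W), len(W[0])
    grad = [[0.0] * n_vis for _ in range(n_hid)]
    for v0 in batch:
        ph0 = hidden_probs(v0, W)                 # positive phase
        v1 = sample(visible_probs(sample(ph0, rng), W), rng)
        ph1 = hidden_probs(v1, W)                 # negative phase
        for j in range(n_hid):
            for i in range(n_vis):
                grad[j][i] += ph0[j] * v0[i] - ph1[j] * v1[i]
    scale = lr / len(batch)
    return [[W[j][i] + scale * grad[j][i] for i in range(n_vis)] for j in range(n_hid)]


def train(batches, W, epochs=1, seed=0):
    """Process batches strictly in order: the first batch before the second."""
    rng = random.Random(seed)
    for _ in range(epochs):
        for batch in batches:
            W = cd1_update(batch, W, rng)
    return W
```

Each batch here holds some but not all of the training samples, and the weights returned after one batch are the starting point for the next.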

Regarding Claim 5, 
The combination of Kim and Dahl teaches The system of claim 1,
Kim further teaches:
wherein the machine learning model comprises a probabilistic machine learning model configured to perform an inference task (Page 368, Section 3: “RBMs, introduced in [1], are probabilistic generative models that are able to automatically extract features of their input data using a completely unsupervised learning algorithm. RBMs consist of a layer of hidden and a layer of visible neurons with connection strengths between hidden and visible neurons represented by an array of weights (see Fig. 1)” teaches that RBMs are a probabilistic machine learning model used to extract features (perform an inference task))

Regarding Claim 6, 
The combination of Kim and Dahl teaches The system of claim 5,
Kim further teaches:
wherein the probabilistic machine learning model comprises a Bayesian network and/or a belief network (Page 368, Section 3: “The motivation for using RBMs is that when stacked together in a hierarchical fashion, with the hidden units of one RBM used as the visible inputs to the next higher RBM - which describes the architecture of a DBN [1] - one can automatically learn "patterns-of-patterns" of the training set.” and Page 367, Abstract: “Restricted Boltzmann Machines (RBMs) - the building block for newly popular Deep Belief Networks (DBN” teaches that Restricted Boltzmann machines form a deep belief network)

Regarding Claim 7, 
The combination of Kim and Dahl teaches The system of claim 1,
Kim further teaches:
wherein the hardware accelerator processes the first batch of data and/or the second batch of data by at least applying, to the first batch of data and/or the second batch of data… (Page 368, Section 3; “Fig. 2 shows the pseudo-code for the RBM training algorithm. The sampling between the hidden and visible layers is followed by a slight modification in the parameters (controlled by the learning rate) and repeated for each data batch in the training set, and for as many epochs as is necessary to reach convergence” teaches partitioning the data from the training set into batches)

Dahl further teaches:
applying…one or more Markov Chain Monte Carlo techniques (Page 1, Abstract: “The conventional approach to training RBMs on word observations is limited because it requires sampling the states of K-way softmax visible units during block Gibbs updates, an operation that takes time linear in K. In this work, we address this issue with a more general class of Markov chain Monte Carlo operators on the visible units, yielding updates with computational complexity independent of K.” and Page 3, Section 4: “To achieve this, instead of sampling exactly from the conditionals p(v^(i) | h) within the Markov chain, we use a small number of iterations of Metropolis–Hastings (M–H) sampling. Let q(v̂^(i) ← v^(i)) be a proposal distribution for group i. The following stochastic operator leaves p(v, h) invariant” teaches using Metropolis-Hastings (a Markov Chain Monte Carlo method) for sampling to update the RBM)

Kim and Dahl are analogous art because they are directed to Restricted Boltzmann machines. 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to sample data for training the Restricted Boltzmann machine of Kim using the Metropolis-Hastings algorithm of Dahl with a motivation to efficiently perform updates for large multinomial distributions (Dahl, Page 3 Section 3).
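
For illustration, a single Metropolis–Hastings step over a K-way discrete variable can be sketched as follows. The sketch assumes a proposal that does not depend on the current state (an independence sampler) and an unnormalized target; the names `mh_step` and `run_chain` are hypothetical, and the acceptance rule shown is the generic M–H rule rather than a reproduction of the operator in Dahl.

```python
import random


def mh_step(current, unnorm_target, proposal_probs, rng):
    """One Metropolis-Hastings step over states 0..K-1.

    unnorm_target[k] is proportional to the target probability of state k;
    proposal_probs[k] is the probability of proposing state k. The current
    state is assumed to have non-zero target and proposal probability.
    """
    candidate = rng.choices(range(len(proposal_probs)), weights=proposal_probs)[0]
    ratio = (unnorm_target[candidate] * proposal_probs[current]) / (
        unnorm_target[current] * proposal_probs[candidate]
    )
    # Accept the candidate with probability min(1, ratio); otherwise stay put.
    return candidate if rng.random() < min(1.0, ratio) else current


def run_chain(steps, unnorm_target, proposal_probs, seed=0, start=0):
    """Run the chain and return how often each state was visited."""
    rng = random.Random(seed)
    counts = [0] * len(unnorm_target)
    state = start
    for _ in range(steps):
        state = mh_step(state, unnorm_target, proposal_probs, rng)
        counts[state] += 1
    return counts
```

Because only a ratio of target values is needed, the normalizing constant of the distribution never has to be computed, which is the property that makes such operators attractive for large K.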

Regarding Claim 8, 
The combination of Kim and Dahl teaches The system of claim 7,
Kim further teaches:
wherein the first batch of data and/or the second batch of data each comprise a matrix, and (Fig. 2 teaches that the batches of data contain matrices)


Dahl further teaches:
wherein the application of the one or more Markov Chain Monte Carlo techniques includes performing a sequence of dot product operations between two or more matrices comprising the first batch of data and/or the second batch of data. (Page 1, Abstract: “The conventional approach to training RBMs on word observations is limited because it requires sampling the states of K-way softmax visible units during block Gibbs updates, an operation that takes time linear in K. In this work, we address this issue with a more general class of Markov chain Monte Carlo operators on the visible units, yielding updates with computational complexity independent of K.” and Page 3, Section 4: “To achieve this, instead of sampling exactly from the conditionals p(v^(i) | h) within the Markov chain, we use a small number of iterations of Metropolis–Hastings (M–H) sampling. Let q(v̂^(i) ← v^(i)) be a proposal distribution for group i. The following stochastic operator leaves p(v, h) invariant…”
[Image: media_image5.png (excerpt of Dahl, Page 3, Section 4)]

 teaches using Metropolis-Hastings (a Markov Chain Monte Carlo method) for sampling to update the RBM and that the Metropolis-Hastings algorithm includes dot product operations between matrices)

The combination of claim 7 has already incorporated the Metropolis-Hastings algorithm (Markov Chain Monte Carlo method), therefore already incorporating the details of the Markov Chain Monte Carlo method required by Claim 8. 

Regarding Claim 9, 
The combination of Kim and Dahl teaches The system of claim 8,
Dahl further teaches: 
…perform the sequence of dot product operations [with the application of the Markov Chain Monte Carlo technique] (Page 1, Abstract: “The conventional approach to training RBMs on word observations is limited because it requires sampling the states of K-way softmax visible units during block Gibbs updates, an operation that takes time linear in K. In this work, we address this issue with a more general class of Markov chain Monte Carlo operators on the visible units, yielding updates with computational complexity independent of K.” and Page 3, Section 4: “To achieve this, instead of sampling exactly from the conditionals p(v^(i) | h) within the Markov chain, we use a small number of iterations of Metropolis–Hastings (M–H) sampling. Let q(v̂^(i) ← v^(i)) be a proposal distribution for group i. The following stochastic operator leaves p(v, h) invariant…”
[Image: media_image5.png (excerpt of Dahl, Page 3, Section 4)]

 teaches using Metropolis-Hastings (a Markov Chain Monte Carlo method) for sampling to update the RBM and that the Metropolis-Hastings algorithm includes dot product operations between matrices)

The combination of claim 7 has already incorporated the Metropolis-Hastings algorithm (Markov Chain Monte Carlo method), therefore already incorporating the details of performing the sequence of dot product operations required by Claim 9. 

The combination of Kim and Dahl does not appear to explicitly teach: 
wherein the hardware accelerator includes a tree adder configured to perform the sequence of dot product operations by at least performing, in parallel, at least a portion of a plurality of addition operations and/or multiplication operations comprising the sequence of dot product operations

However, Kim teaches: 
wherein the hardware accelerator includes a tree adder… by at least performing, in parallel, at least a portion of a plurality of addition operations and/or multiplication operations… (Fig. 5b: 

[Image: media_image6.png (Kim, Fig. 5b)]

and Page 370, Section 4.2: “Matrix multiplication occurs in all three phases: the hidden and visible neuron sampling phases and the weight update phase. Thus, the input for the multiplication operations, which are the weights and the neurons, should reside in the embedded memories distributed across the FPGA. Although locating the inputs close to the multipliers is desirable, distribution of the weights is non-trivial due to a transpose operation that occurs during the visible neuron sampling phase.” teaches that the hardware accelerator includes a Tree Adder that performs matrix multiplication (comprises dot product operations) in parallel)

Kim and Dahl are analogous art because they are directed to Restricted Boltzmann machines. 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to perform the matrix multiplications associated with the Markov Chain Monte Carlo technique of Kim/Dahl using the tree adder of Kim with a motivation to efficiently perform matrix multiplication (Kim, Page 370).
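
For illustration, the way a tree adder evaluates a dot product can be sketched in software: all element-wise products are formed first, then summed pairwise level by level, so that the additions within one level are independent of one another and could run in parallel in hardware. The function names are hypothetical, and the sketch models only the order of operations, not the circuit of Kim, Fig. 5b.

```python
def tree_add(values):
    """Sum values by pairwise reduction, as an adder tree would.

    Each pass of the loop corresponds to one level of the tree; the
    additions inside a level do not depend on each other.
    """
    level = list(values)
    if not level:
        return 0
    while len(level) > 1:
        nxt = [level[k] + level[k + 1] for k in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd element passes through to the next level
        level = nxt
    return level[0]


def dot(a, b):
    """Dot product: independent multiplications followed by a tree reduction."""
    if len(a) != len(b):
        raise ValueError("vectors must have equal length")
    return tree_add([x * y for x, y in zip(a, b)])


def matvec(M, x):
    """Matrix-vector product as one dot product per row."""
    return [dot(row, x) for row in M]


print(dot([1, 2, 3], [4, 5, 6]))  # 4 + 10 + 18 = 32
```

For n products the reduction takes about log2(n) levels rather than n - 1 sequential additions, which is the efficiency the stated motivation refers to.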


Regarding Claim 10, 
The combination of Kim and Dahl teaches The system of claim 8,
Kim further teaches:
wherein the probability of encountering the data value changes upon processing the second batch of data, and wherein the output includes a first probability of encountering the data value given the first batch of data and a second probability of encountering the data value given the second batch of data. (Fig. 2 teaches training the restricted Boltzmann machine by updating the parameters and sampling data from batches, therefore the RBM, including its probability distributions, is updated upon processing each batch of data)

Regarding Claim 11, 
The combination of Kim and Dahl teaches The system of claim 1,
Kim further teaches:
wherein the hardware accelerator comprises one or more application specific integrated circuits (ASICs) and/or field programmable gate arrays (FPGAs). (Page 368, Section 4: “We have implemented a Restricted Boltzmann Machine on a development board that features an Altera Stratix III FPGA with a DDR2 SDRAM SODIMM interface. The Stratix III EP3SL340 has 135,000 ALMs (Adaptive Logic Modules), 16,272 kbits of embedded RAM and 288 embedded 18x18 multipliers. With this number of multipliers, we are capable of processing approximately 256 neurons per clock cycle. The FPGA board can also be connected to up to 19 other boards in a stack via a high speed interface – this will be used in future work to scale up the size of the DBNs that can be processed.” teaches that the accelerator can include one or more FPGAs)

Regarding Claim 17, 
The combination of Kim and Dahl teaches The system of claim 1,
Kim further teaches:
wherein the partitioning of the input data is further based at least on a dimensionality of the input data… (Page 372, Section 6: “The weight matrix will no longer fit in on-chip memory since it scales as O(n^2) with the number of neurons. Thus, weights will need to be streamed in from external storage such as DRAM. To tackle bandwidth issues, a batch size of at least 16 will be used.” teaches that partitioning the input data into batches depends on the size (dimensionality) of the input data)
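
For illustration, the dependence of batch size on a bandwidth constraint can be sketched as follows. The sketch assumes that each streamed weight is reused once per data vector in the batch, that 256 multiply operations are issued per cycle, and that weights are 16 bits wide; the 16-bit width and all function names are assumptions, chosen because they are consistent with the 256 bits per cycle at a batch size of 16 quoted from Kim, Page 372.

```python
import math


def weight_bits_per_cycle(macs_per_cycle, weight_bits, batch_size):
    """Weight bandwidth needed when each streamed weight serves a whole batch."""
    return macs_per_cycle * weight_bits / batch_size


def min_batch_size(macs_per_cycle, weight_bits, bandwidth_bits_per_cycle):
    """Smallest batch size that keeps weight traffic within the bandwidth budget."""
    return math.ceil(macs_per_cycle * weight_bits / bandwidth_bits_per_cycle)


def partition(samples, batch_size):
    """Split the samples into consecutive batches; the last may be smaller."""
    return [samples[k:k + batch_size] for k in range(0, len(samples), batch_size)]


# Assumed figures: 256 multiplies per cycle, 16-bit weights, 256-bit budget.
size = min_batch_size(256, 16, 256)
print(size, weight_bits_per_cycle(256, 16, size))
print(partition(list(range(5)), 2))
```

Under these assumed figures a larger network does not change the per-cycle weight traffic, but it does determine whether the weights must be streamed at all, which is where the batch size constraint begins to apply.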

Regarding Claim 18, 
The combination of Kim and Dahl teaches The system of claim 1,
Kim further teaches:
dividing, into a first portion of data and a second portion of data, the first batch of data; and (Fig. 2 and Page 368, Section 3; “Fig. 2 shows the pseudo-code for the RBM training algorithm. The sampling between the hidden and visible layers is followed by a slight modification in the parameters (controlled by the learning rate) and repeated for each data batch in the training set, and for as many epochs as is necessary to reach convergence” teaches partitioning the data from the training set into batches; Page 369, Section 4.1: “Weights and neuron data are distributed across the groups. Each group processes a different portion of the network. Nearly all computations take place in these groups.” teaches that weights are also partitioned into groups, therefore the input data (training data and weights) are partitioned into batches)
storing the first portion of data and the second portion of data in different memory blocks to at least enable the first portion of data and the second portion of data to be accessed simultaneously for processing by the hardware accelerator during the update of the probability density function. (Page 372, Section 6: “The weight matrix will no longer fit in on-chip memory since it scales as O(n^2) with the number of neurons. Thus, weights will need to be streamed in from external storage such as DRAM.” teaches that the weights of the batches may need to be stored in a different DRAM module (different memory block) to enable streaming the weights into the accelerator; Page 369, Section 4.1 and Figure 4: “As shown in Fig. 4, the RBM module is segmented into several groups, each consisting of an array of multipliers, adders, embedded RAM, and logic components. Weights and neuron data are distributed across the groups. Each group processes a different portion of the network… Localization of communication is an efficient way, and possibly the only way, to fully exploit all the parallelism in modern FPGAs. Signals that must communicate with other groups are appropriately buffered.” and Page 370, Section 4.2: “This suggests that each row of W^T and each row of H should be placed in separate on-chip RAMs so that all of these elements can be read simultaneously, as shown in Fig. 5b.” teaches that different partitions of weights (data) can be accessed by the accelerator in parallel (simultaneously))

[Image: media_image2.png]


Regarding Claim 21, 
Kim teaches: 
A computer-implemented method, comprising: (Page 368, Section 4.1: “The system consists of a soft processor, a DDR2 SDRAM controller, and an RBM module. The soft processor used in our design is the Altera Nios II operating at a clock frequency of 100 MHz. The processor functions as the interface between the user and the RBM module via JTAGUART2 . The CPU also initializes the weights, reads in the visible neurons to SDRAM, initiates the algorithm, and returns the results to the user” teaches a computer based implementation, including a processor and memory)
partitioning, into a first batch of data and a second batch of data, an input data received at a hardware accelerator implementing a machine learning model, (Fig. 2 and Page 368, Section 3; “Fig. 2 shows the pseudo-code for the RBM training algorithm. The sampling between the hidden and visible layers is followed by a slight modification in the parameters (controlled by the learning rate) and repeated for each data batch in the training set, and for as many epochs as is necessary to reach convergence” teaches partitioning the data from the training set into batches; Page 369, Section 4.1: “Weights and neuron data are distributed across the groups. Each group processes a different portion of the network. Nearly all computations take place in these groups.” teaches that weights are also partitioned into groups, therefore the input data (training data and weights) are partitioned into batches; Page 367, Section 1: “We describe an FPGA-based system that accelerates the training of DBN networks. Our implementation uses a single FPGA, but we have produced an architecture that we believe will be able to scale to many FPGAs, and thus will allow the training of larger DBNs.” teaches an accelerator that uses FPGAs for deep belief networks (machine learning model)) 
the input data comprising a continuous stream of data samples, (Page 370, Section 4.2: “However, if the network size scales to a point that the weight matrix no longer fits on-chip, then the weight matrix has to stream in from off-chip memory.” suggests that weights (input data) can be streamed into the accelerator)
and the input data being partitioned based at least on a resource constraint of the hardware accelerator; (Page 372, Section 6: “Thus, weights will need to be streamed in from external storage such as DRAM. To tackle bandwidth issues, a batch size of at least 16 will be used. This enables weights to be reused for multiple data vectors within the batch to reduce bandwidth, at the cost of slightly increased number of iterations to converge. Our calculations show that for a batch size of 16, only 256 bits of weight data are needed every cycle, which is feasible with a DDR2 interface.” teaches that the batches of data are partitioned based on constraints such as bandwidth and memory size)
training the machine learning model by at least performing a real time update of [parameters of a Restricted Boltzmann machine] associated with the machine learning model, the [parameters of a Restricted Boltzmann machine] being updated by at least processing, by the hardware accelerator, the first batch of data before the second batch of data; and (Fig. 2: 
[Image: media_image1.png (Kim, Fig. 2)]

teaches training the restricted Boltzmann machine by updating the parameters and sampling data from batches; teaches that the system samples data from hid_batch_0 (first batch) before sampling data from vis_batch_1 (second batch); Page 370, Section 4.2: “However, if the network size scales to a point that the weight matrix no longer fits on-chip, then the weight matrix has to stream in from off-chip memory… Although our current RBM implementation also assumes that the weight matrix fits on-chip, our approach solves the transpose problem in a way that scales to large networks where the weight matrix is stored on off chip DRAM.” teaches that the accelerator can stream weights used to update the RBM (and its associated probability density function) for large networks, therefore the update to the probability density function and determination of a probability value can be performed in real time)

applying the machine learning model in parallel with the training of the machine learning model, (Page 369, Section 4.1: “As shown in Fig. 4, the RBM module is segmented into several groups, each consisting of an array of multipliers, adders, embedded RAM, and logic components. Weights and neuron data are distributed across the groups. Each group processes a different portion of the network. Nearly all computations take place in these groups. The rationale for such partitioning is that wire delay increases as semiconductor technology scales, so the wire delay becomes the performance bottleneck if the placement and routing is not performed efficiently. Localization of communication is an efficient way, and possibly the only way, to fully exploit all the parallelism in modern FPGAs. Signals that must communicate with other groups are appropriately buffered.” teaches that the Restricted Boltzmann machine module is segmented into a plurality of groups and that each group processes a different portion of the network and partitioned weights are used to update the parameters of the RBM to exploit the parallelism in FPGAs; Page 370, Section 4.2: “This suggests that each row of W^T and each row of H should be placed in separate on-chip RAMs so that all of these elements can be read simultaneously, as shown in Fig. 5b.” teaches that the weights of each group can be accessed simultaneously, therefore updating parameters (training) using a group of weights is performed simultaneously with generating an output (applying the machine learning model))

[Image: media_image2.png]
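For illustration only, the batch-wise processing mapped to Kim above can be sketched as follows. This is a minimal pure-Python sketch and not Kim's FPGA implementation: the names `partition_batches`, `cd1_update`, and `train`, the learning rate, the use of mean-field probabilities in place of sampled states, and the omission of bias terms are all assumptions made for brevity. Only the batch size of 16 comes from the passage quoted from Kim, Section 6.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def partition_batches(data, batch_size):
    """Split the input vectors into consecutive batches of at most batch_size."""
    return [data[i:i + batch_size] for i in range(0, len(data), batch_size)]


def hidden_probs(W, v):
    # W[j][i] is the weight between hidden unit j and visible unit i.
    return [sigmoid(sum(W[j][i] * v[i] for i in range(len(v))))
            for j in range(len(W))]


def visible_probs(W, h):
    n_vis = len(W[0])
    return [sigmoid(sum(W[j][i] * h[j] for j in range(len(W))))
            for i in range(n_vis)]


def cd1_update(W, batch, lr=0.1):
    """One contrastive-divergence (CD-1) step.

    The same weights are reused for every vector in the batch, which is
    the bandwidth-saving idea described in the quoted passage.
    """
    n_hid, n_vis = len(W), len(W[0])
    grad = [[0.0] * n_vis for _ in range(n_hid)]
    for v0 in batch:
        h0 = hidden_probs(W, v0)   # positive phase
        v1 = visible_probs(W, h0)  # reconstruction
        h1 = hidden_probs(W, v1)   # negative phase
        for j in range(n_hid):
            for i in range(n_vis):
                grad[j][i] += h0[j] * v0[i] - h1[j] * v1[i]
    return [[W[j][i] + lr * grad[j][i] / len(batch) for i in range(n_vis)]
            for j in range(n_hid)]


def train(W, data, batch_size=16):
    """Process the batches strictly in order: first batch, then second, and so on."""
    for batch in partition_batches(data, batch_size):
        W = cd1_update(W, batch)
    return W
```

In this sketch the weights are updated after each batch, so the second batch is always processed with parameters that already reflect the first batch.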



Kim does not appear to explicitly teach: 
that the updated parameters are a probability density function
the machine learning model being applied to generate, based at least on the updated probability density function, an output comprising a probability of encountering a data value.

However, Dahl teaches: 
updating a probability density function as part of updating a Restricted Boltzmann machine (Page 2, Section 2: 

[Image: media_image3.png, excerpt from Dahl, Page 2, Section 2]

teaches parametrizing the energy of the RBM into a probability density function with bias vectors and weights, therefore a change (update) to the weights of the RBM will result in an update to the probability density function)
the machine learning model being applied to generate, based at least on the updated probability density function, an output comprising a probability of encountering a data value. (Page 2, Section 2: “An RBM defines a distribution over a binary visible vector v of dimensionality V and a layer h of H binary hidden units through an energy… This yields simple conditional distributions:”

[Image: media_image4.png, conditional distributions from Dahl, Page 2, Section 2]

teaches that restricted Boltzmann machines yield conditional probability distributions, therefore an output is generated comprising the probability of a data value)
Kim and Dahl are analogous art because they are directed to restricted Boltzmann machines. 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to update the Restricted Boltzmann machine of Kim using the update to the probability density function of Dahl with a motivation to allow for efficient Gibbs sampling (a Markov Chain Monte Carlo algorithm) of each layer (Dahl, Page 2, Section 2).
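For illustration only, the relationship relied upon above (the weights and biases parametrize an energy, the energy defines a probability distribution, and the conditionals follow from it) can be sketched for a very small binary RBM. The sketch uses the commonly stated form E(v, h) = -b·v - c·h - hᵀWv with P(v, h) = exp(-E(v, h)) / Z; this form, the function names, and the brute-force enumeration are assumptions chosen for clarity and are not reproduced from the cited figures.

```python
import math
from itertools import product


def energy(v, h, W, b, c):
    """E(v, h) = -b.v - c.h - h^T W v for binary vectors v and h."""
    vis = sum(b[i] * v[i] for i in range(len(v)))
    hid = sum(c[j] * h[j] for j in range(len(h)))
    inter = sum(h[j] * W[j][i] * v[i]
                for j in range(len(h)) for i in range(len(v)))
    return -(vis + hid + inter)


def partition_function(W, b, c):
    """Z: sum of exp(-E) over every joint state (feasible only for tiny models)."""
    n_hid, n_vis = len(W), len(W[0])
    return sum(math.exp(-energy(v, h, W, b, c))
               for v in product((0, 1), repeat=n_vis)
               for h in product((0, 1), repeat=n_hid))


def joint_prob(v, h, W, b, c):
    return math.exp(-energy(v, h, W, b, c)) / partition_function(W, b, c)


def visible_prob(v, W, b, c):
    """Marginal probability of encountering the visible vector v."""
    n_hid = len(W)
    return sum(joint_prob(v, h, W, b, c)
               for h in product((0, 1), repeat=n_hid))


def p_hidden_given_visible(j, v, W, c):
    """Closed-form conditional P(h_j = 1 | v) = sigmoid(c_j + sum_i W[j][i] v_i)."""
    activation = c[j] + sum(W[j][i] * v[i] for i in range(len(v)))
    return 1.0 / (1.0 + math.exp(-activation))
```

Because `visible_prob` depends on `W`, `b`, and `c`, any change to those parameters changes the distribution it returns, which is the sense in which a weight update is treated above as an update to the probability density function.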

Regarding Claim 40, 
This claim recites A non-transitory computer readable medium storing instructions…, which performs a plurality of operations as recited by the method of claim 21, and has limitations that are similar to the method of claim 21, thus is rejected with the same rationale applied against claim 21.

Claims 12-14 are rejected under 35 U.S.C. 103 as being unpatentable over Kim in view of Dahl, further in view of Dahl-Ranzato et al. (“Phone Recognition with the Mean-Covariance Restricted Boltzmann Machine”)

Regarding Claim 12, 
The combination of Kim and Dahl teaches The system of claim 1,
Dahl further teaches: 
wherein the probability density function includes a predictive function, (Page 2, Section 2: 
[Image: media_image7.png, excerpt from Dahl, Page 2, Section 2]

teaches that the probability density function contains predictive function Z)
and wherein the prior distribution of the input data indicates the probability of encountering the data value without taking into account the first batch of data and/or the second batch of data (Page 4, Section 4.2: “We analytically computed the distributions implied by iterations of the M–H operator, assuming the initial state was drawn according to ∏_i q(v^(i)). As this computation requires the instantiation of n 100k × 100k matrices, it cannot be done at training time, but was done offline for analysis purposes. Each application of Metropolis–Hastings results in a new distribution converging to the target (true) conditional” teaches that the Metropolis Hastings algorithm updates the distribution using training data, therefore the prior distribution is associated with a probability value before the training data is received)
The combination of claim 1 has already incorporated the predictive function, therefore already incorporating the details of the predictive function and prior distribution required by Claim 12. 

The combination of Kim and Dahl does not appear to explicitly teach: 
wherein the predictive function is associated with a mean and a covariance of a prior distribution of the input data,

However, Dahl-Ranzato teaches: 
wherein the predictive function is associated with a mean and a covariance of a prior distribution of the input data, (Page 3, Section 3: “Another option for learning to extract binary features from real-valued data that has enjoyed success in vision applications is the mean-covariance RBM (mcRBM), first introduced in [10] and [6]. The mcRBM has two groups of hidden units: mean units and precision units. Without the precision units, the mcRBM would be identical to a GRBM. With only the precision units, we have what we will call the “cRBM”, following the terminology in [6]. The precision units are designed to enforce smoothness constraints in the data, but when one of these constraints is seriously violated, it is removed by turning off the precision unit. The set of active precision units therefore specifies a sample-specific covariance matrix.” teaches that the RBM (and associated predictive function) is a mean-covariance RBM that uses both the mean and covariance to produce conditional distributions)
Kim, Dahl, and Dahl-Ranzato are analogous art because they are directed to Restricted Boltzmann machines. 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to replace the generic RBM of Kim/Dahl with a mean-covariance RBM of Dahl-Ranzato with a motivation to obtain a much more representationally efficient and powerful way of modeling data (Dahl-Ranzato, Page 1 Abstract).

Regarding Claim 13, 
The combination of Kim, Dahl, and Dahl-Ranzato teaches The system of claim 12,
Dahl-Ranzato further teaches: 
wherein the update to the probability density function comprises updating, based at least on the first batch of data and/or the second batch of data, the mean and/or the covariance of the prior distribution. (Page 3, Section 3: “Another option for learning to extract binary features from real-valued data that has enjoyed success in vision applications is the mean-covariance RBM (mcRBM), first introduced in [10] and [6]. The mcRBM has two groups of hidden units: mean units and precision units. Without the precision units, the mcRBM would be identical to a GRBM. With only the precision units, we have what we will call the “cRBM”, following the terminology in [6]. The precision units are designed to enforce smoothness constraints in the data, but when one of these constraints is seriously violated, it is removed by turning off the precision unit. The set of active precision units therefore specifies a sample-specific covariance matrix.” teaches that the mcRBM contains covariance matrices; Page 4, Section 3: “Just like other RBMs, the mcRBM can be trained using the following update rule, for some generic model parameter…” teaches that the mcRBM can be trained by updating the distributions, therefore the covariance matrix is also updated)
The combination of claim 12 has already incorporated the mean-covariance RBM, therefore already incorporating the details of the mean-covariance RBM required by claim 13. 

Regarding Claim 14, 
The combination of Kim, Dahl, and Dahl-Ranzato teaches The system of claim 12,
Dahl further teaches:
and wherein the posterior distribution of the input data indicates the probability of encountering the data value given the first batch of data and/or the second batch of data. (Page 4, Section 4.2: “We analytically computed the distributions implied by iterations of the M–H operator, assuming the initial state was drawn according to ∏_i q(v^(i)). As this computation requires the instantiation of n 100k × 100k matrices, it cannot be done at training time, but was done offline for analysis purposes. Each application of Metropolis–Hastings results in a new distribution converging to the target (true) conditional” teaches that the Metropolis Hastings algorithm updates the distribution using training data, therefore the posterior distribution can be associated with a probability value by training the RBM with the training data)
The combination of claim 7 has already incorporated the Metropolis-Hastings algorithm (Markov Chain Monte Carlo method), therefore already incorporating the details of the posterior distribution required by Claim 14.

Dahl-Ranzato further teaches: 
wherein the update to the probability density function further comprises determining, based at least on the prior distribution, a gradient of a posterior distribution of the input data, (Page 4, Section 3: “The gradient of the EM term moves the minimum of EMC away from the zero vector, but how far it moves depends on the curvature of the precision matrix defined by EC . The resulting conditional distribution over the visible units, given the two sets of hidden units is: P(v|h, m) ∝ N (ΣWm, Σ” teaches determining a gradient of the conditional distribution (posterior distribution) over the visible units)
The combination of claim 12 has already incorporated the mean-covariance RBM, therefore already incorporating the details of the mean-covariance RBM and gradient of a posterior distribution required by claim 14. 

Claims 15 and 16 are rejected under 35 U.S.C. 103 as being unpatentable over Kim in view of Dahl, further in view of Dahl-Ranzato, further in view of Farina (“Algorithms for Real-Time Processing”)

Regarding Claim 15, 
The combination of Kim, Dahl, and Dahl-Ranzato teaches The system of claim 14,
The combination of Kim, Dahl, and Dahl-Ranzato does not appear to explicitly teach: 
wherein the determination of the gradient includes computing an inverse of a covariance matrix corresponding to the covariance of the prior distribution, wherein the inverse of the covariance matrix is computed by at least performing a plurality of QR decompositions, and wherein the plurality of QR decompositions are performed to compute an inverse of an upper triangular matrix R. 

However, Farina teaches: 
wherein the determination of the gradient includes computing an inverse of a covariance matrix corresponding to the covariance of the prior distribution, wherein the inverse of the covariance matrix is computed by at least performing a plurality of QR decompositions, and wherein the plurality of QR decompositions are performed to compute an inverse of an upper triangular matrix R. (Page 2, Section 2: 
[Image: media_image8.png, excerpt from Farina, Page 2, Section 2]
teaches using QRD (QR decomposition) to determine the inverse of the covariance matrix without having to directly invert the covariance matrix, and that the QRD transforms the matrix into an upper triangular matrix R)
Kim, Dahl, Dahl-Ranzato, and Farina are analogous art because they are directed to models having probability density functions. 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to use Farina’s method of QR decomposition to approximate the inverse of the covariance matrix of Kim/Dahl/Dahl-Ranzato with a motivation to avoid high computational cost associated with the inversion of the covariance matrix (Farina, Page 2).
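For illustration only, the idea of avoiding a direct inversion of the covariance matrix by working with the triangular factor R can be sketched as follows. The sketch assumes real-valued data (so the transpose stands in for the Hermitian transpose), assumes the covariance matrix is formed as ZᵀZ from a data matrix Z, and uses modified Gram-Schmidt to obtain R; these choices and all function names are assumptions and are not taken from Farina.

```python
import math


def qr_upper_factor(Z):
    """Return the upper triangular R of Z = QR via modified Gram-Schmidt.

    Z is an m x n list of rows with m >= n and full column rank.
    """
    m, n = len(Z), len(Z[0])
    cols = [[float(Z[r][k]) for r in range(m)] for k in range(n)]
    R = [[0.0] * n for _ in range(n)]
    for k in range(n):
        R[k][k] = math.sqrt(sum(x * x for x in cols[k]))
        q = [x / R[k][k] for x in cols[k]]
        for j in range(k + 1, n):
            R[k][j] = sum(q[r] * cols[j][r] for r in range(m))
            cols[j] = [cols[j][r] - R[k][j] * q[r] for r in range(m)]
    return R


def forward_substitute(L, s):
    """Solve L t = s for a lower triangular matrix L."""
    t = [0.0] * len(L)
    for i in range(len(L)):
        acc = sum(L[i][j] * t[j] for j in range(i))
        t[i] = (s[i] - acc) / L[i][i]
    return t


def back_substitute(R, t):
    """Solve R w = t for an upper triangular matrix R."""
    n = len(R)
    w = [0.0] * n
    for i in range(n - 1, -1, -1):
        acc = sum(R[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (t[i] - acc) / R[i][i]
    return w


def solve_covariance_system(Z, s):
    """Solve (Z^T Z) w = s without forming or inverting Z^T Z.

    Since Z^T Z = R^T R, first solve R^T t = s, then R w = t.
    """
    R = qr_upper_factor(Z)
    n = len(R)
    R_t = [[R[j][i] for j in range(n)] for i in range(n)]
    t = forward_substitute(R_t, s)
    return back_substitute(R, t)
```

The two triangular solves replace the explicit inverse, which is the computational saving cited as the motivation above.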


Regarding Claim 16, 
The combination of Kim, Dahl, Dahl-Ranzato, and Farina teaches The system of claim 15,
Farina further teaches: 
wherein the hardware accelerator is configured to compute the inverse of the upper triangular matrix R by at least performing back-substitution. (Page 2, Section 2: “Taking the data matrix Z and operating on it with unitary (i.e. covariance preserving) matrix Q (with dimension nxn) we are able to transform the matrix Z in an upper triangular matrix R… which is now easily solved by forward and back-substitution steps as follows. Indicating by a new vector t the product R w, equation (4) becomes: R^H t = s*” teaches computing the inverse of the upper triangular matrix by performing back substitution)
The combination of claim 15 has already incorporated Farina’s method of QR decomposition, therefore already incorporating the details of the back substitution required by claim 16. 
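For illustration only, computing the inverse of an upper triangular matrix R by back-substitution can be sketched as solving R x = e_k for each unit vector e_k and collecting the solutions as columns. The function names and the column-by-column strategy are assumptions for this sketch and are not taken from Farina.

```python
def solve_upper(R, y):
    """Back-substitution: solve R x = y for an upper triangular R."""
    n = len(R)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        acc = sum(R[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (y[i] - acc) / R[i][i]
    return x


def invert_upper_triangular(R):
    """Invert R one column at a time; column k solves R x = e_k."""
    n = len(R)
    columns = [solve_upper(R, [1.0 if r == k else 0.0 for r in range(n)])
               for k in range(n)]
    # Transpose the list of columns back into a list of rows.
    return [[columns[k][r] for k in range(n)] for r in range(n)]
```

Each column costs one back-substitution pass, and the result is itself upper triangular.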

Response to Arguments
Regarding Objection to the Specification
Applicant’s argument: 
“The abstract of the disclosure is objected to for being two pages of the published international application and not a separate abstract. Applicant respectfully disagrees. A separate abstract of the disclosure is found on the first page of the published international application, as is convention for all international application publications. As such, Applicant respectfully submits that this objection is in error and requests withdrawal of the objection.”
Response: 
Applicant’s arguments have been fully considered but are not persuasive. Examiner respectfully disagrees. The abstract of the disclosure, filed on July 29, 2019, contains the first two pages of WO 2018/144534 A1. This is not a separate abstract that is only one paragraph. Please see MPEP 608.01(b). Additionally, the abstract of the disclosure does not commence on a separate sheet in accordance with 37 CFR 1.52(b)(4) and 1.72(b). A new abstract of the disclosure is required and must be presented on a separate sheet, apart from any other text.

Regarding 35 U.S.C. 101
Applicant’s argument: 
“Claims 1-18, 21, and 40 are clearly patent eligible in light of the Office’s Revised Guidance because the claims integrate any alleged mental process or mathematical concept into a practical application for enabling a probabilistic machine learning model to support real time applications that require the training and the application of the machine learning model to occur in parallel.”

Response: 
The 35 U.S.C. 101 rejection applied to claims 1-18, 21, and 40 in the previous office action has been withdrawn due to amendments to independent claims 1, 21, and 40. 

Regarding 35 U.S.C. 103
Applicant’s argument: 
“At best, Kim and Dahl describes techniques to expedite the training of a Boltzmann machine. However, nothing in either Kim or Dahl disclose or suggest implementing a machine learning model on a hardware accelerator such that the training of the machine learning model and the application of the machine learning model to generate an output can occur in parallel.”

Response: 
Applicant’s arguments have been fully considered but are not persuasive. Kim teaches that the Restricted Boltzmann machine module is segmented into a plurality of groups and that each group processes a different portion of the network. This is done to take advantage of the parallelism of FPGAs (hardware accelerator). Because the weights used to train the RBM are partitioned among the different groups, training is done in parallel. Training the network involves applying the model, so training and applying are performed simultaneously, i.e. in parallel.  Please see pages 5-6 of this office action for more detail. 

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SHOUN ABRAHAM whose telephone number is (571)272-8144. The examiner can normally be reached Mon - Fri 08:00-16:30.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kamran Afshar, can be reached on (571) 272-7796. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/S.J.A./Examiner, Art Unit 2125               

/BRIAN M SMITH/Primary Examiner, Art Unit 2122