Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This is the initial Office action issued in response to patent application 16/106,703, filed on 08/21/2018. Claims 1-34, as originally filed, are currently pending and have been considered below. Claims 1, 18, and 31 are independent claims.

Information Disclosure Statement
The information disclosure statements (IDS) were submitted on 01/31/2019 and 06/07/2019. The submissions are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.

Priority
Receipt is acknowledged of certified copies of papers required by 37 CFR 1.55.
Applicant cannot rely upon the certified copy of the foreign priority application to overcome this rejection because a translation of said application has not been made of record in accordance with 37 CFR 1.55. See MPEP §§ 215 and 216.
In particular, Applicant is reminded of the requirements set forth in 37 C.F.R. 1.55(g)(3)-(4), Claim for foreign priority:
“(3) An English language translation of a non-English language foreign application is not required except:
(i) When the application is involved in an interference (see § 41.202 of this chapter) or derivation (see part 42 of this chapter) proceeding;
(ii) When necessary to overcome the date of a reference relied upon by the examiner; or
(iii) When specifically required by the examiner.
(4) If an English language translation of a non-English language foreign application is required, it must be filed together with a statement that the translation of the certified copy is accurate” (emphasis added).
	Since an English language translation of Application No. KR10-2017-0137374 has not been made of record, the Examiner notes that prior art references with filing date or publication date prior to the instant Application’s filing date of 08/21/2018 are considered applicable prior art references.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
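A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.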

Claims 1, 5-6, 17-18, 22-23, and 30-32 are rejected under 35 U.S.C. 103 as being unpatentable over Chilimbi et al. (US 20150324690 A1) in view of Ricks et al. (“Training a Quantum Neural Network”).
Regarding Claim 1,
Chilimbi et al. teaches a processor-implemented neural network method, the method comprising (Chilimbi et al., FIG. 5 and Para. [0048], “Processing unit(s) 612 and can represent, for example, a CPU-type processing unit, a GPU-type processing unit, a field-programmable gate array (FPGA), another class of digital signal processor (DSP), or other hardware logic components that may, in some instances, be driven by a CPU. For example, and without limitation, illustrative types of hardware logic components that can be used include Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), etc. In various embodiments, the processing unit(s) 612 may execute one or more modules and/or processes to cause the server(s) and other machines 610 to perform a variety of functions, as set forth above and explained in further detail in the following disclosure. Additionally, each of the processing unit(s) 612 may possess its own local memory, which also may store program modules, program data, and/or one or more operating systems” teaches a method for training a neural network that is implemented by a processing unit).
calculating individual update values for a weight assigned to a connection relationship between nodes included in a neural network (Chilimbi et al., Para. [0086], “Block 912 illustrates updating the individual weight values to generate updated individual weight values. The updating may be the result of asynchronous communication between the replicas 704A-704N and the global parameter server(s) 706. As described above, the communications may be asynchronous such that individual replicas 704A-704N communicate independently with the global parameter server(s) 706. The different replicas 704A-704N may communicate at different rates with the global parameter server(s) 706. The rates may be based on predetermined time intervals or may be responsive to the replicas 704A-704N processing a predetermined number of the individual data items” teaches generating an updated individual weight value from the results of the asynchronous communication between the replicas (corresponds to the connection relationship between nodes in a neural network)).
generating an accumulated update value by accumulating the individual update values in an accumulation buffer (Chilimbi et al., Para. [0064], “Two different communication protocols for updating parameter weights are described herein. In one embodiment, a communication protocol locally computes and accumulates the weight updates in a buffer that is periodically sent to the global parameter server(s) 706 when a predetermined number, e.g., “k” (which is typically in the hundreds to thousands) of images (e.g., data items) have been processed. This communication is shown by arrows 712 in FIG. 7. The global parameter server(s) 706 then directly apply these accumulated updates to the stored weights. This works well for the convolutional layers since the volume of weights is low due to weight sharing” teaches determining the accumulated updates (corresponds to the accumulated update value) by accumulating the weight updates (corresponds to accumulating the individual update values) in a buffer (corresponds to the accumulation buffer)).
Chilimbi et al. does not appear to explicitly teach training the neural network by updating the weight using the accumulated update value in response to the accumulated update value being equal to or greater than a threshold value.
However, Ricks et al., teaches training the neural network by updating the weight using the accumulated update value in response to the accumulated update value being equal to or greater than a threshold value (Ricks et al., Section 3 Pg. 4, “The QNN in Figure 1 is an example of such a network, with sufficient complexity to compute the XOR function. Each input node i is represented by a register, |αi i . The two hidden nodes compute a weighted sum of the inputs, |ψi i1 and |ψi i2, and compare the sum to a threshold weight, |ψi i0. If the weighted sum is greater than the threshold the node goes high. The |βik represent internal calculations that take place at each node. The output layer works similarly, taking a weighted sum of the hidden nodes and checking against a threshold. The QNN then checks each computed output and compares it to the target output, |Ωi j sending |ϕi j high when they are equivalent. The performance of the network is denoted by |ρi, which is the number of computed outputs equivalent to their corresponding target output.” teaches updating the weight using the weighted sum (corresponds to 
Chilimbi et al. and Ricks et al. are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Chilimbi et al. with Ricks et al., with the motivation of training the neural network by updating the weight using the accumulated update value in response to the accumulated update value being equal to or greater than a threshold value. “A randomized version avoids some of the exponential increases in complexity with problem size. This algorithm is exponential in the number of qubits of each node’s weight vector instead of in the composite weight vector of the entire network. This means the complexity of the algorithm increases with the number of connections to a node and the precision of each individual weight, dramatically decreasing complexity for problems with large numbers of nodes. This could be a great improvement for larger problems. Preliminary results for both algorithms have been very positive” (Ricks et al., Conclusion). The proposed teaching is beneficial in that it dramatically decreases complexity for problems with large numbers of nodes.
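For illustration only, the limitations of claim 1 as mapped above (calculating individual update values, accumulating them in an accumulation buffer, and updating the weight once the accumulated update value is equal to or greater than a threshold value) can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from the application, Chilimbi et al., or Ricks et al.: the class name `ThresholdedWeight`, a single scalar weight, and the choice to add the whole buffer to the weight and then clear the buffer are all hypothetical. The comparison is made on the raw accumulated value, as the claim language reads; the treatment of negative values is not addressed here.

```python
from dataclasses import dataclass


@dataclass
class ThresholdedWeight:
    """Hypothetical scalar weight with an accumulation buffer (illustrative only)."""
    weight: float
    threshold: float
    buffer: float = 0.0  # accumulated update value

    def accumulate(self, individual_update: float) -> bool:
        """Add one individual update value to the buffer.

        Returns True if the weight was updated on this call.
        """
        self.buffer += individual_update
        # Update the weight only when the accumulated value reaches the threshold.
        if self.buffer >= self.threshold:
            self.weight += self.buffer  # assumption: apply the whole buffer
            self.buffer = 0.0           # assumption: clear the buffer afterwards
            return True
        return False


w = ThresholdedWeight(weight=1.0, threshold=0.5)
applied = [w.accumulate(u) for u in (0.25, 0.125, 0.25)]
print(applied, w.weight, w.buffer)
```

In this sketch the first two individual update values only fill the buffer; the third brings the accumulated value to the threshold and triggers the weight update.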
Regarding Claim 5,
Chilimbi et al. in view of Ricks et al. teaches the method of claim 1, further comprising
Ricks et al. further teaches determining whether the accumulated update value is equal to or greater than the threshold value at a predetermined update period (Ricks et al., Section 3 Pg. 4, “The QNN in Figure 1 is an example of such a network, with sufficient complexity to compute the XOR function. Each input node i is represented by a register, |αi i. The two hidden nodes compute a weighted sum of the inputs, |ψi i1 and |ψi i2, and compare the sum to a threshold weight, |ψi i0. If the weighted sum is greater than the threshold the node goes high. The |βik represent internal calculations that take place at each node. The output layer works similarly, taking a weighted sum of the hidden nodes and checking against a threshold. The QNN then checks each computed output and compares it to the target output, |Ωi j sending |ϕi j high when they are equivalent. The performance of the network is denoted by |ρi, which is the number of computed outputs equivalent to their corresponding target output” teaches updating the weight (corresponds to updated value) using the weighted sum (corresponds to the accumulated update value) and comparing the weighted sum to the threshold value during the hidden nodes and output layer (corresponds to the predetermined update period)).  
Chilimbi et al. and Ricks et al. are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Chilimbi et al. with Ricks et al., with the motivation of determining whether the accumulated update value is equal to or greater than the threshold value at a predetermined update period. “A randomized version avoids some of the exponential increases in complexity with problem size. This algorithm is exponential in the number 
Regarding Claim 6,
Chilimbi et al. in view of Ricks et al. teaches the method of claim 5, further comprising
Ricks et al. further teaches accumulating the individual update values in the accumulation buffer until a next update period in response to the accumulated update value being smaller than the threshold value (Ricks et al., Section 3 Pg. 4, “We propose a QNN that operates much like a classical ANN composed of several layers of perceptrons – an input layer, one or more hidden layers and an output layer. Each layer is fully connected to the previous layer. Each hidden layer computes a weighted sum of the outputs of the previous layer. If this is sum above a threshold, the node goes high, otherwise it stays low. The output layer does the same thing as the hidden layer(s), except that it also checks its accuracy against the target output of the network. The network as a whole computes a function by checking which output bit is high. There are no checks to make sure exactly one output is high. This allows the network to learn data sets which have one output high or binary-encoded outputs” teaches the hidden nodes (corresponds 
Chilimbi et al. and Ricks et al. are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Chilimbi et al. with Ricks et al., with the motivation of accumulating the individual update values in the accumulation buffer until a next update period in response to the accumulated update value being smaller than the threshold value. “A randomized version avoids some of the exponential increases in complexity with problem size. This algorithm is exponential in the number of qubits of each node’s weight vector instead of in the composite weight vector of the entire network. This means the complexity of the algorithm increases with the number of connections to a node and the precision of each individual weight, dramatically decreasing complexity for problems with large numbers of nodes. This could be a great improvement for larger problems. Preliminary results for both algorithms have been very positive” (Ricks et al., Conclusion). The proposed teaching is beneficial in that it dramatically decreases complexity for problems with large numbers of nodes.
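For illustration only, the additional limitations of claims 5 and 6 (checking the accumulated update value against the threshold value at a predetermined update period, and continuing to accumulate until the next update period when the value is smaller than the threshold value) can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from the application or from either reference: the function name `train_with_update_period`, the use of a step count as the update period, and the reset of the buffer after an update are hypothetical.

```python
def train_with_update_period(weight, updates, threshold, update_period):
    """Accumulate updates every step, but compare to the threshold only
    once per update period (hypothetical helper, illustrative only)."""
    buffer = 0.0   # accumulation buffer
    history = []   # weight value recorded at each periodic check
    for step, individual_update in enumerate(updates, start=1):
        buffer += individual_update
        if step % update_period == 0:      # predetermined update period
            if buffer >= threshold:
                weight += buffer           # assumption: apply the whole buffer
                buffer = 0.0
            # Otherwise the buffer is left untouched, so accumulation
            # simply continues until the next update period.
            history.append(weight)
    return weight, buffer, history


final_weight, leftover, history = train_with_update_period(
    weight=0.0, updates=[0.125] * 6, threshold=0.5, update_period=2
)
print(final_weight, leftover, history)
```

With these inputs the first periodic check finds the buffer below the threshold and carries it over, the second check applies the update, and the third again carries the remainder over.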
Regarding Claim 17,
Chilimbi et al. in view of Ricks et al. teaches the processor to perform the method of claim 1 
Chilimbi et al. further teaches a non-transitory computer-readable storage medium storing instructions that, when executed by a processor, cause (Chilimbi et al., Para. [0049], “For example, the computer-readable media 614 may include the deep learning training module 616, the model module 618, and other modules. The modules (e.g., 616, 618, etc.) can be implemented as computer-readable instructions, various data structures, and so forth via at least one processing unit(s) 612 to configure a device to execute instructions and to perform operations implementing. Functionality to perform these operations may be included in multiple devices or a single device.” teaches the computer-readable media (corresponds to computer-readable storage medium) that stores instructions that are executed by the processing unit. Para. [0050], “Computer storage media can include volatile memory, nonvolatile memory, and/or other persistent and/or auxiliary computer storage media, removable and non-removable computer storage media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules, or other data. Computer memory is an example of computer storage media. 
Thus, computer storage media includes tangible and/or physical forms of media included in a device and/or hardware component that is part of a device or external to a device, including but not limited to random-access memory (RAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), phase change memory (PRAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory, compact disc read-only memory (CD-ROM), digital versatile disks (DVDs), optical cards or other optical storage media, miniature hard drives, memory cards, magnetic cassettes, magnetic tape, magnetic disk storage, magnetic cards or other magnetic storage devices or media, solid-state memory devices, storage arrays, network attached storage, storage area networks, hosted computer storage or any other storage memory, storage device, and/or storage medium that can be used to store and maintain information for access by a computing device” teaches the non-transitory computer-readable media).  
Regarding Claim 18,
Chilimbi et al. teaches a neural network apparatus, the apparatus comprising: one or more processors configured to (Chilimbi et al., FIG. 5 and Para. [0048], “Processing unit(s) 612 and can represent, for example, a CPU-type processing unit, a GPU-type processing unit, a field-programmable gate array (FPGA), another class of digital signal processor (DSP), or other hardware logic components that may, in some instances, be driven by a CPU. For example, and without limitation, illustrative types of hardware logic components that can be used include Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), etc. In various embodiments, the processing unit(s) 612 may execute one or more modules and/or processes to cause the server(s) and other machines 610 to perform a variety of functions, as set forth above and explained in further detail in the following disclosure. Additionally, each of the processing unit(s) 612 may possess its own local memory, which also may store program modules, program data, and/or one or more operating systems” teaches training of a neural network that is implemented by one or more processing unit).
calculate individual update values for a weight assigned to a connection relationship between nodes included in a neural network (Chilimbi et al., Para. [0086], “Block 912 illustrates updating the individual weight values to generate updated individual weight values. The updating may be the result of asynchronous communication between the replicas 704A-704N and the global parameter server(s) 706. As described above, the communications may be asynchronous such that individual replicas 704A-704N communicate independently with the global parameter server(s) 706. The different replicas 704A-704N may communicate at different rates with the global parameter server(s) 706. The rates may be based on predetermined time intervals or may be responsive to the replicas 704A-704N processing a predetermined number of the individual data items” teaches generating an updated individual weight value from the results of the asynchronous communication between the replicas (corresponds to the connection relationship between nodes in a neural network)).
generate an accumulated update value by accumulating the individual update values in an accumulation buffer (Chilimbi et al., Para. [0064], “Two different communication protocols for updating parameter weights are described herein. In one embodiment, a communication protocol locally computes and accumulates the weight updates in a buffer that is periodically sent to the global parameter server(s) 706 when a predetermined number, e.g., “k” (which is typically in the hundreds to thousands) of images (e.g., data items) have been processed. This communication is shown by arrows 712 in FIG. 7. The global parameter server(s) 706 then directly apply these accumulated updates to the stored weights. This works well for the convolutional layers since the volume of weights is low due to weight sharing” teaches determining the accumulated updates (corresponds to the accumulated update value) by accumulating the weight updates (corresponds to accumulating the individual update values) in a buffer (corresponds to the accumulation buffer)).
Chilimbi et al. does not appear to explicitly teach train the neural network by updating the weight using the accumulated update value in response to the accumulated update value being equal to or greater than a threshold value.
However, Ricks et al., teaches train the neural network by updating the weight using the accumulated update value in response to the accumulated update value being equal to or greater than a threshold value (Ricks et al., Section 3 Pg. 4, “The QNN in Figure 1 is an example of such a network, with sufficient complexity to compute the XOR function. Each input node i is represented by a register, |αi i. The two hidden nodes compute a weighted sum of the inputs, |ψi i1 and |ψi i2, and compare the sum to a threshold weight, |ψi i0. If the weighted sum is greater than the threshold the node goes high. The |βik represent internal calculations that take place at each node. The output layer works similarly, taking a weighted sum of the hidden nodes and checking against a threshold. The QNN then checks each computed output and compares it to the target output, |Ωi j sending |ϕi j high when they are equivalent. The performance of the network is denoted by |ρi, which is the number of computed outputs equivalent to their corresponding target output.” teaches updating the weight using the weighted sum (corresponds to the accumulated update value) and comparing the updated value to the threshold value). 
Chilimbi et al. and Ricks et al. are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Chilimbi et al. with Ricks et al., with the motivation to train the neural network by updating the weight using the accumulated update value in response to the accumulated update value being equal to or greater than a threshold value. “A randomized version avoids some of the exponential increases in complexity with problem size. This algorithm is exponential in the number of qubits of each node’s weight vector instead of in the composite weight vector of the entire network. This means the complexity of the algorithm increases with the number of connections to a node and the precision of each individual weight, dramatically decreasing complexity for problems with large numbers of nodes. This could be a great improvement for larger problems. Preliminary results for both algorithms have been very positive” (Ricks et al., Conclusion). The proposed teaching is beneficial in that it dramatically decreases complexity for problems with large numbers of nodes.
Regarding Claim 22,
Chilimbi et al. in view of Ricks et al. teaches the apparatus of claim 18
Chilimbi et al. further teaches wherein the one or more processors are further configured to (Chilimbi et al., FIG. 5 and Para. [0048], “Processing unit(s) 612 and can represent, for example, a CPU-type processing unit, a GPU-type processing unit, a field-programmable gate array (FPGA), another class of digital signal processor (DSP), or other hardware logic components that may, in some instances, be driven by a CPU. For example, and without limitation, illustrative types of hardware logic components that can be used include Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), etc. In various embodiments, the processing unit(s) 612 may execute one or more modules and/or processes to cause the server(s) and other machines 610 to perform a variety of functions, as set forth above and explained in further detail in the following disclosure. Additionally, each of the processing unit(s) 612 may possess its own local memory, which also may store program modules, program data, and/or one or more operating systems” teaches the one or more processing unit).
Ricks et al. further teaches determine whether the accumulated update value is equal to or greater than the threshold value at a predetermined update period (Ricks et al., Section 3 Pg. 4, “The QNN in Figure 1 is an example of such a network, with sufficient complexity to compute the XOR function. Each input node i is represented by a register, |αi i. The two hidden nodes compute a weighted sum of the inputs, |ψi i1 and |ψi i2, and compare the sum to a threshold weight, |ψi i0. If the weighted sum is greater than the threshold the node goes high. The |βik represent internal calculations that take place at each node. The output layer works similarly, taking a weighted sum of the hidden nodes and checking against a threshold. The QNN then checks each computed output and compares it to the target output, |Ωi j sending |ϕi j high when they are equivalent. The performance of the network is denoted by |ρi, which is the number of computed outputs equivalent to their corresponding target output” teaches updating the weight (corresponds to updated value) using the weighted sum (corresponds to the accumulated update value) and comparing the weighted sum to the threshold value during the hidden nodes and output layer (corresponds to the predetermined update period)).  
Chilimbi et al. and Ricks et al. are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Chilimbi et al. with Ricks et al., with the motivation to determine whether the accumulated update value is equal to or greater than the threshold value at a predetermined update period. “A randomized version avoids some of the exponential increases in complexity with problem size. This algorithm is exponential in the number of qubits of each node’s weight vector instead of in the composite weight vector of the entire network. This means the complexity of the algorithm increases with the number of connections to a node and the precision of each individual weight, dramatically decreasing complexity for problems with large numbers of nodes. This could be a great improvement for larger problems. Preliminary results for both algorithms have been very 
Regarding Claim 23,
Chilimbi et al. in view of Ricks et al. teaches the apparatus of claim 22
Chilimbi et al. further teaches wherein the one or more processors are further configured to (Chilimbi et al., FIG. 5 and Para. [0048], “Processing unit(s) 612 and can represent, for example, a CPU-type processing unit, a GPU-type processing unit, a field-programmable gate array (FPGA), another class of digital signal processor (DSP), or other hardware logic components that may, in some instances, be driven by a CPU. For example, and without limitation, illustrative types of hardware logic components that can be used include Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), etc. In various embodiments, the processing unit(s) 612 may execute one or more modules and/or processes to cause the server(s) and other machines 610 to perform a variety of functions, as set forth above and explained in further detail in the following disclosure. Additionally, each of the processing unit(s) 612 may possess its own local memory, which also may store program modules, program data, and/or one or more operating systems” teaches the one or more processing unit).
Ricks et al. further teaches accumulate the individual update values in the accumulation buffer until a next update period in response to the accumulated update value being smaller than the threshold value (Ricks et al., Section 3 Pg. 4, “We propose a QNN that operates much like a classical ANN composed of several layers of perceptrons – an input layer, one or more hidden layers and an output layer. Each layer is fully connected to the previous layer. Each hidden layer computes a weighted sum of the outputs of the previous layer. If this is sum above a threshold, the node goes high, otherwise it stays low. The output layer does the same thing as the hidden layer(s), except that it also checks its accuracy against the target output of the network. The network as a whole computes a function by checking which output bit is high. There are no checks to make sure exactly one output is high. This allows the network to learn data sets which have one output high or binary-encoded outputs” teaches that the hidden nodes (corresponds to the accumulation buffer) compute a weighted sum (corresponds to accumulating the individual update) that is compared to the threshold value. If the sum is greater than the threshold, the node goes high; if the sum is smaller than the threshold, the node stays low).
Chilimbi et al. and Ricks et al. are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Chilimbi et al. with Ricks et al., with the motivation to accumulate the individual update values in the accumulation buffer until a next update period in response to the accumulated update value being smaller than the threshold value. “A randomized version avoids some of the exponential increases in complexity with problem size. This algorithm is exponential in the number of qubits of each node’s 
Regarding Claim 30,
Chilimbi et al. in view of Ricks et al. teaches the apparatus of claim 18, further comprising
Chilimbi et al. further teaches a memory configured to store one or more programs (Chilimbi et al., Para. [0048], “In various embodiments, the processing unit(s) 612 may execute one or more modules and/or processes to cause the server(s) and other machines 610 to perform a variety of functions, as set forth above and explained in further detail in the following disclosure. Additionally, each of the processing unit(s) 612 may possess its own local memory, which also may store program modules, program data, and/or one or more operating systems” teaches local memory in the processing unit (corresponds to the memory) that may store program modules and/or program data (corresponds to the one or more programs)).
wherein the one or more processors are configured to calculate the individual update values, generate the accumulated update value, and train the neural network, in response to executing the one or more programs (Chilimbi et al., FIG. 5 and Para. [0048], “Processing unit(s) 612 and can represent, for example, a CPU-type processing unit, a GPU-type processing unit, a field-programmable gate array (FPGA), another class of digital signal processor (DSP), or other hardware logic components that may, in some instances, be driven by a CPU. For example, and without limitation, illustrative types of hardware logic components that can be used include Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), etc. In various embodiments, the processing unit(s) 612 may execute one or more modules and/or processes to cause the server(s) and other machines 610 to perform a variety of functions, as set forth above and explained in further detail in the following disclosure. Additionally, each of the processing unit(s) 612 may possess its own local memory, which also may store program modules, program data, and/or one or more operating systems ” teaches a method for training a neural network that is implemented by one or more processing unit. Para. [0086], “Block 912 illustrates updating the individual weight values to generate updated individual weight values. The updating may be the result of asynchronous communication between the replicas 704A-704N and the global parameter server(s) 706. As described above, the communications may be asynchronous such that individual replicas 704A-704N communicate independently with the global parameter server(s) 706. The different replicas 704A-704N may communicate at different rates with the global parameter server(s) 706. 
The rates may be based on predetermined time intervals or may be responsive to the replicas 704A-704N processing a predetermined number of the individual data items” teaches generating an updated individual weight value from the results of the asynchronous communication between the replicas. Para. [0064], “Two different communication protocols for updating parameter weights are described herein. In one embodiment, a communication protocol locally computes and accumulates the weight updates in a buffer that is periodically sent to the global parameter server(s) 706 when a predetermined number, e.g., “k” (which is typically in the hundreds to thousands) of images (e.g., data items) have been processed. This communication is shown by arrows 712 in FIG. 7. The global parameter server(s) 706 then directly apply these accumulated updates to the stored weights. This works well for the convolutional layers since the volume of weights is low due to weight sharing” teaches determining the accumulated updates (corresponds to the accumulated update value) by accumulating the weight updates in a buffer).
Regarding Claim 31,
Chilimbi et al. teaches a processor-implemented neural network method, the method comprising (Chilimbi et al., FIG. 5 and Para. [0048], “Processing unit(s) 612 and can represent, for example, a CPU-type processing unit, a GPU-type processing unit, a field-programmable gate array (FPGA), another class of digital signal processor (DSP), or other hardware logic components that may, in some instances, be driven by a CPU. For example, and without limitation, illustrative types of hardware logic components that can be used include Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), etc. In various embodiments, the processing unit(s) 612 may execute one or more modules and/or processes to cause the server(s) and other machines 610 to perform a variety of functions, as set forth above and explained in further detail in the following disclosure. Additionally, each of the processing unit(s) 612 may possess its own local memory, which also may store program modules, program data, and/or one or more operating systems” teaches a method for training a neural network that is implemented by one or more processing unit).
calculating individual update values for a weight assigned to a connection relationship between nodes included in a neural network (Chilimbi et al., Para. [0086], “Block 912 illustrates updating the individual weight values to generate updated individual weight values. The updating may be the result of asynchronous communication between the replicas 704A-704N and the global parameter server(s) 706. As described above, the communications may be asynchronous such that individual replicas 704A-704N communicate independently with the global parameter server(s) 706. The different replicas 704A-704N may communicate at different rates with the global parameter server(s) 706. The rates may be based on predetermined time intervals or may be responsive to the replicas 704A-704N processing a predetermined number of the individual data items” teaches generating an updated individual weight value from the results of the asynchronous communication between the replicas (corresponds to the connection relationship between nodes in a neural network)).
Chilimbi et al., Para. [0064], “Two different communication protocols for updating parameter weights are described herein. In one embodiment, a communication protocol locally computes and accumulates the weight updates in a buffer that is periodically sent to the global parameter server(s) 706 when a predetermined number, e.g., “k” (which is typically in the hundreds to thousands) of images (e.g., data items) have been processed. This communication is shown by arrows 712 in FIG. 7. The global parameter server(s) 706 then directly apply these accumulated updates to the stored weights. This works well for the convolutional layers since the volume of weights is low due to weight sharing” teaches determining the accumulated updates (corresponds to the accumulated update value) by accumulating the weight updates (corresponds to accumulating the individual update values) in a buffer (corresponds to the accumulation buffer)).
Chilimbi et al. does not appear to explicitly teach training the neural network by updating the weight using the accumulated update value in response to the accumulated update value being equal to or greater than a threshold value
However, Ricks et al. teaches training the neural network by updating the weight using the accumulated update value in response to the accumulated update value being equal to or greater than a threshold value (Ricks et al., Section 3 Pg. 4, “The QNN in Figure 1 is an example of such a network, with sufficient complexity to compute the XOR function. Each input node i is represented by a register, |α⟩_i. The two hidden nodes compute a weighted sum of the inputs, |ψ⟩_i1 and |ψ⟩_i2, and compare the sum to a threshold weight, |ψ⟩_i0. If the weighted sum is greater than the threshold the node goes high. The |β⟩_k represent internal calculations that take place at each node. The output layer works similarly, taking a weighted sum of the hidden nodes and checking against a threshold. The QNN then checks each computed output and compares it to the target output, |Ω⟩_j sending |ϕ⟩_j high when they are equivalent. The performance of the network is denoted by |ρ⟩, which is the number of computed outputs equivalent to their corresponding target output.” teaches updating the weight using the weighted sum (corresponds to the accumulated update value) and comparing the updated value to the threshold value).
Chilimbi et al. and Ricks et al. are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Chilimbi et al. with Ricks et al., with the motivation of training the neural network by updating the weight using the accumulated update value in response to the accumulated update value being equal to or greater than a threshold value. “A randomized version avoids some of the exponential increases in complexity with problem size. This algorithm is exponential in the number of qubits of each node’s weight vector instead of in the composite weight vector of the entire network. This means the complexity of the algorithm increases with the number of connections to a node and the precision of each individual weight, dramatically decreasing complexity for problems with large numbers of nodes. This could be a great improvement for larger problems. Preliminary results for both algorithms have been very positive” (Ricks et al., 
Regarding Claim 32,
Chilimbi et al. in view of Ricks et al. teaches the method of claim 31, wherein the updating comprises
Chilimbi et al. further teaches determining a portion of the accumulated update value to be an effective update value (Chilimbi et al., Para. [0064], “Two different communication protocols for updating parameter weights are described herein. In one embodiment, a communication protocol locally computes and accumulates the weight updates in a buffer that is periodically sent to the global parameter server(s) 706 when a predetermined number, e.g., “k” (which is typically in the hundreds to thousands) of images (e.g., data items) have been processed. This communication is shown by arrows 712 in FIG. 7. The global parameter server(s) 706 then directly apply these accumulated updates to the stored weights. This works well for the convolutional layers since the volume of weights is low due to weight sharing” teaches determining the accumulated updates (corresponds to the effective update value) based on accumulating the weight updates (corresponds to accumulated update value)).
updating the weight by adding the effective update value to the weight (Chilimbi et al., Para. [0064], “Two different communication protocols for updating parameter weights are described herein. In one embodiment, a communication protocol locally computes and accumulates the weight updates in a buffer that is periodically sent to the global parameter server(s) 706 when a predetermined number, e.g., “k” (which is typically in the hundreds to thousands) of images (e.g., data items) have been processed. This communication is shown by arrows 712 in FIG. 7. The global parameter server(s) 706 then directly apply these accumulated updates to the stored weights. This works well for the convolutional layers since the volume of weights is low due to weight sharing” teaches applying the accumulated updates (corresponds to effective update values) to the stored weights).
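For illustration only, the limitations discussed above for claims 31 and 32 (accumulating individual update values, comparing the accumulated update value to a threshold value, and adding an effective update value to the weight) can be sketched as follows. This is a minimal sketch and is not taken from Chilimbi et al., Ricks et al., or the instant application. The function name, the use of an absolute value in the comparison, the choice of the effective update value as the largest whole multiple of the threshold, and the removal of that value from the buffer are all assumptions made for the example.

```python
def accumulate_and_update(weight, individual_updates, threshold):
    """Apply small updates to `weight` only once their running sum reaches `threshold`.

    All names here are hypothetical; the behavior is one possible reading of
    the claim language, not a disclosed implementation.
    """
    accumulated = 0.0  # stands in for the accumulation buffer
    for update in individual_updates:
        accumulated += update  # accumulate the individual update value
        if abs(accumulated) >= threshold:  # assumption: compare magnitude
            # Assumption: the effective update is the portion of the
            # accumulated value that is a whole multiple of the threshold.
            steps = int(accumulated / threshold)  # truncates toward zero
            effective = steps * threshold
            weight += effective        # add the effective update to the weight
            accumulated -= effective   # keep only the remainder in the buffer
    return weight, accumulated


# Five updates of 2**-5 against a threshold of 2**-4 (all exactly representable).
final_weight, remainder = accumulate_and_update(0.5, [0.03125] * 5, 0.0625)
print(final_weight, remainder)
```

In this sketch the weight changes only on the second and fourth updates, and the fifth update remains in the buffer until a later update pushes the accumulated value back over the threshold.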
Claims 2-3 and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Chilimbi et al. in view of Ricks et al. in further view of Yin et al. (“A New Class of Nonlinear Filters-Neural Filters”) in view of Shafiee et al. (“ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars”).
Regarding Claim 2,
Chilimbi et al. in view of Ricks et al. teaches the method of claim 1, wherein: 
Chilimbi et al. further teaches determining an effective update value based on the accumulated update value (Chilimbi et al., Para. [0064], “Two different communication protocols for updating parameter weights are described herein. In one embodiment, a communication protocol locally computes and accumulates the weight updates in a buffer that is periodically sent to the global parameter server(s) 706 when a predetermined number, e.g., “k” (which is typically in the hundreds to thousands) of images (e.g., data items) have been processed. This communication is shown by arrows 712 in FIG. 7. The global parameter server(s) 706 then directly apply these accumulated updates to the stored weights. This works well for the convolutional layers since the volume of weights is low due to weight sharing” teaches determining the accumulated updates (corresponds to the effective update value) based on accumulating the weight updates (corresponds to accumulated update value)).
adding the effective update value to the weight (Chilimbi et al., Para. [0064], “Two different communication protocols for updating parameter weights are described herein. In one embodiment, a communication protocol locally computes and accumulates the weight updates in a buffer that is periodically sent to the global parameter server(s) 706 when a predetermined number, e.g., “k” (which is typically in the hundreds to thousands) of images (e.g., data items) have been processed. This communication is shown by arrows 712 in FIG. 7. The global parameter server(s) 706 then directly apply these accumulated updates to the stored weights. This works well for the convolutional layers since the volume of weights is low due to weight sharing” teaches applying the accumulated updates (corresponds to effective update values) to the stored weights).
Chilimbi et al. in view of Ricks et al. does not appear to explicitly teach the threshold value is a value corresponding to a least significant effective bit of the weight; and the updating comprises
However, Yin et al., teaches the threshold value is a value corresponding to a least significant effective bit of the weight; and the updating comprises (Yin et al., Section V.A Pg. 1211, “The soft neural filter seems to give a small weight to the least significant threshold levels of the input samples in the window. This follows from the fact that the least significant threshold levels contain very little correlation thus none of them should be emphasized in the estimation procedure” teaches the threshold levels (corresponds to the threshold value) correlating to the least significant of the small weight).
Chilimbi et al., Ricks et al., and Yin et al. are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Chilimbi et al. and Ricks et al. with Yin et al., with the motivation of having the threshold value be a value corresponding to a least significant effective bit of the weight. “Adaptive LMA algorithm and adaptive LMS algorithm were developed for finding optimal neural filters under the MSE criterion and MAE criterion, respectively. These algorithms, analogous to the backpropagation algorithm, are amenable to implementation on already existing hardware in neural networks. Experimental results in 1-D and 2-D signal processing demonstrated that adaptive neural filters can effectively remove various kinds of noise such as Gaussian noise and impulsive noise” (Yin et al., Conclusion). The proposed teaching is beneficial in that it removes various kinds of noise such as Gaussian noise and impulsive noise.
Chilimbi et al. in view of Ricks et al. in view of Yin et al. does not appear to explicitly teach subtracting the effective update value from the accumulated update value of the accumulation buffer 
However, Shafiee et al. teaches subtracting the effective update value from the accumulated update value of the accumulation buffer (Shafiee et al., Section V Pg. 19, “where a_i refers to the i-th input value. The conversion requires us to compute the sum of the current input values a_i, which is done with one additional column per array, referred to as the unit column. During an IMA operation, the unit column produces the result $\sum_{i=0}^{R-1} f(x) = a_i$. The results of any columns that have been stored in flipped form is subtracted from the results of the unit column. In addition, we need a bit per column to track if the column has original or flipped weights” teaches subtracting the results in flipped form (corresponds to the effective update value) from the sum of the current input values (corresponds to the accumulated update value). Section III Pg. 16, “Each tile is composed of eDRAM buffers to store input values, a number of in-situ multiply-accumulate (IMA) units, and output registers to aggregate results, all connected with a shared bus. The tile also has shift-and-add, sigmoid, and max-pool units. Each IMA has a few crossbar arrays and ADCs, connected with a shared bus. The IMA also has input/output registers and shift-and-add units. A detailed discussion of each component is deferred until Section VI” teaches that the input values and the accumulated update values are stored in an eDRAM buffer (corresponds to the accumulation buffer)).
Chilimbi et al., Ricks et al., Yin et al., and Shafiee et al. are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Chilimbi et al., Ricks et al., and Yin et al. with Shafiee et al., with the motivation of subtracting the effective update value from the accumulated update value of the accumulation buffer. “In particular, a balanced inter-layer pipeline with replication, an intra-tile pipeline, efficient 
Regarding Claim 3,
Chilimbi et al. in view of Ricks et al. in view of Yin et al. in view of Shafiee et al. teaches the method of claim 2
Chilimbi et al. further teaches wherein the effective update value is a portion of the accumulated update value (Chilimbi et al., Para. [0064], “Two different communication protocols for updating parameter weights are described herein. In one embodiment, a communication protocol locally computes and accumulates the weight updates in a buffer that is periodically sent to the global parameter server(s) 706 when a predetermined number, e.g., “k” (which is typically in the hundreds to thousands) of images (e.g., data items) have been processed. This communication is shown by arrows 712 in FIG. 7. The global parameter server(s) 706 then directly apply these accumulated updates to the stored weights. This works well for the convolutional layers since the volume of weights is low due to weight sharing” teaches determining the accumulated updates 
Regarding Claim 19,
Chilimbi et al. in view of Ricks et al. teaches the apparatus of claim 18, wherein
Chilimbi et al. further teaches the one or more processors are further configured to (Chilimbi et al., Para. [0048], “Processing unit(s) 612 and can represent, for example, a CPU-type processing unit, a GPU-type processing unit, a field-programmable gate array (FPGA), another class of digital signal processor (DSP), or other hardware logic components that may, in some instances, be driven by a CPU. For example, and without limitation, illustrative types of hardware logic components that can be used include Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), etc. In various embodiments, the processing unit(s) 612 may execute one or more modules and/or processes to cause the server(s) and other machines 610 to perform a variety of functions, as set forth above and explained in further detail in the following disclosure. Additionally, each of the processing unit(s) 612 may possess its own local memory, which also may store program modules, program data, and/or one or more operating systems” teaches the one or more processing unit).
… determine an effective update value based on the accumulated update value (Chilimbi et al., Para. [0064], “Two different communication protocols for updating parameter weights are described herein. In one embodiment, a communication protocol locally computes and accumulates the weight updates in a buffer that is periodically sent to the global parameter server(s) 706 when a predetermined number, e.g., “k” (which is typically in the hundreds to thousands) of images (e.g., data items) have been processed. This communication is shown by arrows 712 in FIG. 7. The global parameter server(s) 706 then directly apply these accumulated updates to the stored weights. This works well for the convolutional layers since the volume of weights is low due to weight sharing” teaches determining the accumulated updates (corresponds to the effective update value) based on accumulating the weight updates (corresponds to accumulated update value)).
add the effective update value to the weight (Chilimbi et al., Para. [0064], “Two different communication protocols for updating parameter weights are described herein. In one embodiment, a communication protocol locally computes and accumulates the weight updates in a buffer that is periodically sent to the global parameter server(s) 706 when a predetermined number, e.g., “k” (which is typically in the hundreds to thousands) of images (e.g., data items) have been processed. This communication is shown by arrows 712 in FIG. 7. The global parameter server(s) 706 then directly apply these accumulated updates to the stored weights. This works well for the convolutional layers since the volume of weights is low due to weight sharing” teaches applying the accumulated updates (corresponds to effective update values) to the stored weights).
Chilimbi et al. in view of Ricks et al. does not appear to explicitly teach the threshold value is a value corresponding to a least significant effective bit of the weight; and the updating comprises
However, Yin et al., teaches the threshold value is a value corresponding to a least significant effective bit of the weight (Yin et al., Section V.A Pg. 1211, “The soft neural filter seems to give a small weight to the least significant threshold levels of the input samples in the window. This follows from the fact that the least significant threshold levels contain very little correlation thus none of them should be emphasized in the estimation procedure” teaches the threshold levels (corresponds to the threshold value) correlating to the least significant of the small weight).
Chilimbi et al., Ricks et al., and Yin et al. are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Chilimbi et al. and Ricks et al. with Yin et al., with the motivation of having the threshold value be a value corresponding to a least significant effective bit of the weight. “Adaptive LMA algorithm and adaptive LMS algorithm were developed for finding optimal neural filters under the MSE criterion and MAE criterion, respectively. These algorithms, analogous to the backpropagation algorithm, are amenable to implementation on already existing hardware in neural networks. Experimental results in 1-D and 2-D signal processing demonstrated that adaptive neural filters can effectively remove various kinds of noise 
Chilimbi et al. in view of Ricks et al. in view of Yin et al. does not appear to explicitly teach subtract the effective update value from the accumulated update value of the accumulation buffer
However, Shafiee et al. teaches subtract the effective update value from the accumulated update value of the accumulation buffer (Shafiee et al., Section V Pg. 19, “where a_i refers to the i-th input value. The conversion requires us to compute the sum of the current input values a_i, which is done with one additional column per array, referred to as the unit column. During an IMA operation, the unit column produces the result $\sum_{i=0}^{R-1} f(x) = a_i$. The results of any columns that have been stored in flipped form is subtracted from the results of the unit column. In addition, we need a bit per column to track if the column has original or flipped weights” teaches subtracting the results in flipped form (corresponds to the effective update value) from the sum of the current input values (corresponds to the accumulated update value). Section III Pg. 16, “Each tile is composed of eDRAM buffers to store input values, a number of in-situ multiply-accumulate (IMA) units, and output registers to aggregate results, all connected with a shared bus. The tile also has shift-and-add, sigmoid, and max-pool units. Each IMA has a few crossbar arrays and ADCs, connected with a shared bus. The IMA also has input/output registers and shift-and-add units. A detailed discussion of each component is deferred until Section VI” teaches that the input values and the accumulated update values are stored in an eDRAM buffer (corresponds to the accumulation buffer)).
Chilimbi et al., Ricks et al., Yin et al., and Shafiee et al. are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Chilimbi et al., Ricks et al., and Yin et al. with Shafiee et al., with the motivation to subtract the effective update value from the accumulated update value of the accumulation buffer. “In particular, a balanced inter-layer pipeline with replication, an intra-tile pipeline, efficient handling of signed arithmetic, and bit encoding schemes are required to deliver high throughput and manage the high overheads of ADCs, DACs, and eDRAMs. We note that relative to DaDianNao, ISAAC is able to deliver higher peak computational and power efficiency because of the nature of the crossbar, and in spite of the ADCs accounting for nearly half the chip power. On benchmark CNNs and DNNs, we observe that ISAAC is able to out-perform DaDianNao significantly in early layers” (Shafiee et al., Conclusion). The proposed teaching is beneficial in that it delivers higher peak computational and power efficiency.
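For illustration only, the subtraction described in the quoted Section V passage of Shafiee et al. (the result of a column stored in flipped form being subtracted from the result of the unit column) can be sketched as follows. This is a minimal sketch under stated assumptions: the quoted passage does not give the encoding, so the sketch assumes each flipped weight equals the maximum representable weight minus the original weight, and that the unit-column sum is scaled by that maximum before the subtraction. The function and parameter names are hypothetical.

```python
def dot_product_from_flipped_column(inputs, weights, weight_bits):
    """Recover sum(a_i * w_i) from a column that stores flipped weights.

    Assumption (not stated in the quoted passage): a flipped weight is
    (2**weight_bits - 1) - w, so the original dot product equals the scaled
    unit-column sum minus the flipped-column result.
    """
    max_weight = (1 << weight_bits) - 1
    flipped_weights = [max_weight - w for w in weights]

    # Unit column: the plain sum of the current input values a_i.
    unit_column_result = sum(inputs)

    # Column stored in flipped form: dot product against the flipped weights.
    flipped_column_result = sum(a * w for a, w in zip(inputs, flipped_weights))

    # Subtract the flipped result from the (scaled) unit-column result.
    return max_weight * unit_column_result - flipped_column_result


inputs = [1, 2, 3]
weights = [3, 0, 2]  # 2-bit unsigned weights
print(dot_product_from_flipped_column(inputs, weights, weight_bits=2))
```

Under these assumptions the value returned matches the direct dot product of the inputs with the original weights.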
Regarding Claim 20,
Chilimbi et al. in view of Ricks et al. in view of Yin et al. in view of Shafiee et al. teaches the apparatus of claim 19,
Chilimbi et al. further teaches wherein the effective update value is a portion of the accumulated update value (Chilimbi et al., Para. [0064], “Two different communication protocols for updating parameter weights are described herein. In one embodiment, a communication protocol locally computes and accumulates the weight updates in a buffer that is periodically sent to the global parameter server(s) 706 when a predetermined number, e.g., “k” (which is typically in the hundreds to thousands) of images (e.g., data items) have been processed. This communication is shown by arrows 712 in FIG. 7. The global parameter server(s) 706 then directly apply these accumulated updates to the stored weights. This works well for the convolutional layers since the volume of weights is low due to weight sharing” teaches determining the accumulated updates (corresponds to the effective update value) based on accumulating the weight updates (corresponds to accumulated update value)).
Claim 33 is rejected under 35 U.S.C. 103 as being unpatentable over Chilimbi et al. in view of Ricks et al. in view of Shafiee et al.
Regarding Claim 33,
Chilimbi et al. in view of Ricks et al. teaches the method of claim 32, further comprising
Chilimbi et al. further teaches adding another individual update value to the accumulated update value (Chilimbi et al., Para. [0064], “Two different communication protocols for updating parameter weights are described herein. In one embodiment, a communication protocol locally computes and accumulates the weight updates in a buffer that is periodically sent to the global parameter server(s) 706 when a predetermined number, e.g., “k” (which is typically in the hundreds to thousands) of images (e.g., data items) have been processed. This communication is shown by arrows 712 in FIG. 7. The global parameter server(s) 706 then directly apply these accumulated updates to the stored weights. This works well for the convolutional layers since the volume of weights is low due to weight sharing” teaches determining the accumulated updates (corresponds to the accumulated update value) by accumulating the weight updates (corresponds to accumulating the individual update values)).
Ricks et al. further teaches re-updating the weight using the accumulated update value in response to the accumulated update value being equal to or greater than the threshold value (Ricks et al., Section 3 Pg. 4, “The QNN in Figure 1 is an example of such a network, with sufficient complexity to compute the XOR function. Each input node i is represented by a register, |α⟩_i. The two hidden nodes compute a weighted sum of the inputs, |ψ⟩_i1 and |ψ⟩_i2, and compare the sum to a threshold weight, |ψ⟩_i0. If the weighted sum is greater than the threshold the node goes high. The |β⟩_k represent internal calculations that take place at each node. The output layer works similarly, taking a weighted sum of the hidden nodes and checking against a threshold. The QNN then checks each computed output and compares it to the target output, |Ω⟩_j sending |ϕ⟩_j high when they are equivalent. The performance of the network is denoted by |ρ⟩, which is the number of computed outputs equivalent to their corresponding target output.” teaches updating the weight using the weighted sum (corresponds to the accumulated update value) and comparing the updated value to the threshold value).
Chilimbi et al. in view of Ricks et al. are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they 
Chilimbi et al. in view of Ricks et al. does not appear to explicitly teach subtracting the effective update value from the accumulated update value
However, Shafiee et al. teaches subtracting the effective update value from the accumulated update value (Shafiee et al., Section V Pg. 19, “where a_i refers to the i-th input value. The conversion requires us to compute the sum of the current input values a_i, which is done with one additional column per array, referred to as the unit column. During an IMA operation, the unit column produces the result $\sum_{i=0}^{R-1} f(x) = a_i$. The results of any columns that have been stored in flipped form is subtracted from the results of the unit column. In addition, we need a bit per column to track if the column has original or flipped weights” teaches subtracting the results in flipped form (corresponds to the effective update value) from the sum of the current input values (corresponds to the accumulated update value). Section III Pg. 16, “Each tile is composed of eDRAM buffers to store input values, a number of in-situ multiply-accumulate (IMA) units, and output registers to aggregate results, all connected with a shared bus. The tile also has shift-and-add, sigmoid, and max-pool units. Each IMA has a few crossbar arrays and ADCs, connected with a shared bus. The IMA also has input/output registers and shift-and-add units. A detailed discussion of each component is deferred until Section VI” teaches that the input values and the accumulated update values are stored in an eDRAM buffer (corresponds to the accumulation buffer)).
Chilimbi et al. in view of Ricks et al. in view of Shafiee et al. are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Chilimbi et al. and Ricks et al. with Shafiee et al., with the motivation of subtracting the effective update value from the accumulated update value. “In particular, a balanced inter-layer pipeline with replication, an intra-tile pipeline, efficient handling of signed arithmetic, and bit encoding schemes are required to deliver high throughput and manage the high overheads of ADCs, DACs, and eDRAMs. We note that relative to DaDianNao, ISAAC is able to deliver higher peak computational and power efficiency because of the nature of the crossbar, and in spite of the ADCs accounting for nearly half the chip power. On benchmark CNNs and DNNs, .
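For reference, the flipped-column arithmetic quoted from Shafiee et al. can be checked numerically. The sketch below is illustrative only. It assumes that a w-bit weight W is stored in flipped form as 2^w − 1 − W and that the unit column (all weights equal to 1) yields the plain sum of the inputs, which is then scaled by 2^w − 1. The function names and the 4-bit example values are hypothetical and do not come from the reference.

```python
def flip(weights, w_bits):
    """Store each w-bit weight W as its complement (2**w_bits - 1 - W). Assumed encoding."""
    full_scale = (1 << w_bits) - 1
    return [full_scale - w for w in weights]


def dot(inputs, weights):
    """Plain sum of products, standing in for one crossbar column."""
    return sum(a * w for a, w in zip(inputs, weights))


def recover_from_flipped(inputs, flipped_weights, w_bits):
    """Recover the original sum of products from a column stored in flipped form."""
    full_scale = (1 << w_bits) - 1
    unit_column = sum(inputs)  # result of a column whose weights are all 1
    # Flipped result is subtracted from the (scaled) unit-column result.
    return full_scale * unit_column - dot(inputs, flipped_weights)


inputs = [3, 1, 2]
weights = [15, 12, 14]          # collectively large 4-bit weights
flipped = flip(weights, 4)      # [0, 3, 1]
assert recover_from_flipped(inputs, flipped, 4) == dot(inputs, weights)
```

The identity holds because the sum of a_i·(2^w − 1 − W_i) equals (2^w − 1)·Σa_i minus Σa_i·W_i, so one subtraction restores the original column result.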
Claims 4 and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Chilimbi et al. in view of Ricks et al. in view of Yin et al. in view of Shafiee et al. in further view of Li et al. (“Design of Ternary Neural Network With 3-D Vertical RRAM Array”)
Regarding Claim 4,
Chilimbi et al. in view of Ricks et al. in view of Yin et al. in view of Shafiee et al. teaches the method of claim 2
Chilimbi et al. in view of Ricks et al. in view of Yin et al. in view of Shafiee et al. does not appear to explicitly teach wherein the effective update value is a multiple of the least significant effective bit of the weight
However, Li et al., teaches wherein the effective update value is a multiple of the least significant effective bit of the weight (Li et al., Section II.A Pg. 2722, “Therefore, it has two synaptic weight matrices (WHIH: 400 × 200, and WHHO: 200 × 10). Here the superscript H means high precision format, and subscript IH means input to hidden layer, and HO means hidden layer to output. To support the low precision computation, the MNIST input images are converted to black and white (1-b data). All the value of weights and neurons are regularized between −1 and 1, and those value are converted to binary representation b0b1b2 ... bn, where b0 is the sign of value, b1 is the most significant bit (MSB) and bn is the least significant bit (LSB). Commonly, a high precision format of 6-b (including 1-b for sign) is needed for the synaptic weights for MNIST data set [9]. The reason is that the back-propagation passes the small training errors from the output layer to the input layer, if the precision is insufficient, such small errors will not be accumulated in the weight update” teaches the weight update (corresponds to the effective update value) being a different version of the least significant bit of the synaptic weight).
Chilimbi et al. in view of Ricks et al. in view of Yin et al. in view of Shafiee et al. in view of Li et al. are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Chilimbi et al., Ricks et al., Yin et al., and Shafiee et al. with Li et al., with the motivation that the effective update value is a multiple of the least significant effective bit of the weight. “Through a comparative study of TNN which aggressively reduces the weight precision to ternary levels (+1, 0, −1), this paper takes the further exploration of neuromorphic computing accelerator from 2-D cross-point structure to 3-D vertical integration. On the one side, TNN benefits the implementation with the current available binary RRAM devices to overcome the nonlinearity of weight update problem caused by the premature “analog” synapses. On the other side, compared to the 2-D implementation, the proposed 3-D V-RRAM implementation shows larger write margin for weighted sum/weight update, smaller latency, and energy consumption for weight update” (Li et al., Conclusion). The proposed teaching is beneficial in that it reduces the weight precision to ternary levels and overcomes the weight-update nonlinearity problem.
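For reference, the precision point in the quoted Li et al. passage (small errors are not accumulated in the weight update when the precision is insufficient) can be illustrated with a short sketch. It is not taken from the reference. It assumes that the 6-b format splits into 1 sign bit and 5 magnitude bits for values in [−1, 1], so that the least significant bit is worth 2^−5. The helper name `quantize` and the numeric values are hypothetical.

```python
def quantize(value, magnitude_bits):
    """Round a value in [-1, 1] to a multiple of the LSB, 2**-magnitude_bits."""
    lsb = 2.0 ** -magnitude_bits
    steps = round(value / lsb)               # signed count of LSBs
    max_steps = (1 << magnitude_bits) - 1    # largest representable magnitude
    steps = max(-max_steps, min(max_steps, steps))
    return steps * lsb


MAG_BITS = 5                # assumed split: 1 sign bit + 5 magnitude bits
LSB = 2.0 ** -MAG_BITS      # 0.03125

weight = quantize(0.5, MAG_BITS)
small_update = 0.01         # smaller than half an LSB

# Applying the small update directly and re-quantizing loses it every time.
direct = weight
for _ in range(10):
    direct = quantize(direct + small_update, MAG_BITS)

# Accumulating first lets the update grow to a whole multiple of the LSB.
accumulated = sum(small_update for _ in range(10))
effective = quantize(accumulated, MAG_BITS)          # a multiple of the LSB
accumulated_weight = quantize(weight + effective, MAG_BITS)
```

In this example `direct` stays at 0.5 after ten updates, whereas the accumulated path moves the weight by three LSBs to 0.59375.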
Regarding Claim 21,
Chilimbi et al. in view of Ricks et al. in view of Yin et al. in view of Shafiee et al. teaches the apparatus of claim 19
Chilimbi et al. in view of Ricks et al. in view of Yin et al. in view of Shafiee et al. does not appear to explicitly teach wherein the effective update value is a multiple of the least significant effective bit of the weight
However, Li et al., teaches wherein the effective update value is a multiple of the least significant effective bit of the weight (Li et al., Section II.A Pg. 2722, “Therefore, it has two synaptic weight matrices (WHIH: 400 × 200, and WHHO: 200 × 10). Here the superscript H means high precision format, and subscript IH means input to hidden layer, and HO means hidden layer to output. To support the low precision computation, the MNIST input images are converted to black and white (1-b data). All the value of weights and neurons are regularized between −1 and 1, and those value are converted to binary representation b0b1b2 ... bn, where b0 is the sign of value, b1 is the most significant bit (MSB) and bn is the least significant bit (LSB). Commonly, a high precision format of 6-b (including 1-b for sign) is needed for the synaptic weights for MNIST data set [9]. The reason is that the back-propagation passes the small training errors from the output layer to the input layer, if the precision is insufficient, such small errors will not be accumulated in the weight update” teaches the weight update (corresponds to the effective update value) being regularized and converted binary representation of the least significant bit (corresponds to the multiple of the least significant effective bit of the weight)).
Chilimbi et al. in view of Ricks et al. in view of Yin et al. in view of Shafiee et al. in view of Li et al. are analogous art because they are from the same field of endeavor and .
Claims 7, 9, 12, and 24 are rejected under 35 U.S.C. 103 as being unpatentable over Chilimbi et al. in view of Ricks et al. in further view of Yang et al. (“US 20170061279 A1”) and Stromatias et al. (“Robustness of spiking Deep Belief Networks to noise and reduced bit precision of neuro-inspired hardware platforms”)
Regarding Claim 7,
Chilimbi et al. in view of Ricks et al. teaches the method of claim 1, wherein
Chilimbi et al. in view of Ricks et al. does not appear to explicitly teach the weight is a fixed point value comprising a first sign bit, a first integer part, and a first fractional part
However, Yang et al. teaches the weight is a fixed point value comprising a first sign bit, a first integer part, and a first fractional part (Yang et al., Para. [0013], “Updating an artificial neural network is disclosed. In some embodiments, a node characteristic is represented using a fixed point node parameter and a network characteristic is represented using a fixed point network parameter. For example, an activation value of the artificial neural network is represented as a node characteristic in fixed point number format rather than a floating point number format and a weight value of the artificial neural network is represented as a network characteristic in fixed point number format rather than a floating point number format” teaches the weight value represented in a fixed point number format. Para. [0047], “One example of the fixed point representation format identification specified by the instruction includes a specification of a fixed number of binary digits that represents a fractional component (e.g., number of the digits after a radix point). In some embodiments, the fixed point representation format identifies the number of bits before a radix point. In some embodiments, the fixed point representation format identifies a location of a radix point. In some embodiments, the fixed point representation format identifies that a bit is utilized to identify a sign (e.g., positive or negative) of the value” teaches the fixed point value comprising a fractional component (corresponds to the first fractional part) and a representation of the number of bits before a radix point (corresponds to the first integer part). Yang et al. further teaches the fixed point representation format comprising a bit that is utilized to identify the sign of the value (corresponds to the first sign bit)).
… the updating comprises adding, to the weight, a value of at least one bit that overlaps the first fractional part of the weight among bits representing the second fractional part of the accumulated update value (Yang et al., Para. [0013], “Updating an artificial neural network is disclosed. In some embodiments, a node characteristic is represented using a fixed point node parameter and a network characteristic is represented using a fixed point network parameter. For example, an activation value of the artificial neural network is represented as a node characteristic in fixed point number format rather than a floating point number format and a weight value of the artificial neural network is represented as a network characteristic in fixed point number format rather than a floating point number format” teaches the weight value represented in a fixed point number format. Para. [0040], “In some embodiments, rather than using a single fixed point representation format with a single fixed number of digits after the radix point for all values of the neural network, each value of the neural network is able to be represented using different fixed point representation formats (e.g., each value may have a different number of fixed bits used to represent the number(s) after the radix point). By allowing variable fixed point representation formats, the amount of fractional precision able to be represented using the same number of total bits may be variably modified to dynamically achieve the desired amount of fractional precision and dynamic range of numbers able to be represented” 
Chilimbi et al. in view of Ricks et al. in view of Yang et al. are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Chilimbi et al. and Ricks et al. with Yang et al., with the motivation that the weight is a fixed point value comprising a first sign bit, a first integer part, and a first fractional part and that the updating comprises adding, to the weight, a value of at least one bit that overlaps the first fractional part of the weight among bits representing the second fractional part of the accumulated update value. “However, the precision of a floating point number representation utilizing a large number of bits is likely not needed or even not desirable when updating the neural network. For example, with a large amount of training data and new ways of regularizing numbers, large floating point number representations may not be necessary when updating a neural network. Additionally, imprecision added by not utilizing large floating point number representations may be beneficial when updating neural networks to prevent overfitting of data (e.g., prevent neural network from representing random error/noise instead of the desired data relationship)” (Yang et al., Para. [0039]). The proposed teaching is beneficial in that it prevents overfitting of data.
Chilimbi et al. in view of Ricks et al. in view of Yang et al. does not appear to explicitly teach the accumulated update value is a fixed point value comprising a second sign bit and a second fractional part
However, Stromatias et al. teaches the accumulated update value is a fixed point value comprising a second sign bit and a second fractional part (Stromatias et al., Section 2.4 Pg. 4, “Throughout this paper we use the notation Qm.f to indicate a fixed-point format where m is the number of bits in the integer part, including the sign bit, followed by a notional binary point, and f is the number of bits in the fractional part. This format is a bit-level format for storing a numeric value” teaches the fixed point format including the sign bit and bits in the fractional part. Section 3.5 Pg. 10, “Importantly, note that the low-precision weight matrix WL is used to sample from the network, while the weight update is applied to the higher-precision representation WH, and WL is obtained via rounding. As in standard contrastive divergence, the weight update is calculated from the difference of pairwise correlations of the data-driven layers and the model-driven sample layers. Here, although the activations are calculated from the low-precision weights, the updates are accumulated in the high-precision weights. Then, the weights are checked to be within the maximum bounds of the given resolution (Equation 6) for the given fixed-point precision. Finally, the weights are copied over into the low-precision matrix (Equation 7). The learning can then proceed for another iteration, using the new updated low-precision weight matrix WL. The additional cost of dual-copy rounding is to store a second weight matrix in memory, which is typically not a limiting factor for off-chip learning” teaches the accumulation of the updated weights for the given fixed-point precision).
Chilimbi et al. in view of Ricks et al. in view of Yang et al. in view of Stromatias et al. are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Chilimbi et al. and Ricks et al. in view of Yang et al. with Stromatias et al., with the motivation that the accumulated update value is a fixed point value comprising a second sign bit and a second fractional part. “This article investigates how such hardware constraints impact the performance of spiking neural network implementations of DBNs. In particular, the influence of limited bit precision during execution and training, and the impact of silicon mismatch in the synaptic weight parameters of custom hybrid VLSI implementations is studied. Furthermore, the network performance of spiking DBNs is characterized with regard to noise in the spiking input signal. Our results demonstrate that spiking DBNs can tolerate very low levels of hardware bit precision down to almost two bits, and show that their performance can be improved by at least 30% through an adapted training mechanism that takes the bit precision of the target platform into account. Spiking DBNs thus present an important use-case for large-scale hybrid analog-digital or digital neuromorphic platforms such as SpiNNaker, which can execute large but precision-constrained deep networks in real time” (Stromatias et al., Abstract). The proposed teaching is beneficial in that it tolerates low levels of hardware bit precision and improves performance.
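For reference, the Qm.f notation and the dual-copy rounding described in the quoted Stromatias et al. passages can be sketched as follows. This is a minimal illustration, not the reference's implementation. Equations 6 and 7 of the reference are not reproduced in this action, so the clipping and rounding rules below are assumptions, as are the function names and the Q2.2 example.

```python
def q_bounds(m, f):
    """Smallest and largest values of a signed Qm.f number (m includes the sign bit)."""
    lo = -(2.0 ** (m - 1))
    hi = 2.0 ** (m - 1) - 2.0 ** -f
    return lo, hi


def round_to_q(value, m, f):
    """Clip to the Qm.f range, then round to the nearest multiple of 2**-f."""
    lo, hi = q_bounds(m, f)
    clipped = max(lo, min(hi, value))
    rounded = round(clipped * 2 ** f) / 2 ** f
    return max(lo, min(hi, rounded))


def dual_copy_step(w_high, update, m, f):
    """Accumulate the update in the high-precision copy, then derive the low-precision copy."""
    lo, hi = q_bounds(m, f)
    w_high = max(lo, min(hi, w_high + update))   # assumed bound check on the high-precision copy
    w_low = round_to_q(w_high, m, f)             # low-precision copy used in the next iteration
    return w_high, w_low


# Q2.2: resolution 0.25, range [-2.0, 1.75]
w_high, w_low = 0.25, 0.25
history = []
for _ in range(4):
    w_high, w_low = dual_copy_step(w_high, 0.1, m=2, f=2)
    history.append(w_low)
```

With these values the low-precision weight follows 0.25, 0.5, 0.5, 0.75 over four iterations. Rounding 0.25 + 0.1 directly to Q2.2 at every step would leave the weight at 0.25, which is the loss that the high-precision accumulation avoids.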
Regarding Claim 9,
Chilimbi et al. in view of Ricks et al. in view of Yang et al. in view of Stromatias et al. teaches the method of claim 7, further comprising
Yang et al. further teaches adjusting a position of a decimal point of the accumulated update value (Yang et al., Para. [0049], “In some embodiments, the instruction identifies a desired decimal point placement of the result of the operation. For example, the number of bits that are to be utilized to represent digits after a radix point, before the radix point, a positive/negative sign of the result, and/or the total number of bits to be utilized to represent the result is specified in the instruction as the fixed point representation format of the result. This may allow the result of the operation to be in the desired fixed point representation format that is different from the fixed point representation formats of the operands of the operation” teaches adjusting the desired decimal point placement of the result of the operation (corresponds to the accumulated update value)).
Chilimbi et al. in view of Ricks et al. in view of Yang et al. in view of Stromatias et al. are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Chilimbi et al., Ricks et al. and Yang et al. with Stromatias et al., with the motivation of adjusting a position of a decimal point of the accumulated update value. “However, the precision of a floating point number representation utilizing a large number of bits is likely not needed 
Regarding Claim 12,
Chilimbi et al. in view of Ricks et al. in view of Yang et al. in view of Stromatias et al. teaches the method of claim 7
Yang et al. further teaches wherein the weight is a dynamic fixed point value of which a bit number of the first fractional part is adjusted (Yang et al., Para. [0013], “Updating an artificial neural network is disclosed. In some embodiments, a node characteristic is represented using a fixed point node parameter and a network characteristic is represented using a fixed point network parameter. For example, an activation value of the artificial neural network is represented as a node characteristic in fixed point number format rather than a floating point number format and a weight value of the artificial neural network is represented as a network characteristic in fixed point number format rather than a floating point number format. The fixed point node parameter and the fixed point network parameter are operated to determine a fixed point intermediate parameter” teaches the weight value being represented as a fixed point value. Para. [0040], “In some embodiments, rather than using a single fixed point representation format with a single fixed number of digits after the radix point for all values of the neural network, each value of the neural network is able to be represented using different fixed point representation formats (e.g., each value may have a different number of fixed bits used to represent the number(s) after the radix point). By allowing variable fixed point representation formats, the amount of fractional precision able to be represented using the same number of total bits may be variably modified to dynamically achieve the desired amount of fractional precision and dynamic range of numbers able to be represented” teaches adjusting the different numbers of fractional precision (corresponds to the first fractional part) of the fixed point value).
Chilimbi et al. in view of Ricks et al. in view of Yang et al. in view of Stromatias et al. are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Chilimbi et al., Ricks et al. and Yang et al. with Stromatias et al., with the motivation that the weight is a dynamic fixed point value of which a bit number of the first fractional part is adjusted. “However, the precision of a floating point number representation utilizing a large number of bits is likely not needed or even not desirable when updating the neural network. For example, with a large amount of training data and new ways of regularizing numbers, large floating point number representations may not be necessary when updating a neural network. Additionally, imprecision added by not utilizing large floating 
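For reference, the variable fixed point representation quoted from Yang et al., Para. [0040] (same total number of bits, different number of bits after the radix point) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the reference's implementation. It assumes a word made of 1 sign bit, integer bits, and fractional bits, and the function names are hypothetical.

```python
import math


def fractional_bits_for(max_abs, total_bits):
    """Pick the number of fractional bits so that max_abs fits in a signed
    fixed-point word of total_bits (1 sign bit + integer bits + fractional bits)."""
    if max_abs <= 0:
        return total_bits - 1
    # Integer bits needed to hold floor(max_abs); 0 when max_abs < 1.
    int_bits = max(0, math.floor(math.log2(max_abs)) + 1)
    return max(0, total_bits - 1 - int_bits)


def to_fixed(value, frac_bits):
    """Encode a real value as a signed integer holding value * 2**frac_bits."""
    return round(value * (1 << frac_bits))


def from_fixed(raw, frac_bits):
    """Decode the signed integer back to a real value."""
    return raw / (1 << frac_bits)


# Two values stored in 8-bit words with different radix-point positions.
small_frac = fractional_bits_for(0.75, 8)   # more fractional precision
large_frac = fractional_bits_for(5.5, 8)    # more dynamic range
small_raw = to_fixed(0.75, small_frac)
large_raw = to_fixed(5.5, large_frac)
```

Here the value 0.75 receives 7 fractional bits and the value 5.5 receives 4, so the same 8-bit word trades fractional precision against dynamic range depending on the value it holds.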
Regarding Claim 24,
Chilimbi et al. in view of Ricks et al. teaches the apparatus of claim 18, wherein
Chilimbi et al. in view of Ricks et al. does not appear to explicitly teach the weight is a fixed point value comprising a first sign bit, a first integer part, and a first fractional part and the one or more processors are further configured to add, to the weight, a value of at least one bit that overlaps the first fractional part of the weight among bits representing the second fractional part of the accumulated update value
However, Yang et al., teaches the weight is a fixed point value comprising a first sign bit, a first integer part, and a first fractional part (Yang et al., Para. [0013], “Updating an artificial neural network is disclosed. In some embodiments, a node characteristic is represented using a fixed point node parameter and a network characteristic is represented using a fixed point network parameter. For example, an activation value of the artificial neural network is represented as a node characteristic in fixed point number format rather than a floating point number format and a weight value of the artificial neural network is represented as a network characteristic in fixed point number format rather than a floating point number format” teaches the weight value represented in a fixed point number format. Para. [0047], “One example of the fixed point representation format identification specified by the instruction includes a specification of a fixed number of binary digits that represents a fractional component (e.g., number of the digits after a radix point). In some embodiments, the fixed point representation format identifies the number of bits before a radix point. In some embodiments, the fixed point representation format identifies a location of a radix point. In some embodiments, the fixed point representation format identifies that a bit is utilized to identify a sign (e.g., positive or negative) of the value” teaches the fixed point value comprising of a fractional component (corresponds to the first fractional part) and a representation of the number of bits before a radix point (corresponds to the first integer part). Yang et al. further teaches the fixed point representation format comprising of a bit that is utilized to identify the sign of the value (corresponds to the first sign bit)).
... the one or more processors are further configured to add, to the weight, a value of at least one bit that overlaps the first fractional part of the weight among bits representing the second fractional part of the accumulated update value (Yang et al., Para. [0021], “Computer system 200, which includes various subsystems as described below, includes at least one microprocessor subsystem (also referred to as a processor or a central processing unit (CPU)) 202. For example, processor 202 can be implemented by a single-chip processor or by multiple processors. In some embodiments, processor 202 is a general purpose digital processor that controls the operation of the computer system 200. In some embodiments, processor 202 includes processor 102 shown in FIG. 1. Using instructions retrieved from memory 210, the processor 202 controls the reception and manipulation of input data, and the output and display of data on output devices (e.g., display 218)” teaches one or more processors. Para. [0013], “Updating an artificial neural network is disclosed. In some embodiments, a node characteristic is represented using a fixed point node parameter and a network characteristic is represented using a fixed point network parameter. For example, an activation value of the artificial neural network is represented as a node characteristic in fixed point number format rather than a floating point number format and a weight value of the artificial neural network is represented as a network characteristic in fixed point number format rather than a floating point number format” teaches the weight value represented in a fixed point number format. Para. [0040], “In some embodiments, rather than using a single fixed point representation format with a single fixed number of digits after the radix point for all values of the neural network, each value of the neural network is able to be represented using different fixed point representation formats (e.g., each value may have a different number of fixed bits used to represent the number(s) after the radix point). By allowing variable fixed point representation formats, the amount of fractional precision able to be represented using the same number of total bits may be variably modified to dynamically achieve the desired amount of fractional precision and dynamic range of numbers able to be represented” teaches different number of fixed point representation for fractional precision (corresponds to the first and second fractional part) that are modified (corresponds to the at least one bit that overlaps the first fractional part)).

Chilimbi et al. in view of Ricks et al. in view of Yang et al. does not appear to explicitly teach the accumulated update value is a fixed point value comprising a second sign bit and a second fractional part
However, Stromatias et al. teaches the accumulated update value is a fixed point value comprising a second sign bit and a second fractional part (Stromatias et al., Section 2.4 Pg. 4, “Throughout this paper we use the notation Qm.f to indicate a fixed-point format where m is the number of bits in the integer part, including the sign bit, followed by a notional binary point, and f is the number of bits in the fractional part. This format is a bit-level format for storing a numeric value” teaches the fixed point format including the sign bit and bits in the fractional part. Section 3.5 Pg. 10, “Importantly, note that the low-precision weight matrix WL is used to sample from the network, while the weight update is applied to the higher-precision representation WH, and WL is obtained via rounding. As in standard contrastive divergence, the weight update is calculated from the difference of pairwise correlations of the data-driven layers and the model-driven sample layers. Here, although the activations are calculated from the low-precision weights, the updates are accumulated in the high-precision weights. Then, the weights are checked to be within the maximum bounds of the given resolution (Equation 6) for the given fixed-point precision. Finally, the weights are copied over into the low-precision matrix (Equation 7). The learning can then proceed for another iteration, using the new updated low-precision weight matrix WL. The additional cost of dual-copy rounding is to store a second weight matrix in memory, which is typically not a limiting factor for off-chip learning” teaches the accumulation of the updated weights for the given fixed-point precision).
Chilimbi et al. in view of Ricks et al. in view of Yang et al. in view of Stromatias et al. are analogous art because they are from the same field of endeavor and are from .
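For reference, the limitation at issue for claims 7 and 24 (adding, to the weight, the bits of the accumulated update value that overlap the first fractional part of the weight) together with the subtraction of the effective update value from the accumulated update value can be read as the following integer sketch. This is one possible reading offered for illustration only, not the applicant's disclosed implementation and not a teaching of any cited reference. The function name, the raw-integer encoding, and the bit widths are assumptions.

```python
def apply_overlap(weight_raw, acc_raw, weight_frac_bits, acc_frac_bits):
    """Move the part of the accumulator that overlaps the weight's fractional bits
    into the weight, and keep the remainder in the accumulator.

    Both values are signed integers holding value * 2**frac_bits (assumed encoding).
    """
    shift = acc_frac_bits - weight_frac_bits
    if shift < 0:
        raise ValueError("accumulator must have at least as many fractional bits as the weight")
    effective = acc_raw >> shift        # overlapping bits (floor, sign-preserving in Python)
    weight_raw += effective             # add the effective update to the weight
    acc_raw -= effective << shift       # subtract it from the accumulated update value
    return weight_raw, acc_raw


# Weight 0.5 with 4 fractional bits; accumulator with 8 fractional bits.
weight_raw, acc_raw = 8, 0
for _ in range(4):
    acc_raw += 5                        # one small update, about 0.02 in real terms
    weight_raw, acc_raw = apply_overlap(weight_raw, acc_raw, 4, 8)
```

After four updates the accumulator has reached one weight LSB, so the weight moves from raw 8 to raw 9 and the accumulator keeps a remainder of 4. The combined quantity weight_raw·2^4 + acc_raw is preserved by each call.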
Claims 8 and 25-26 are rejected under 35 U.S.C. 103 as being unpatentable over Chilimbi et al. in view of Ricks et al. in view of Yang et al. in view of Stromatias et al. in further view of Boni et al. (“FPGA Implementation of Support Vector Machines with Pseudo-Logarithmic Number Representation”)
Regarding Claim 8,
Chilimbi et al. in view of Ricks et al. in view of Yang et al. in view of Stromatias et al. teaches the method of claim 7
Chilimbi et al. in view of Ricks et al. in view of Yang et al. in view of Stromatias et al. does not appear to explicitly teach wherein the updating comprises initializing the value of the at least one bit that overlaps the first fractional part to a same value as the second sign bit
However, Boni et al., teaches wherein the updating comprises initializing the value of the at least one bit that overlaps the first fractional part to a same value as the second sign bit (Boni et al., Section III.C Pg. 620, “Starting from the pseudo-logarithmic approximation of a number x, the following pseudo-logarithmic number representation is then defined as follows to represent numbers in a digital computer. An initial sign bit is set to one for negative values of x and is set to zero otherwise. The next wi and wf bits are used to encode the integer n and the fractional f parts, respectively. The integer part is encoded in 2's complement format, while the fractional part is encoded as an unsigned fraction. If x=0, the integer part is set to the largest negative value representable in 2's complement with wi bits, that is −2wi, while the fractional part is set to zero. If x≠0, the integer and fractional parts are set to the integer n and fraction f as described in the pseudo-logarithmic approximation above, applied to the absolute value of x” teaches overlapping to the same value of the 2’s complement format sign bit (corresponds to the second sign bit) to the fractional part).  

Regarding Claim 25,
Chilimbi et al. in view of Ricks et al. in view of Yang et al. in view of Stromatias et al. teaches the apparatus of claim 24
Chilimbi et al. further teaches wherein the one or more processors are further configured to (Chilimbi et al., FIG. 5 and Para. [0048], “Processing unit(s) 612 and can represent, for example, a CPU-type processing unit, a GPU-type processing unit, a field-programmable gate array (FPGA), another class of digital signal processor (DSP), or other hardware logic components that may, in some instances, be driven by a CPU. For example, and without limitation, illustrative types of hardware logic components that can be used include Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), etc. In various embodiments, the processing unit(s) 612 may execute one or more modules and/or processes to cause the server(s) and other machines 610 to perform a variety of functions, as set forth above and explained in further detail in the following disclosure. Additionally, each of the processing unit(s) 612 may possess its own local memory, which also may store program modules, program data, and/or one or more operating systems” teaches the one or more processing unit).
Chilimbi et al. in view of Ricks et al. in view of Yang et al. in view of Stromatias et al. does not appear to explicitly teach initialize the value of the at least one bit that overlaps the first fractional part to a same value as the second sign bit
However, Boni et al., teaches initialize the value of the at least one bit that overlaps the first fractional part to a same value as the second sign bit (Boni et al., Section III.C Pg. 620, “Starting from the pseudo-logarithmic approximation of a number x, the following pseudo-logarithmic number representation is then defined as follows to represent numbers in a digital computer. An initial sign bit is set to one for negative values of x and is set to zero otherwise. The next wi and wf bits are used to encode the integer n and the fractional f parts, respectively. The integer part is encoded in 2's complement format, while the fractional part is encoded as an unsigned fraction. If x=0, the integer part is set to the largest negative value representable in 2's complement with wi bits, that is −2wi, while the fractional part is set to zero. If x≠0, the integer and fractional parts are set to the integer n and fraction f as described in the pseudo-logarithmic approximation above, applied to the absolute value of x” teaches overlapping to the same value of the 2’s complement format sign bit (corresponds to the second sign bit) to the fractional part).  
Chilimbi et al. in view of Ricks et al. in view of Yang et al. in view of Stromatias et al. in view of Boni et al. are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Chilimbi et al., Ricks et al., Yang et al. and Stromatias et al. with Boni et al., with the motivation to initialize the value of the at least one bit that overlaps the first fractional part to a same value as the second sign bit. “The proposed pseudo-logarithmic number representation has shown to be usable in connection with SVM computations without encountering any substantial loss of precision. This makes it a possible contender for hardware implementation of SVM architectures, in particular those implemented as embedded systems or as silicon chips, where the substitution of the costly multipliers 
Regarding Claim 26,
Chilimbi et al. in view of Ricks et al. in view of Yang et al. in view of Stromatias et al. in view of Boni et al. teaches the apparatus of claim 24
Chilimbi et al. further teaches wherein the one or more processors are further configured to (Chilimbi et al., FIG. 5 and Para. [0048], “Processing unit(s) 612 and can represent, for example, a CPU-type processing unit, a GPU-type processing unit, a field-programmable gate array (FPGA), another class of digital signal processor (DSP), or other hardware logic components that may, in some instances, be driven by a CPU. For example, and without limitation, illustrative types of hardware logic components that can be used include Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), etc. In various embodiments, the processing unit(s) 612 may execute one or more modules and/or processes to cause the server(s) and other machines 610 to perform a variety of functions, as set forth above and explained in further detail in the following disclosure. Additionally, each of the processing unit(s) 612 may possess its own local memory, which also may store program modules, program data, and/or one or more operating systems” teaches the one or more processing unit).
Yang et al. further teaches adjust a position of a decimal point of the accumulated update value (Yang et al., Para. [0049], “In some embodiments, the instruction identifies a desired decimal point placement of the result of the operation. For example, the number of bits that are to be utilized to represent digits after a radix point, before the radix point, a positive/negative sign of the result, and/or the total number of bits to be utilized to represent the result is specified in the instruction as the fixed point representation format of the result. This may allow the result of the operation to be in the desired fixed point representation format that is different from the fixed point representation formats of the operands of the operation” teaches adjusting the decimal point placement of the result of the operation (corresponds to the accumulated update value)).
Chilimbi et al. in view of Ricks et al. in view of Yang et al. in view of Stromatias et al. in view of Boni et al. are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Chilimbi et al., Ricks et al., Yang et al., and Stromatias et al. with Boni et al., with motivation to adjust a position of a decimal point of the accumulated update value. “However, the precision of a floating point number representation utilizing a large .
Claims 13-14 and 29 are rejected under 35 U.S.C. 103 as being unpatentable over Chilimbi et al. in view of Ricks et al. in further view of Gysel (“Ristretto: Hardware-Oriented Approximation of Convolutional Neural Networks”)
Regarding Claim 13,
Chilimbi et al. in view of Ricks et al. teaches the method of claim 1, wherein
Chilimbi et al. in view of Ricks et al. does not appear to explicitly teach the weight is a floating point value comprising a first sign bit, a first exponent part, a first mantissa part, and a first bias, the accumulated update value is a floating point value comprising a second sign bit, a second exponent part, a second mantissa part, and a second bias, and the updating comprises adding an effective value of the accumulated update value included in an effective number range of the weight to the weight
However, Gysel teaches the weight is a floating point value comprising a first sign bit, a first exponent part, a first mantissa part, and a first bias (Gysel, Figure 6.1 and Section 6.2, “According to IEEE-754 standard, single precision numbers have 1 sign bit, 8 exponent bits and 23 mantissa bits. The mantissa’s first bit (always ’1’) is added implicitly, and the stored exponent is biased by 127” teaches the IEEE-754 standard (corresponds to the floating point value) consisting of a sign bit, exponent part, and a mantissa part).
the accumulated update value is a floating point value comprising a second sign bit, a second exponent part, a second mantissa part, and a second bias (Gysel, Figure 2.10 and Section 2.6.1, “These two layer types, which are the most resource-demanding part of a deep network, require the same arithmetic operations, namely a series of multiplication-and-accumulation (MAC). In this thesis we simulate the arithmetic of a hardware accelerator. The simulated data path is shown in Figure 2.9. The difference between this simulated data path and the original full precision data path is the quantization step of weights, layer inputs, and layer outputs. Therefore the condensed networks will suffer from quantization errors, which can affect the network accuracy” teaches quantization of network weight (corresponds to update value) and accumulation of the results (corresponds to the accumulated update value) with added bias (corresponds to a second bias). Section 2.6.1.1, “In order to simplify simulation of hardware, our framework uses 32-bit floating point for accumulation” teaches utilizing floating point for accumulation. Figure 6.1 and Section 6.2, “According to IEEE-754 standard, single precision numbers have 1 sign bit, 8 exponent bits and 23 mantissa bits. The mantissa’s first bit (always ’1’) is added implicitly, and the stored exponent is biased by 127” teaches the IEEE-754 standard (corresponds to the floating point value) consisting of a sign bit, exponent part, and a mantissa part).
the updating comprises adding an effective value of the accumulated update value included in an effective number range of the weight to the weight (Gysel, Section 9.2, “Ristretto can condense any 32-bit floating point network to either fixed point, minifloat or integer power of two parameters. Ristretto’s quantization flow has five stages (Figure 9.1). In the first step, the dynamic range of the weights is analyzed to find a compressed number representation” teaches the stages of Ristretto’s quantization flow (corresponds to updating) and finding a compressed number representation with the dynamic range of the weights (corresponds to the effective number range of the weight to the weight). Section 9.4, “Ristretto brews a condensed network with reduced precision weights and layer activations. For simulation of the forward propagation in hardware, Ristretto uses full floating point for accumulation” teaches Ristretto utilizing full floating point for accumulation of the reduced precision weights (corresponds to effective value of the accumulated update value)).
Chilimbi et al. in view of Ricks et al. in view of Gysel are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Chilimbi et al. and Ricks et al. with Gysel, with motivation of the weight is a floating point value comprising a first sign bit, a first exponent part, a first mantissa part, and a first bias, the accumulated update value is a floating point value comprising a second sign bit, a second exponent part, a second mantissa part, and a second bias, and the updating comprises adding an effective value 
Regarding Claim 14,
Chilimbi et al. in view of Ricks et al. in view of Gysel teaches the method of claim 13, further comprising
Gysel further teaches adjusting the second bias of the accumulated update value (Gysel, Section 6.4 Pg. 39-40, “The data path of convolutional and fully connected layers is depicted in Figure 6.2. For simplicity, we only consider fixed precision arithmetic, i.e., all number categories shared the same minifloat format. Similar to the fixed point data path, network parameters and layer inputs are multiplied and accumulated. Input to each multiplier is a pair of numbers, each in minifloat format. The output of each multiplier is 3 bits wider than the input numbers. In a next step, the multiplication results are accumulated in full precision. In a last step the bias is added in minifloat format, and the final result is trimmed to minifloat” teaches determining and adding the bias of the multiplied and accumulated network parameters and layer inputs (corresponds to accumulated update value)).
Chilimbi et al. in view of Ricks et al. in view of Gysel are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Chilimbi et al. and Ricks et al. with Gysel, with motivation of adjusting the second bias of the accumulated update value. “We present Ristretto, a fast and automated framework for CNN approximation. Ristretto simulates the hardware arithmetic of a custom hardware accelerator. The framework reduces the bit-width of network parameters and outputs of resource-intense layers, which reduces the chip area for multiplication units significantly. Alternatively, Ristretto can remove the need for multipliers altogether, resulting in an adder-only arithmetic. The tool fine-tunes trimmed networks to achieve high classification accuracy. Since training of deep neural networks can be time-consuming, Ristretto uses highly optimized routines which run on the GPU. This enables fast compression of any given network” (Gysel, Abstract). The proposed teaching is beneficial in that it reduces the chip area for multiplication units significantly, achieves high classification accuracy, and enables fast compression of any given network.
Regarding Claim 29,
Chilimbi et al. in view of Ricks et al. teaches the apparatus of claim 18, wherein
Chilimbi et al. in view of Ricks et al. does not appear to explicitly teach the weight is a floating point value comprising a first sign bit, a first exponent part, a first mantissa part, and a first bias, the accumulated update value is a floating point value comprising a second sign bit, a second exponent part, a second mantissa part, and a second bias, and the updating comprises adding an effective value of the accumulated update value included in an effective number range of the weight to the weight
However, Gysel teaches the weight is a floating point value comprising a first sign bit, a first exponent part, a first mantissa part, and a first bias (Gysel, Figure 6.1 and Section 6.2, “According to IEEE-754 standard, single precision numbers have 1 sign bit, 8 exponent bits and 23 mantissa bits. The mantissa’s first bit (always ’1’) is added implicitly, and the stored exponent is biased by 127” teaches the IEEE-754 standard (corresponds to the floating point value) consisting of a sign bit, exponent part, and a mantissa part).
the accumulated update value is a floating point value comprising a second sign bit, a second exponent part, a second mantissa part, and a second bias (Gysel, Figure 2.10 and Section 2.6.1, “These two layer types, which are the most resource-demanding part of a deep network, require the same arithmetic operations, namely a series of multiplication-and-accumulation (MAC). In this thesis we simulate the arithmetic of a hardware accelerator. The simulated data path is shown in Figure 2.9. The difference between this simulated data path and the original full precision data path is the quantization step of weights, layer inputs, and layer outputs. Therefore the condensed networks will suffer from quantization errors, which can affect the network accuracy” teaches quantization of network weight (corresponds to update value) and accumulation of the results (corresponds to the accumulated update value) with added bias (corresponds to a second bias). Section 2.6.1.1, “In order to simplify simulation of hardware, our framework uses 32-bit floating point for accumulation” teaches utilizing floating point for accumulation. Figure 6.1 and Section 6.2, “According to IEEE-754 standard, single precision numbers have 1 sign bit, 8 exponent bits and 23 mantissa bits. The mantissa’s first bit (always ’1’) is added implicitly, and the stored exponent is biased by 127” teaches the IEEE-754 standard (corresponds to the floating point value) consisting of a sign bit, exponent part, and a mantissa part).
the one or more processors are further configured to add an effective value of the accumulated update value included in an effective number range of the weight to the weight (Gysel, Abstract, “Since training of deep neural networks can be time-consuming, Ristretto uses highly optimized routines which run on the GPU. This enables fast compression of any given network” teaches Ristretto being implemented on the GPU (corresponds to the processor). Section 9.2, “Ristretto can condense any 32-bit floating point network to either fixed point, minifloat or integer power of two parameters. Ristretto’s quantization flow has five stages (Figure 9.1). In the first step, the dynamic range of the weights is analyzed to find a compressed number representation” teaches the stages of Ristretto’s quantization flow (corresponds to updating) and finding a compressed number representation with the dynamic range of the weights (corresponds to the effective number range of the weight to the weight). Section 9.4, “Ristretto brews a condensed network with reduced precision weights and layer activations. For simulation of the forward propagation in hardware, Ristretto uses full floating point for accumulation” teaches Ristretto utilizing full floating point for accumulation of the reduced precision weights (corresponds to effective value of the accumulated update value)).
Chilimbi et al. in view of Ricks et al. in view of Gysel are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Chilimbi et al. and Ricks et al. with Gysel, with motivation of the weight is a floating point value comprising a first sign bit, a first exponent part, a first mantissa part, and a first bias, the accumulated update value is a floating point value comprising a second sign bit, a second exponent part, a second mantissa part, and a second bias, and the one or more processors are further configured to add an effective value of the accumulated update value included in an effective number range of the weight to the weight. “We present Ristretto, a fast and automated framework for CNN approximation. Ristretto simulates the hardware arithmetic of a custom hardware accelerator. The framework reduces the bit-width of network parameters and outputs of resource-intense layers, which reduces the chip area for multiplication units significantly. Alternatively, Ristretto can remove the need for multipliers altogether, resulting in an adder-only arithmetic. The tool fine-tunes trimmed networks to achieve high classification accuracy. Since training of deep neural networks can be time-consuming, Ristretto uses highly optimized routines which run on the GPU. .
Claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over Chilimbi et al. in view of Ricks et al. in view of Gysel and in further view of Courbariaux et al. (“TRAINING DEEP NEURAL NETWORKS WITH LOW PRECISION MULTIPLICATIONS”)
Regarding Claim 15,
Chilimbi et al. in view of Ricks et al. in view of Gysel teaches the method of claim 14 wherein the adjusting comprises
Chilimbi et al. in view of Ricks et al. in view of Gysel does not appear to explicitly teach increasing the second bias in response to the second exponent of the accumulated update value being greater than the threshold value and decreasing the second bias in response to the accumulated update value being smaller than a second threshold value
However, Courbariaux et al. teaches increasing the second bias in response to the second exponent of the accumulated update value being greater than the threshold value (Courbariaux et al., Algorithm 2 and Section 5 Pg. 4, “In practice, we associate each layer’s weights, bias, weighted sum, outputs (post-nonlinearity) and the respective gradients vectors and matrices with a different scaling factor. Those scaling factors are initialized with a global value. The initial values can also be found during the training with a higher precision format. During the training, we update those scaling factors at a given frequency, following the policy described in Algorithm 2” teaches the scaling factor of the bias being greater than the maximum overflow rate (corresponds to the threshold value). Section 6 Pg. 4, “We use a higher precision for the parameters during the updates than during the forward and backward propagations, respectively called fprop and bprop. The idea behind this is to be able to accumulate small changes in the parameters (which requires more precision) and while on the other hand sparing a few bits of memory bandwidth during fprop” teaches accumulating the parameter updates (corresponds to the accumulated update value) during training. Table 4 Pg. 5, shows the results of the bit width of the parameters updates as well as the datasets).
decreasing the second bias in response to the accumulated update value being smaller than a second threshold value (Courbariaux et al., Algorithm 2 and Section 5 Pg. 4, “In practice, we associate each layer’s weights, bias, weighted sum, outputs (post-nonlinearity) and the respective gradients vectors and matrices with a different scaling factor. Those scaling factors are initialized with a global value. The initial values can also be found during the training with a higher precision format. During the training, we update those scaling factors at a given frequency, following the policy described in Algorithm 2” teaches the scaling factor of the bias being smaller than the maximum overflow rate (corresponds to the second threshold value). Section 6 Pg. 4, “We use a higher precision for the parameters during the updates than during the forward and backward propagations, respectively called fprop and bprop. The idea behind this is to be able to accumulate small changes in the parameters (which requires more precision) and while on the other hand sparing a few bits of memory bandwidth during fprop” teaches accumulating the parameter updates (corresponds to the accumulated update value) during training. Table 4 Pg. 5, shows the results of the bit width of the parameters updates as well as the datasets).
Chilimbi et al. in view of Ricks et al. in view of Gysel in view of Courbariaux et al. are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Chilimbi et al., Ricks et al. and Gysel with Courbariaux et al., with motivation of increasing the second bias in response to the second exponent of the accumulated update value being greater than the threshold value and decreasing the second bias in response to the accumulated update value being smaller than a second threshold value. “We have shown that: • Very low precision multipliers are sufficient for training deep neural networks. • Dynamic fixed point seems well suited for training deep neural networks. • Using a higher precision for the parameters during the updates helps” (Courbariaux et al., Conclusion). The proposed teaching is beneficial in that it has shown that low precision is sufficient for training networks.
Claim 16 is rejected under 35 U.S.C. 103 as being unpatentable over Chilimbi et al. in view of Ricks et al. in view of Gysel in view of Courbariaux et al. in further view of Chang et al. (“Adaptive wavelet thresholding for image denoising and compression”)
Regarding Claim 16,
Chilimbi et al. in view of Ricks et al. in view of Gysel in view of Courbariaux et al. teaches the method of claim 15, wherein
Chilimbi et al. in view of Ricks et al. in view of Gysel in view of Courbariaux et al. does not appear to explicitly teach the second threshold value is 1/b times the threshold value; and b is a natural number.  
However, Chang et al. teaches the second threshold value is 1/b times the threshold value; and b is a natural number (Chang et al., Equation 18 and Section II.B Pg. 2000, “The parameter β does not explicitly enter into the expression of T_B(σ_X), only the signal standard deviation, σ_X, does. Therefore it suffices to estimate directly σ_X or σ_X^2. Recall the observation model is Y = X + V, with X and V independent of each other, hence… where σ_Y^2 is the variance of Y. Since Y is modeled as zero-mean, σ_Y^2 can be found empirically by… where n × n is the size of the subband under consideration… In the case that σ̂^2 ≥ σ̂_Y^2, σ̂_X is taken to be 0. That is, T̂_B(σ̂_X) is ∞, or, in practice, T̂_B(σ̂_X) = max(|Y_ij|), and all coefficients are set to 0. This happens at times when σ is large (for example, σ > 20 for a grayscale image). To summarize, we refer to our method as BayesShrink which performs soft-thresholding, with the data-driven, subband-dependent threshold” teaches determining the noise variance of σ_Y^2 (corresponds to the second threshold value), which is a variance of Y and is equivalent to the threshold T_B, by 1/n^2 (n corresponds to a natural number) times the summation of the variance Y_ij^2 (corresponds to the threshold T_B). Section II.A Pg. 1536, “Fig. 5(a) compares the optimal hard-threshold, T_h*(σ_X, 1), and T_B^h(σ_X) to the soft-thresholds T*(σ_X, 1), and T_B(σ_X)” teaches the soft-threshold T_B(σ_X)).
Claim 34 is rejected under 35 U.S.C. 103 as being unpatentable over Chilimbi et al. in view of Ricks et al. in view of Courbariaux et al. 
Regarding Claim 34,
Chilimbi et al. in view of Ricks et al. teaches the method of claim 31, further comprising
Ricks et al. further teaches wherein the updating comprises updating the weight using the adjusted accumulated update value (Ricks et al., Section 3 Pg. 4, “The QNN in Figure 1 is an example of such a network, with sufficient complexity to compute the XOR function. Each input node i is represented by a register, |α⟩_i. The two hidden nodes compute a weighted sum of the inputs, |ψ⟩_i1 and |ψ⟩_i2, and compare the sum to a threshold weight, |ψ⟩_i0. If the weighted sum is greater than the threshold the node goes high. The |β⟩_k represent internal calculations that take place at each node. The output layer works similarly, taking a weighted sum of the hidden nodes and checking against a threshold. The QNN then checks each computed output and compares it to the target output, |Ω⟩_j sending |ϕ⟩_j high when they are equivalent. The performance of the network is denoted by |ρ⟩, which is the number of computed outputs equivalent to their corresponding target output.” teaches updating the weight using the weighted sum (corresponds to the accumulated update value) and comparing the updated value to the threshold value).
Chilimbi et al. in view of Ricks et al. are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Chilimbi et al. with Ricks et al., with motivation of wherein the updating comprises updating the weight using the adjusted accumulated update value. “A randomized version avoids some of the exponential increases in complexity with problem size. This algorithm is exponential in the number of qubits of each node’s 
Chilimbi et al. in view of Ricks et al. does not appear to explicitly teach adjusting a size of the accumulated update value, by a factor, based on a comparison between a second threshold value and either one or both of an average value of the individual update values and the accumulated update value
However, Courbariaux et al. teaches adjusting a size of the accumulated update value, by a factor, based on a comparison between a second threshold value and either one or both of an average value of the individual update values and the accumulated update value (Courbariaux et al., Section 2 Pg. 2, “Applying a deep neural network (DNN) mainly consists in convolutions and matrix multiplications. The key arithmetic operation of DNNs is thus the multiply-accumulate operation. Artificial neurons are basically multiplier-accumulators computing weighted sums of their inputs. The cost of a fixed point multiplier varies as the square of the precision (of its operands) for small widths while the cost of adders and accumulators varies as a linear function of the precision” teaches fixed point multiplier-accumulators determining the weighted sum (corresponds to the accumulated update value) of their inputs (corresponds to the individual update values) during Section 4 Pg. 3, “Fixed point formats consist in a signed mantissa and a global scaling factor shared between all fixed point variables. The scaling factor can be seen as the position of the radix point. It is usually fixed, hence the name “fixed point”. Reducing the scaling factor reduces the range and augments the precision of the format. The scaling factor is typically a power of two for computational efficiency (the scaling multiplications are replaced with shifts). As a result, fixed point format can also be seen as a floating point format with a unique shared fixed exponent, as illustrated in figure 1” teaches adjusting the radix point of the fixed point variable (corresponds to adjusting of the size) by a scaling factor. Section 5 Pg. 4, “In practice, we associate each layer’s weights, bias, weighted sum, outputs (post-nonlinearity) and the respective gradients vectors and matrices with a different scaling factor. Those scaling factors are initialized with a global value. The initial values can also be found during the training with a higher precision format. During the training, we update those scaling factors at a given frequency, following the policy described in Algorithm 2” teaches the layer’s weights, bias, weighted sum, outputs and respective gradient vectors and matrices (corresponds to the individual update values and accumulated update value) being assessed with an initial value (corresponds to the threshold value)).
Chilimbi et al. in view of Ricks et al. in view of Courbariaux et al. are analogous art because they are from the same field of endeavor and are from the same problem solving area. Namely, they pertain to the field of “neural network”. Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the invention of Chilimbi et al., and Ricks et al. with .

Allowable Subject Matter
Claims 10-11 and 27-28 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Henry T Nguyen whose telephone number is (571)272-8860. The examiner can normally be reached Monday-Friday 7:30am-5:30pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/HENRY TRONG NGUYEN/Examiner, Art Unit 2125
/KAMRAN AFSHAR/Supervisory Patent Examiner, Art Unit 2125