DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Objections
Claims 8 and 9 are objected to because of the following informality: it is unclear whether “the register” (claim 8, line 4) refers to the “first register” of claim 1, line 4. Suggested correction: amend to “the first register”. Claim 9 has the same issue. Appropriate correction is required.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 5, 8, 9, 10, 13, and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Lavigueur et al. 20170236053 in view of Barry et al. 20040093484.

a register-file-memory [data memory 72], configured to store input-data [input] (see fig.2 [data memory 72], [0023], each PE 70 may have a separate corresponding program (P) memory 74 and corresponding data (D) memory 72, such that there is a one-to-one relationship between each PE70, one program memory 74 and one data memory 72.  Alternatively, one or more PEs 70 may be coupled to one or more data memories 74 or program memories 72); 
a first-processing-element-slice [one of PE0-PEN 70] comprising (see one of the PE0-PEN 70 in fig.2); 
a first-register (See [0029] PE’s vector register) configured to store first-register-data [input]; and 
a first-processing-element [one of PE0-PEN 70] (fig.2) configured to apply an arithmetic and logic operation on the first-register-data [input] in order to provide first-convolution-output-data [feature map 88] (See [0029], a PE 70 may include a set of vector registers, multiple MACs to enable parallel convolution calculation to generate multiple elements of an output matrix, such as feature map 88, and special hardware for non-linear transformations (neuron activation simulation)); 

a second-register (See [0029] another PE’s vector register) configured to store second-register-data [input]; and 
a second-processing-element [another one of PE0-PEN 70] (fig.2) configured to apply a convolutional neural network algorithm [parallel convolution calculation] (see fig.2, a simplified block diagram of a convolutional neural network (CNN) architecture 50) to the second-register-data [input] in order to provide second-convolution-output-data [output matrix of another PE 70] (See [0029], a PE 70 may include a set of vector registers, multiple MACs to enable parallel convolution calculation to generate multiple elements of an output matrix, such as feature map 88, and special hardware for non-linear transformations (neuron activation simulation)); and 
a controller [DMA with the data transfer instructions] (see [0030], The data transfer instructions may enable a PE 70 to employ DMA including sequencing loads (resp. stores) between external memory and local memory, push and pop access with FIFOs 62, and load and store vector registers in local memory (data memory 74 in some embodiments)) configured to: 
load input-data from the register-file-memory [data memory 72] into the first-register [vector register] as the first-register-data [input] (see [0027], in some embodiments, a CNN layer nonlinear activation function 86 may be implemented using an LUT mapped in the tightly-coupled data memory. The LUT may be optimized by using large data word accesses in the data memory 72 and by keeping a number of previously looked-up data words in local registers in some embodiments; see also [0030], the data transfer 

, and load (i.e. by loading the register): 
input-data [input] from the register-file-memory [data memory 72 (local memory)] (see the load and store vector register in local memory in [0030]).
Lavigueur does not teach, but Barry teaches:
load the first-register-data [source operand] from the first-register [PE source register] into the second-register [PE destination register] as the second-register-data [result operand]. (See Barry, [0028], the PE register-to-register instruction for transferring the operand from a source register to the destination register.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to load the first-register-data from the first-register into the second-register as the second-register-data, as claimed, because one of ordinary skill in the art would have recognized the application of a known technique, such as the register-to-register transfer taught by Barry, to a known device/method, such as the CNN architecture that includes a plurality of PEs taught by Lavigueur, for the purpose of using the PE registers to transfer the source operand to the result destination (see Barry, [0028]; MPEP 2143, KSR Example D). This could have been readily accomplished by adapting the register-to-register transfer instruction of Barry to Lavigueur so that Lavigueur's PEs recognize it.
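For illustration only, the data path at issue in claim 1 (a load from the register-file-memory into the first-register, followed by a register-to-register transfer into the second-register) can be modeled as below. This is a minimal sketch under stated assumptions: `Slice`, `Controller`, `load_from_memory`, and `transfer` are hypothetical names, and the code does not reproduce the hardware or instruction set of Lavigueur, Barry, or the application.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Slice:
    """One processing-element-slice: a register plus a stand-in processing element."""
    register: List[int] = field(default_factory=list)

    def process(self, kernel: List[int]) -> int:
        # Stand-in for the arithmetic and logic operation: a multiply-accumulate.
        return sum(x * k for x, k in zip(self.register, kernel))


class Controller:
    """Hypothetical controller; not the DMA engine of any cited reference."""

    def __init__(self, memory: List[int], first: Slice, second: Slice) -> None:
        self.memory = memory
        self.first = first
        self.second = second

    def load_from_memory(self, address: int, width: int) -> None:
        # Load input-data from the register-file-memory into the first-register.
        self.first.register = self.memory[address:address + width]

    def transfer(self) -> None:
        # Register-to-register move: first-register-data becomes second-register-data.
        self.second.register = list(self.first.register)


memory = [1, 2, 3, 4, 5]
first, second = Slice(), Slice()
ctrl = Controller(memory, first, second)
ctrl.load_from_memory(0, 3)   # first-register holds [1, 2, 3]
ctrl.transfer()               # second-register holds [1, 2, 3]
ctrl.load_from_memory(1, 3)   # first-register holds [2, 3, 4]
```

In this sketch the second slice reuses data already fetched for the first slice, so only the first-register is ever filled from memory.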
As to claim 5, Lavigueur teaches: 

wherein each of the one or more further-processing-element-slices [PE0-PEN] comprises: 
a further-register configured to store further-register-data (see [0029], a PE 70 may include a set of vector registers, multiple MACs to enable parallel convolution calculation to generate multiple elements of an output matrix, such as feature map 88, and special hardware for non-linear transformations (neuron activation simulation)); 
a further-processing-element [PEN] configured to apply a convolutional neural network algorithm [parallel convolution calculation] (see the CNN architecture in fig.2) to the further-register-data in order to provide further-convolution-output-data [feature map 88] (see fig.2, [0029], a PE 70 may include a set of vector registers, multiple MACs to enable parallel convolution calculation to generate multiple elements of an output matrix, such as feature map 88, and special hardware for non-linear transformations (neuron activation simulation). An output matrix, such as feature map 88, may comprise multiple consecutive element positions of the same matrix or same element position of different output matrices. In some embodiments, a PE 70 may also enable data to be moved in and out of its memory while performing computations (on other data) in parallel); 
wherein the controller is configured to: 
load: input-data [input] from the register-file-memory [data memory 72 (local memory)] (see the load and store vector register in local memory in [0030]) or 

As to claim 8, Lavigueur teaches: 
register-file-memory [data memory 72] comprises 
a register-file-block [LUT] associated with each of the processing-element-slices (see each PE 70 may have a separate corresponding program (P) memory 74 and corresponding data (D) memory 72, such that there is a one-to-one relationship between each PE 70, one program memory 74 and one data memory 72 in [0023]; see also an LUT mapped in the tightly-coupled data memory. The LUT may be optimized by using large data word accesses in the data memory 72 and by keeping a number of previously looked-up data words in local registers in some embodiments in [0027]); and 
the controller [DMA] is configured to load input-data [data word] into the register [local registers] of a processing-element-slice [PE 70] from the associated register-file-block [LUT]. (See an LUT mapped in the tightly-coupled data memory. The LUT may be optimized by using large data word accesses in the data memory 72 and by keeping a number of previously looked-up data words in local registers in some embodiments in [0027]; for the controller, see the data transfer instructions may enable a PE 70 to employ DMA including sequencing loads (resp. stores) between external memory and local memory, push and pop access with FIFOs 62, 
As to claim 9, Lavigueur teaches wherein the controller (e.g. DMA) is configured to load input-data [data word] into the register [local register] of a processing-element-slice [PE] from a register-file-block [LUT] associated with a different processing-element-slice [more PE] (See an LUT mapped in the tightly-coupled data memory. The LUT may be optimized by using large data word accesses in the data memory 72 and by keeping a number of previously looked-up data words in local registers in some embodiments in [0027]; see alternatively, one or more PEs 70 may be coupled to one or more data memories 74 (2) or program memories 72 (4) in [0023]; for the controller, see the data transfer instructions may enable a PE 70 to employ DMA including sequencing loads (resp. stores) between external memory and local memory, push and pop access with FIFOs 62, and load and store vector registers in local memory (data memory 74 (72) in some embodiments)).
As to claim 10, Lavigueur teaches further comprising a look-up-table [look-up table] configured to apply a non-linear function [nonlinear activation] to the convolution-output-data [Multiple 2d convolutions accumulated with bias: Bias + ∑] provided by each of the processing-elements [PEs 70] in order to provide feature-map-output-data [feature map O0-m]. (See fig.3; para [0025], FIG. 3 is a simplified block diagram of activities in a CNN layer 80 such as layer 40 shown in FIG. 1C in accordance with some embodiments of the presently disclosed method and apparatus.  In some embodiments, the PEs 70 of the CNN architecture 50 have a specialized instruction set to optimize both the main computation steps of the CNN and the typical data movements required to implement those computational steps (i.e., reading data from and 
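For illustration only, a look-up-table that applies a non-linear function to accumulated convolution output (bias plus a sum of products) might be sketched as follows. The table size, the input range, the use of tanh, and the helper names `build_lut` and `lut_apply` are all assumptions made for this sketch; none of them is taken from Lavigueur or the application.

```python
import math
from typing import Callable, List


def build_lut(fn: Callable[[float], float], lo: float, hi: float, entries: int) -> List[float]:
    """Sample fn at evenly spaced points in [lo, hi]."""
    step = (hi - lo) / (entries - 1)
    return [fn(lo + i * step) for i in range(entries)]


def lut_apply(lut: List[float], lo: float, hi: float, x: float) -> float:
    """Approximate fn(x) by clamping x to [lo, hi] and reading the nearest table entry."""
    x = min(max(x, lo), hi)
    index = round((x - lo) / (hi - lo) * (len(lut) - 1))
    return lut[index]


# Hypothetical activation table: tanh sampled at 257 points over [-4, 4].
LUT = build_lut(math.tanh, -4.0, 4.0, 257)

# Convolution-output-data for one output element: bias plus a sum of products.
bias = 0.5
products = [0.25, -0.75]
conv_out = bias + sum(products)                        # 0.0
feature_map_value = lut_apply(LUT, -4.0, 4.0, conv_out)
```

The table trades accuracy for speed: each activation costs one clamp, one index computation, and one memory read instead of evaluating the function directly.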
As to claim 13, Lavigueur teaches the human-machine-interface system of claim 1, wherein the input-data is representative of non-contact human-machine-interface-signals [e.g. speech/image]. (see fig.2 [50] for the CNN architecture; see also fig.3 for details of a CNN layer of fig.2; see [0006] [0007] for the background teaching and applications of the Convolutional Neural Network (CNN), such as pattern recognition, visual object detection, speech recognition, captured images or audio sequences; see also [0009] for the summary of invention for optimized CNN architecture and applications, such as computer vision, augmented reality, advanced driver assistance systems, video surveillance and robotics).
As to claim 14, Lavigueur teaches the human-machine-interface system of claim 1, wherein the input-data comprises analogue sensor data [e.g. speech/image]. (see fig.2 [50] for the CNN architecture; see also fig.3 for details of a CNN layer of fig.2; see [0006] [0007] for the background teaching and applications of the Convolutional Neural Network (CNN), such as pattern recognition, visual object detection, speech recognition, captured images or audio sequences; see also [0009] for the summary of invention for optimized CNN architecture and applications, such as computer vision, augmented reality, advanced driver assistance systems, video surveillance and robotics).
Claim 2 is rejected under 35 U.S.C. 103 as being unpatentable over Lavigueur et al. 20170236053 in view of Barry et al. 20040093484, as applied to claim 1 above, and in further view of Grinberg et al. 20160283441.
As to claim 2, neither Lavigueur nor Barry teaches, but Grinberg teaches, the human-machine-interface system of claim 1, wherein the controller (see the PE with SIMD copy instruction as a controller, [0044]) is configured to: 
load a first subset of the input-data [U4k] from the register-file-memory [210] into the first-register [DR0 648] as the first-register-data [U4k]; and 
load: a second subset of the input-data [U4k+1] from the register-file-memory [210] into the second-register [DR1 650] as the second-register-data [U4k+1]; 
wherein the first subset of input-data [U4k] is different to the second subset of input-data [U4k+1] (See fig.6, [U4k] ≠ [U4k+1]; see the SIMD copy instruction for copying the vector data subsets into the vector data registers [0044]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include a controller to load a first subset of the input-data from the register-file-memory into the first-register as the first-register-data; and load: (i) a second subset of the input-data from the register-file-memory into the second-register as the second-register-data; wherein the first subset of input-data is different to the second subset of input-data, as claimed (see the details of the claim mapping above), because one of ordinary skill in the art would have recognized the application of a known technique, such as the loading/copying of the subsets of the vector data [U0-3] into the respective registers [DR0-3] as taught by Grinberg, to a known device/method, such as the CNN architecture that includes a 0-U3] into the corresponding data vector registers (see Grinberg, [0044]; MPEP 2143, KSR Example D).
Claims 3 and 4 are rejected under 35 U.S.C. 103 as being unpatentable over Lavigueur et al. 20170236053 in view of Barry et al. 20040093484, as applied to claim 1 above, and in further view of Gorshtein et al. 5923871.
As to claim 3, neither Lavigueur nor Barry teaches, but Gorshtein teaches:
an intermediate-register [REG. 1 1023]; and wherein the controller (see fig.10) is configured to: 
load the first-register-data [operand 1 1012/operand 2 1014] from the first-register [OP1 register/OP2 register] into the intermediate-register [REG. 1 1023] as intermediate-data [intermediate result] (see the first intermediate register 1023 accepting two partial product vectors of the product fraction from the adder tree 1022 and generating intermediate results of a product exponent computation and control signals in col.17, lines 52-55. The partial products originate from the multiplication of the operands 1012, 1014 by the multiplication unit 1000 in col.17, lines 1-12 and are added by the adder tree 1022 in col.17, lines 39-41); and 
load the intermediate-data [intermediate result] from the intermediate-register [REG. 1 1023] into the second-register [result register 1031] as the second-register-data [result data] (see the final result is obtained by the fraction adder 1024 for a full product, col.17, lines 56-62, and selected by the result multiplexer 1030, col.18, lines 23-26, and finally the result register 1031 holds the result data in col.18, line 29).

As to claim 4, neither Lavigueur nor Barry teaches, but Gorshtein teaches:
a plurality of intermediate-registers [1019][1020][1022][1023][1024][1028][1030] (Note: each of the units has its input and output ports, which are registers), serially connected between the first-register [OP1 register/OP2 register] and the second-register [result register 1031] (see fig.10).
The rationale for obviousness given for claim 3 also applies to claim 4 and is not repeated here.
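For illustration only, registers serially connected between a first-register and a second-register, as recited in claims 3 and 4, behave like a shift chain in which data advances one stage per cycle. The sketch below assumes a pure pass-through chain with a hypothetical `clock` helper; it does not model the multiplier pipeline of Gorshtein, whose stages also transform the data.

```python
from typing import List, Optional


def clock(chain: List[Optional[str]], new_value: Optional[str]) -> List[Optional[str]]:
    """Advance one cycle: every register takes the value held by its predecessor."""
    return [new_value] + chain[:-1]


# Index 0 is the first-register, index 1 an intermediate-register,
# index 2 the second-register (an assumed three-stage chain).
chain: List[Optional[str]] = [None, None, None]
chain = clock(chain, "a")   # ["a", None, None]
chain = clock(chain, "b")   # ["b", "a", None]
chain = clock(chain, "c")   # ["c", "b", "a"]
```

After three cycles the value first loaded into the first-register has reached the second-register by way of the intermediate-register.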
Claims 6 and 7 are rejected under 35 U.S.C. 103 as being unpatentable over Lavigueur et al. 20170236053 in view of Barry et al. 20040093484, as applied to claim 1 above, and in further view of Maaninen 20150199963.

a weights-memory [RAM weight Matrix Row 14], configured to provide weights-data [weight] to each processing-element [MAC unit 10] of the processing-element-slices [MAC units 10] (see fig.4 [RAM weight Matrix Row 14], para [0048] for the memory that stores weight; see fig.4 [MAC 10], para [0051] for plurality of MAC units to achieve real-time speech recognition); 
wherein each processing-element [MAC unit 10] of the processing-element-slices [MAC units 10] is configured to apply the arithmetic and logic operation based on the weights-data (see [0049], after the buffers are loaded with data, MAC 10 may perform a multiply-accumulate operation by multiplying a weight matrix row with input matrix, adding the bias vector and accumulating the result. Once all rows of the weight matrix for the layer are processed, hardware accelerator 312 may start processing a next layer of the neural network by fetching a new bias vector, starting to decode the next weight matrix, etc., and using the preceding layer's output circulated through internal memory as MAC input for the next layer).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include a weights-memory, configured to provide weights-data to each processing-element of the processing-element-slices; wherein each processing-element of the processing-element-slices is configured to apply the arithmetic and logic operation based on the weights-data, as claimed (see the details of the claim mapping above), because one of ordinary skill in the art would have recognized the application of a known technique, such as the weight memory that provides the weight to the MAC for computation as taught in Maaninen, to a known device/method, such as the CNN architecture that includes a plurality of PEs as taught in Lavigueur, for the purpose of performing a multiply-accumulate operation by 
As to claim 7, neither Lavigueur nor Barry teaches, but Maaninen teaches:
a bias-memory [RAM Bias Vector 16], configured to provide bias-data [bias] to each processing-element [MAC unit 10] of the processing-element-slices [MAC units 10] (see fig.4 [RAM Bias Vector 16], para [0048] for the memory that stores bias; see fig.4 [MAC 10], para [0051] for plurality of MAC units to achieve real-time speech recognition); 
wherein each processing-element [MAC unit 10] of the processing-element-slices [MAC units 10] is configured to apply the arithmetic and logic operation based on the bias-data [bias] (see [0049], after the buffers are loaded with data, MAC 10 may perform a multiply-accumulate operation by multiplying a weight matrix row with input matrix, adding the bias vector and accumulating the result. Once all rows of the weight matrix for the layer are processed, hardware accelerator 312 may start processing a next layer of the neural network by fetching a new bias vector, starting to decode the next weight matrix, etc., and using the preceding layer's output circulated through internal memory as MAC input for the next layer).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include a bias-memory, configured to provide bias-data to each processing-element of the processing-element-slices; wherein each processing-element of the processing-element-slices is configured to apply the arithmetic and logic operation based on the bias-data, as claimed (see the details of the claim mapping above), because one of ordinary skill in the art would have recognized the application of a known technique, such .
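For illustration only, the multiply-accumulate operation described in the cited passage (a weight matrix row multiplied with the input, plus a bias term) can be sketched as below. The function name `mac_layer` and the list-based layout are assumptions for this sketch; Maaninen's buffering, weight decoding, and layer sequencing are not modeled.

```python
from typing import List


def mac_layer(weight_rows: List[List[float]], bias: List[float], inputs: List[float]) -> List[float]:
    """For each weight row, accumulate bias + sum(weight * input)."""
    outputs = []
    for row, b in zip(weight_rows, bias):
        acc = b                       # bias-data seeds the accumulator
        for w, x in zip(row, inputs):
            acc += w * x              # one multiply-accumulate step
        outputs.append(acc)
    return outputs


# Two output neurons, two inputs; all values are made up for the example.
layer_out = mac_layer([[1, 2], [3, 4]], [10, 20], [1, 1])
```

Each row of the weights-memory yields one output value, so a layer with N rows needs N passes through the inner loop, which is the work the parallel MAC units share.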
Claims 11 and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Lavigueur et al. 20170236053 in view of Barry et al. 20040093484, as applied to claim 10 above, and in further view of Tsai et al. 20180173676.
As to claim 11, neither Lavigueur nor Barry teaches, but Tsai teaches, wherein the controller (fig.3 matrix engine 230) is configured to write (i.e. output) the feature-map-output-data [output feature map] into the register-file-memory [output buffer 340]. (See [0035], the output buffer 340 may be used to temporarily store values of an output feature map (i.e., the convolution result of the pixel values and the filter weights), or at least a portion of the output feature map).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to include the controller configured to write the feature-map-output-data into the register-file-memory, as claimed, because one of ordinary skill in the art would have recognized the application of a known technique, such as Tsai’s controller for outputting the feature map output to a buffer, to a known device/method, such as the CNN architecture that includes a plurality of PEs as taught in Lavigueur, for the purpose of temporarily storing values of an output feature map (i.e., the convolution result of the pixel 
As to claim 12, Lavigueur teaches the human-machine-interface system of claim 11, wherein a processing-element [PE] is configured to add [sum] the feature-map-output-data (see fig.3, the feature map is the input matrix of the convolution function) in order to provide output-classification-data [classification]. (See [0040], the Conv3( ) function may be a special scenario/state occurring in the last layer of a CNN-application in which classification may be performed by summing all the input matrix after trivial convolution with a 1 x 1 kernel. For the PE, see [0030], The neuron activation simulation instructions may enable a PE 70 to perform various non-linear operation applied on the convolution results to simulate neuron activations).
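For illustration only, classification by summing the input matrices after a trivial 1 x 1 convolution can be sketched as below. The sketch assumes that a 1 x 1 kernel reduces to one scalar weight per input feature map and that the class with the largest sum is selected; the names `class_score` and `classify` are hypothetical, and the actual Conv3( ) function of Lavigueur is not reproduced.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def class_score(feature_maps: List[Matrix], kernel_1x1: List[float]) -> float:
    """Scale each feature map by its 1 x 1 weight, then sum every element."""
    total = 0.0
    for fmap, w in zip(feature_maps, kernel_1x1):
        for row in fmap:
            for value in row:
                total += w * value
    return total


def classify(feature_maps: List[Matrix], kernels_per_class: List[List[float]]) -> Tuple[int, List[float]]:
    """Return the index of the highest-scoring class and all class scores."""
    scores = [class_score(feature_maps, k) for k in kernels_per_class]
    return scores.index(max(scores)), scores


# Two 2 x 2 feature maps and two candidate classes; values are made up.
maps = [[[1, 2], [3, 4]], [[1, 1], [1, 1]]]
label, scores = classify(maps, [[1, 2], [2, -1]])
```

Here class 0 scores 1*10 + 2*4 = 18 and class 1 scores 2*10 - 1*4 = 16, so class 0 is selected.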
Claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over Lavigueur et al. 20170236053 in view of Barry et al. 20040093484, as applied to claim 1 above, and in further view of Wagner et al. 20160313801.
As to claim 15, neither Lavigueur nor Barry teaches, but Wagner teaches, wherein the human-machine-interface system comprises a gesture recognition system. (See Wagner, the gesture recognition of the neural network CNN in [0182] and [0194]).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention for the human-machine-interface system to comprise a gesture recognition system, as claimed, because one of ordinary skill in the art would have recognized the application of a known technique, such as Wagner’s CNN for gesture recognition as cited above, to a known device/method, such as the CNN architecture that includes a .
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.  
a) Johnson et al. 20170024632 is cited for teaching that the convolution sums the products of image data with the corresponding elements of the weight matrix, resulting in a single value that is added to the bias to produce the signal (See [0053]);
b) Molchanov et al. 20170206405 is cited for teaching detecting and classifying dynamic hand gestures [0049], and a 3D convolution layer 205 that performs 3D convolution on the training data stream to produce feature maps [0053];
c) Utku Aydonat et al., “An OpenCL™ Deep Learning Accelerator on Arria 10”, ACM 2017, is cited for the teaching of convolution neural networks (Section 2.1).
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DANIEL H PAN whose telephone number is (571)272-4172. The examiner can normally be reached M-F 8:30 am -5:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jyoti Mehta can be reached on 571 270 3995. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.


DANIEL H. PAN
Examiner
Art Unit 2182



/DANIEL H PAN/
Primary Examiner, Art Unit 2182