DETAILED ACTION
This action is in response to communications filed on 12/29/2020 in which claims 1-20 have been amended; claims 1-20 are still pending.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Priority
Applicant’s claim for benefit as a continuation application of International Patent Application No. PCT/CN2017/099991, filed August 31, 2017, is acknowledged.

Information Disclosure Statement
The information disclosure statements (IDS) submitted on 03/09/2020 and 05/19/2020 have been considered by the examiner. Only content provided in English was considered by the examiner, as noted in the IDS.
Drawings
The drawings were received on 10/23/2018.  These drawings are acceptable.

Response to Arguments
Applicant’s arguments filed 12/29/2020 have been fully considered.

Applicant’s remarks regarding the claim interpretation of claims under 35 USC § 112(f), have been fully considered and upon further review of amended claim limitation, the examiner notes that the 

Applicant’s arguments filed 12/29/2020 with respect to the 35 USC § 112(b) rejection, have been fully considered and upon further review of amended claim limitation, the rejection made in the previous office action has been withdrawn.

Applicant’s arguments with respect to the rejection of claims under 35 USC § 103, have been fully considered. Applicant’s arguments with respect to claim(s) 1-20 have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.
See the full rejection of the amended claims in the current Office action.

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.


Claims 1, 2, 17, and 19 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Azarkhish et al. (NPL: “Neurostream: Scalable and Energy Efficient Deep Learning with Smart Memory Cubes”, hereinafter ‘Aza’).

Regarding independent claim 1 limitations, Aza teaches a processing device, comprising:
a hardware main processing integrated circuit [in Fig.1(a) Host SoC; and host HD camera]; 
a plurality of hardware basic processing integrated circuits that are separate from the main processing integrated circuit; and [in Fig. 1(a): plurality of C cluster circuits connected to a SMC branch circuit separated from the main processing]
a plurality of hardware branch processing circuits [Fig. 1(a): plurality of SMC processing circuits as depicted in Fig. 1(b)], 
wherein each of the plurality of hardware branch processing circuits connects the main processing integrated circuit to a distinct subset of the plurality of hardware basic processing integrated circuits [as depicted in Fig. 1 each SMC connects the host to the subset of cluster processing ICs via the links, in pg. 12: Right Col. Sec. C: …The host system-on-chip (SoC) is only responsible for coordination and receiving the results. It does not send or receive data at a high bandwidth, yet we keep its serial link (Link0) always active, to make sure it can manage the other devices through that link. The other serial links, however, are turned on only when there is a data to send over them, and then turned off again…; connected to a distinct cluster subset of basic processing circuits in an SMC as depicted in Fig. 1(b);],
 each distinct subset comprising multiple hardware basic processing integrated circuits directly connected to the corresponding hardware branch processing circuit [each SMC NeuroCluster plurality of C cluster subset of basic circuits as depicted in Fig. 1(b), in pg. 5: Left Col: last para: NeuroCluster (Illustrated in Figure 1b) is a flexible general purpose clustered many-core platform, designed based on energy-efficient RISC-V processing-elements (PEs) [45] and NeuroStream (NST) co-processors (described in subsection III-B), all grouped in tightly-coupled clusters… ; where each cluster processing IC subset are directly connected to each corresponding SMC IC hardware circuit, as depicted in Fig. 1, in pg. 5: Sec. III(A): NeuroCluster (Illustrated in Figure 1b) is a flexible general purpose clustered many-core platform, designed based on energy-efficient RISC-V processing-elements (PEs) [45] and NeuroStream (NST) co-processors (described in subsection III-B), all grouped in tightly-coupled clusters… The cluster-interconnect has been designed based on the logarithmic-interconnect proposed in [49] to provide low-latency all-to-all connectivity inside the clusters. Also, the AXI-4 based global-interconnect, connecting the clusters, follows the same architecture as the SMC-Interconnect [24] to achieve a very high bandwidth.];  
wherein: the hardware main processing integrated circuit is configured to transmit data to the plurality of hardware branch processing circuits [Fig. 1(a): depicts how the processing circuits are configured to transmit data as noted by the above limitations, in pg. 12: Right Col. Sec. C: …The host system-on-chip (SoC) is only responsible for coordination and receiving the results. It does not send or receive data at a high bandwidth, yet we keep its serial link (Link0) always active, to make sure it can manage the other devices through that link. The other serial links, however, are turned on only when there is a data to send over them, and then turned off again…; sending data including input data,  in pg. 12: Right Col:  …The camera sends the images to the memory cubes over the highlighted links in Figure 1a, and each SMC executes ResNet on one complete frame, independently from the other cubes… and coefficients, in pg. 12 Sec. C: …Each SMC has a copy of the ConvNet coefficients inside its DRAM dies, and the coefficients have been preloaded once at the beginning…]; 
each of the plurality of hardware branch processing circuits is configured to forward the data transmitted by the hardware main processing integrated circuit to the distinct subset of the plurality of hardware basic processing integrated circuits connected thereto; [host configured to set data to the SMC branch circuits as discussed above as depicted in Fig. 1(a); and  SMCs configured to forward transmitted data to the set clusters ICs, in Fig. 1(b) and, in pg. 12 Sec. C: …Each SMC has a copy of the ConvNet coefficients inside its DRAM dies, and the coefficients have been preloaded once at the beginning. The host system-on-chip (SoC) is only responsible for coordination and receiving the results. It does not send or receive data at a high bandwidth, yet we keep its serial link (Link0) always active, to make sure it can manage the other devices through that link. The other serial links, however, are turned on only when there is a data to send over them, and then turned off again…; forwarding data from the host including input data,  in pg. 12: Right Col:  …The camera sends the images to the memory cubes over the highlighted links in Figure 1a, and each SMC executes ResNet on one complete frame, independently from the other cubes… and coefficients, in pg. 12 Sec. C: …Each SMC has a copy of the ConvNet coefficients inside its DRAM dies, and the coefficients have been preloaded once at the beginning…];
each of the plurality of hardware basic processing integrated circuits is configured to: receive a first set of data [image data] forwarded from the connected hardware branch processing unit, wherein different hardware basic processing integrated circuits receive different first set of data [Clusters configured as depicted in Fig.1 to receive data as image data, in pg. 12: Right Col:  …The camera sends the images to the memory cubes over the highlighted links in Figure 1a, and each SMC executes ResNet on one complete frame, independently from the other cubes…; input volumes as depicted in Fig. 3, in pg. 6 Left Col: … The input volume (e.g. the image or video frame) is loaded into this area before each run… Each cluster executes one 4D-tile at a time with all its NSTs working cooperatively to compute its final result inside the cluster’s SPM… ]; 
receive a second set of data forwarded from the connected hardware branch processing circuit [each basic cluster circuit is configured to receive a second set including coefficient K filters for computing operations as depicted in Fig. 3, in pg. 6 Right Col.: …The actual execution takes place layer-by-layer, each layer being parallelized over 16 clusters. Each cluster executes one 4D-tile at a time with all its NSTs working cooperatively to compute its final result inside the cluster’s SPM... The output dimensions of each tile are calculated directly from input width and height, filter dimensions, striding, and zero-padding parameters. 4D-tiles have three main features essential for near-memory acceleration of deep ConvNets:…; for performing convolution on convolutional network layers, in pg. 2: Right Col. Sec II(A): ConvNets are built from the connection of five classes of layers: convolutional (CONV), … CONV is the core building block of the ConvNets doing most of the computational heavy-lifting for feature extraction. It essentially consists of Multiply-and-accumulate (MAC) operations as shown below [28]: 
    (equation image: media_image1.png)
… c indexes the input channels (Cil), and K denotes the convolution kernels (a.k.a filters)…], 
wherein each hardware basic processing integrated circuit receives the same second set of data [same coefficient are loaded including: depicted in Fig. 3: as the convolutional kernel, in pg. 6 Left Col. Last para.: When a ConvNet such as GoogLeNet is selected for execution over our PIM system, first it is tiled using the 4D-tiling mechanism described in subsection IV-A. This procedure prepares it for parallel execution over the clusters, … Next, all coefficients are loaded in SMC’s DRAM…; for performing convolution on convolutional network layers, in pg. 2: Right Col. Sec II(A): ConvNets are built from the connection of five classes of layers: convolutional (CONV), … CONV is the core building block of the ConvNets doing most of the computational heavy-lifting for feature extraction. It essentially consists of Multiply-and-accumulate (MAC) operations as shown below [28]: 
    (equation image: media_image1.png)
… c indexes the input channels (Cil), and K denotes the convolution kernels (a.k.a filters)…];
perform a set of operations on the first and second sets of data received by that hardware basic processing integrated circuit [in pg. 2 Right Col. …ConvNets are built from the connection of five classes of layers: convolutional (CONV), activation (ACT), pooling (POOL), fully-connected (FC), and classification (CLASS)[28]. CONV is the core building block of the ConvNets doing most of the computational heavy-lifting for feature extraction. It essentially consists of Multiply-and-accumulate (MAC) operations as shown below… including convolution operations (CONV) on first and second data sets including the input image tile and kernels for generating output volume as depicted in Fig. 3, in pg. 6 Right Col.: ... A 4D-tile (illustrated in Figure 3a,b) is a subset of the input volume (called Input-tile) and output volume (Output-tile) of a convolutional layer (l) identified by the … the tile width and height of the input volume of layer l, and …the number of input and output channels to the tile. The output dimensions of each tile are calculated directly from input width and height, filter dimensions, striding, and zero-padding parameters. 4D-tiles have three main features essential for near-memory acceleration of deep ConvNets:…];
and return an operation result to the connected hardware branch processing circuit [returned operation results for synchronization, in pg. 6 Right Col: …Each cluster executes one 4D-tile at a time with all its NSTs working cooperatively to compute its final result inside the cluster’s SPM. Only at the end of each layer the clusters are synchronized…; or as partial sums operation results returned to SMC branch circuit’s DRAMs, in pg. 7 Left. Col.: Partial Computations: Tiling of channels … requires maintaining partial computations, as more than one input tile contribute to the result of each output tile…After all input tiles have been read once, activation and pooling are directly performed on the output tile D (again inside the SPM) and then D is written back to the DRAM by the associated PE…The A, B, and C regions of T l+1 are written to DRAM after T1l, T3l, and T4,l are computed, respectively, using small DMA chunks shown in Figure 3f…]; 
the plurality of hardware basic processing integrated circuits perform the respective sets of operations in parallel [in pg. 6 Left Col. Last para.: When a ConvNet such as GoogLeNet is selected for execution over our PIM system, first it is tiled using the 4D-tiling mechanism described in subsection IV-A. This procedure prepares it for parallel execution over the clusters, and optimally partitions it to achieve the highest efficiency under given constraints such as on-die SPM and DRAM bandwidth usage...];  each of the plurality of hardware branch processing circuits is configured to forward the operation results returned from the distinct subset of the plurality of hardware basic processing integrated circuit connected thereto to the hardware main processing integrated circuit [Fig. 1 shows connections configured to return results to the SoC host, in pg. 12 Right Col. Sec C: … Each SMC has a copy of the ConvNet coefficients inside its DRAM dies, and the coefficients have been preloaded once at the beginning. The host system-on-chip (SoC) is only responsible for coordination and receiving the results…; including the computational results from the branch interconnect circuitry for synchronization of each plurality of cluster branch circuits, in pg. 6 Right Col: …Each cluster executes one 4D-tile at a time with all its NSTs working cooperatively to compute its final result inside the cluster’s SPM. Only at the end of each layer the clusters are synchronized…; and the coordination of the partial sums received via DRAM connected to the Host SoC via interconnection branch circuitry as depicted in Fig. 1, in pg. 7: Left Col: Partial Computations: Tiling of channels… requires maintaining partial computations, as more than one input tile contribute to the result of each output tile… ]; 
and the hardware main processing integrated circuit is configured to perform a set of arithmetic operations in series on the operation results forwarded from the plurality of hardware branch processing circuits [main Host SoC is configured to coordinate operations and receive results, in pg. 12 Right Col. Sec C: … Each SMC has a copy of the ConvNet coefficients inside its DRAM dies, and the coefficients have been preloaded once at the beginning. The host system-on-chip (SoC) is only responsible for coordination (e.g. configured to perform a set of arithmetic operations in series on the operation results forwarded from the plurality of hardware branch processing circuits) and receiving the results…; and send instructions for the coordination of the received partial sums via DRAM connected to the Host SoC via interconnection branch circuitry as depicted in Fig. 1, in pg. 7 Left Col: … we perform the following steps to compute each output tile: Tile A (See Figure 3d) and the related filter coefficients (KAD) are fetched from the DRAM. Then, D = D+A∗KAD is computed inside the SPM (D containing partial sums of the output channels). Next, Tile B and KBD are fetched from the DRAM, and D = D + B ∗ KBD is computed, and so forth (e.g. arithmetic operations in series on the operation results forwarded from the plurality of hardware branch processing circuits). After all input tiles have been read once, activation and pooling are directly performed on the output tile D (again inside the SPM) and then D [the operation results] is written back to the DRAM by the associated PE. This mechanism reduces DRAM’s write bandwidth and puts more pressure on read bandwidth, given that shrunk data (after pooling and strided-convolution) are written back to DRAM (e.g. configured to perform a set of arithmetic, as pooling and stride convolution, operations in series on the operation results forwarded from the plurality of hardware branch processing circuits), once after several DRAM reads…].
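
For illustration only, the tile-wise accumulation quoted above from pg. 7 of Aza (D = D + A∗KAD, then D = D + B∗KBD, and so forth) can be sketched as below. This is a simplified sketch and not code from the reference: it assumes one-dimensional tiles and kernels, and the names mac_1d and accumulate_output_tile, along with the sample values, are hypothetical.

```python
# Illustrative sketch only: 1-D tiles and kernels are an assumption made for
# brevity; the function names and sample data are hypothetical.

def mac_1d(tile, kernel):
    """Slide the kernel over the tile, using multiply-accumulate (MAC) steps."""
    out_len = len(tile) - len(kernel) + 1
    out = []
    for i in range(out_len):
        acc = 0
        for j, k in enumerate(kernel):
            acc += tile[i + j] * k  # one multiply-accumulate step
        out.append(acc)
    return out


def accumulate_output_tile(input_tiles, kernels):
    """Build output tile D by adding one partial result per input tile, in series."""
    out_len = len(input_tiles[0]) - len(kernels[0]) + 1
    d = [0] * out_len  # D holds the running partial sums
    for tile, kernel in zip(input_tiles, kernels):
        partial = mac_1d(tile, kernel)           # e.g. A * K_AD
        d = [x + y for x, y in zip(d, partial)]  # D = D + A * K_AD
    return d


tiles = [[1, 2, 3, 4], [0, 1, 0, 1]]  # stand-ins for Tile A and Tile B
kernels = [[1, 1], [2, 0]]            # stand-ins for K_AD and K_BD
result = accumulate_output_tile(tiles, kernels)
print(result)
```

In this sketch each pass through the loop corresponds to one fetched input tile, and the output tile is complete only after every input tile has contributed its partial sum.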


    (figure image: media_image2.png)


    (figure image: media_image3.png)


Regarding independent claim 2 limitations, Aza teaches a processing device, for performing operations in a neural network [in pg. 1 Right Col: …Convolutional neural networks (ConvNets) are known as the SoA ML algorithms specialized at BIC, loosely inspired by the organization of the human brain [4]. ConvNets process raw data directly, combining the classical models of feature extraction and classification into a single algorithm…; and pg. 2 Right Col: …All these emerging DL models can be future targets for our PIM proposal, yet, in this paper we focus on ConvNets for image and video…]
using a first data set and a second data set according to an operation instruction  [Clusters configured as depicted in Fig.1 to receive data as image data, in pg. 12: Right Col:  …The camera sends the images to the memory cubes over the highlighted links in Figure 1a, and each SMC executes ResNet on one complete frame, independently from the other cube; …and coefficients, in pg. 12 Sec. C: …Each SMC has a copy of the ConvNet coefficients inside its DRAM dies, and the coefficients have been preloaded once at the beginning…; in pg. 12 Sec. C: …Each SMC has a copy of the ConvNet coefficients inside its DRAM dies, and the coefficients have been preloaded once at the beginning. The host system-on-chip (SoC) is only responsible for coordination and receiving the results. It does not send or receive data at a high bandwidth, yet we keep its serial link (Link0) always active, to make sure it can manage the other devices through that link. The other serial links, however, are turned on only when there is a data to send over them, and then turned off again…]
the processing device comprising: a hardware main processing integrated circuit; [in Fig.1(a) Host SoC; and host HD camera];
and a plurality of hardware basic processing integrated circuits; [in Fig. 1(a): plurality of C cluster circuits]
wherein: the hardware main processing integrated circuit is configured to: split the first data set into a plurality of distinct basic data blocks; [Fig. 1(a): depicts how the processing circuits are configured to transmit data as noted by the above limitations as split data blocks in Fig. 3, in pg. 12: Right Col. Sec. C: …The host system-on-chip (SoC) is only responsible for coordination and receiving the results. It does not send or receive data at a high bandwidth, yet we keep its serial link (Link0) always active, to make sure it can manage the other devices through that link. The other serial links, however, are turned on only when there is a data to send over them, and then turned off again…; sending data including input data, as the claimed split the first data set,  in pg. 12: Right Col:  …The camera sends the images to the memory cubes over the highlighted links in Figure 1a, and each SMC executes ResNet on one complete frame, independently from the other cubes… and coefficients, in pg. 12 Sec. C: …Each SMC has a copy of the ConvNet coefficients inside its DRAM dies, and the coefficients have been preloaded once at the beginning…];
distribute the plurality of distinct basic data blocks to the plurality of hardware basic processing integrated circuits, wherein each of the plurality of distinct basic data blocks is distributed to one of the plurality of hardware basic processing integrated circuits and at least two hardware basic processing integrated circuits receive different basic data blocks; [Clusters configured as depicted in Fig. 1 to receive data as image data, the claimed split the first data set, in pg. 12: Right Col: …The camera sends the images to the memory cubes over the highlighted links in Figure 1a, and each SMC executes ResNet on one complete frame, independently from the other cubes…; input volumes as depicted in Fig. 3, in pg. 6 Left Col: … The input volume (e.g. the image or video frame) is loaded into this area before each run… Each cluster executes one 4D-tile (the claimed split the first data set) at a time with all its NSTs working cooperatively to compute its final result inside the cluster’s SPM… ]; 
identify a broadcast data block from the second data set; and broadcast the broadcast data block to the plurality of hardware basic processing integrated circuits, wherein each of the plurality of hardware basic processing integrated circuits receive the same broadcast data block; [same coefficient are loaded including: depicted in Fig. 3: as the convolutional kernel, as the claimed identify a broadcast data block from the second data set, in pg. 6 Left Col. Last para.: When a ConvNet such as GoogLeNet is selected for execution over our PIM system, first it is tiled using the 4D-tiling mechanism described in subsection IV-A. This procedure prepares it for parallel execution over the clusters, … Next, all coefficients are loaded in SMC’s DRAM (hardware basic processing integrated circuits receive the same broadcast data block)…; for performing convolution on convolutional network layers, in pg. 2: Right Col. Sec II(A): ConvNets are built from the connection of five classes of layers: convolutional (CONV), … CONV is the core building block of the ConvNets doing most of the computational heavy-lifting for feature extraction. It essentially consists of Multiply-and-accumulate (MAC) operations as shown below [28]: 
    (equation image: media_image1.png)
… c indexes the input channels (Cil), and K denotes the convolution kernels (a.k.a filters) (hardware basic processing integrated circuits receive the same broadcast data block)…];
each of the hardware basic processing integrated circuits is configured to: receive a  corresponding basic data block distributed by the hardware main processing integrated circuit and the broadcast data block broadcasted by the hardware main processing integrated circuit; [in pg. 2 Right Col. …ConvNets are built from the connection of five classes of layers: convolutional (CONV), activation (ACT), pooling (POOL), fully-connected (FC), and classification (CLASS)[28]. CONV is the core building block of the ConvNets doing most of the computational heavy-lifting for feature extraction. It essentially consists of Multiply-and-accumulate (MAC) operations as shown below… including convolution operations (CONV) on first and second data blocks including the input image tile (claimed corresponding basic data block distributed by the hardware main processing integrated circuit) and kernels (claimed broadcast data block broadcasted by the hardware main processing integrated circuit)  for generating output volume as depicted in Fig. 3, in pg. 6 Right Col.: ... A 4D-tile (illustrated in Figure 3a,b) is a subset of the input volume (called Input-tile) and output volume (Output-tile) of a convolutional layer (l) identified by the … the tile width and height of the input volume of layer l, and …the number of input and output channels to the tile. The output dimensions of each tile are calculated directly from input width and height, filter dimensions, striding, and zero-padding parameters. 4D-tiles have three main features essential for near-memory acceleration of deep ConvNets:…];
perform an operation in the neural network on the received basic data block and the received broadcast data block; [performing convolution operations (CONV) on first and second data blocks including the input image tile (claimed corresponding basic data block distributed by the hardware main processing integrated circuit) and kernels (claimed broadcast data block broadcasted by the hardware main processing integrated circuit)  for generating output volume as depicted in Fig. 3, in pg. 6 Right Col.: ... A 4D-tile (illustrated in Figure 3a,b) is a subset of the input volume (called Input-tile) and output volume (Output-tile) of a convolutional layer (l) identified by the … the tile width and height of the input volume of layer l, and …the number of input and output channels to the tile. The output dimensions of each tile are calculated directly from input width and height, filter dimensions, striding, and zero-padding parameters. 4D-tiles have three main features essential for near-memory acceleration of deep ConvNets:…];
 and return an operation result to the hardware main processing integrated circuit; [returned operation results to the host IC, in pg. 12: Right Col. Sec. C: …The host system-on-chip (SoC) is only responsible for coordination and receiving the results. It does not send or receive data at a high bandwidth, yet we keep its serial link (Link0) always active, to make sure it can manage the other devices through that link. The other serial links, however, are turned on only when there is a data to send over them, and then turned off again…; returned operation results for synchronization, in pg. 6 Right Col: …Each cluster executes one 4D-tile at a time with all its NSTs working cooperatively to compute its final result inside the cluster’s SPM. Only at the end of each layer the clusters are synchronized…; or as partial sums operation results returned to SMC branch circuit’s DRAMs, in pg. 7 Left. Col.: Partial Computations: Tiling of channels … requires maintaining partial computations, as more than one input tile contribute to the result of each output tile…After all input tiles have been read once, activation and pooling are directly performed on the output tile D (again inside the SPM) and then D is written back to the DRAM by the associated PE…The A, B, and C regions of T l+1 are written to DRAM after T1l, T3l, and T4,l are computed, respectively, using small DMA chunks shown in Figure 3f…]; 
the plurality of hardware basic processing integrated circuits perform respective operations in parallel; [parallelization of operations as depicted in Fig. 7 and Fig. 8B as discussed in the above limitations and in 0060: …The operations amount to repeated matrix-matrix multiplications {A}x{B}={C} where {A}, {B} and {C} are sets of input, kernel and output matrices…. Then each matrix-matrix operation AxB is parallelized along the columns of B: a MAPLE PE 108 computes one element in the final matrix.];
and the hardware main processing integrated circuit is configured to perform a set of arithmetic operations in series on the operation results returned from the plurality of hardware basic processing integrated circuits [main Host SoC is configured to coordinate operations and receive results, in pg. 12 Right Col. Sec C: … Each SMC has a copy of the ConvNet coefficients inside its DRAM dies, and the coefficients have been preloaded once at the beginning. The host system-on-chip (SoC) is only responsible for coordination (e.g. configured to perform a set of arithmetic operations in series on the operation results) and receiving the results…; and send instructions for the coordination of the received partial sums via DRAM connected to the Host SoC as depicted in Fig. 1, in pg. 7 Left Col: … we perform the following steps to compute each output tile: Tile A (See Figure 3d) and the related filter coefficients (KAD) are fetched from the DRAM. Then, D = D+A∗KAD is computed inside the SPM (D containing partial sums of the output channels). Next, Tile B and KBD are fetched from the DRAM, and D = D + B ∗ KBD is computed, and so forth (e.g. arithmetic operations in series on the operation results). After all input tiles have been read once, activation and pooling are directly performed on the output tile D (again inside the SPM) and then D [the operation results] is written back to the DRAM by the associated PE. This mechanism reduces DRAM’s write bandwidth and puts more pressure on read bandwidth, given that shrunk data (after pooling and strided-convolution) are written back to DRAM (e.g. configured to perform a set of arithmetic, as pooling and stride convolution, operations in series on the operation results), once after several DRAM reads…].
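
For illustration only, the data flow recited in the claim 2 limitations above (split and distribute distinct basic data blocks, broadcast one shared block, operate in parallel, then operate in series on the returned results) can be sketched as below. This is a minimal sketch under stated assumptions and is not taken from either the application or the cited reference: software threads stand in for the hardware basic processing integrated circuits, the operation is assumed to be a row-by-vector inner product, and the names split_into_blocks, basic_unit, and main_unit are hypothetical.

```python
# Illustrative sketch only: threads stand in for hardware circuits, and all
# function names and sample data are hypothetical.
from concurrent.futures import ThreadPoolExecutor


def split_into_blocks(rows, num_units):
    """Main unit: split the first data set into distinct, contiguous blocks."""
    size = -(-len(rows) // num_units)  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def basic_unit(block, broadcast_vector):
    """Basic unit: inner product of each row in its block with the shared vector."""
    return [sum(a * b for a, b in zip(row, broadcast_vector)) for row in block]


def main_unit(matrix, vector, num_units=2):
    blocks = split_into_blocks(matrix, num_units)  # distribute distinct blocks
    with ThreadPoolExecutor(max_workers=num_units) as pool:
        # Every worker receives the same vector (the broadcast data block).
        partials = list(pool.map(lambda blk: basic_unit(blk, vector), blocks))
    # Serial step performed by the main unit on the returned results.
    products = []
    checksum = 0
    for partial in partials:
        for value in partial:
            products.append(value)
            checksum += value
    return products, checksum


matrix = [[1, 2], [3, 4], [5, 6], [7, 8]]  # first data set
vector = [1, -1]                           # second data set, broadcast to all units
products, checksum = main_unit(matrix, vector)
print(products, checksum)
```

The sketch keeps the distributed blocks distinct per worker while the vector is shared, which mirrors the distinction the claim draws between distributed basic data blocks and the broadcast data block.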


    (figure image: media_image2.png)


    (figure image: media_image3.png)


Regarding claim 17, the rejection of claim 1 is incorporated. Aza further teaches the claim limitation(s):
wherein the data comprises at least one of a vector, a matrix, a three-dimensional data block, a four-dimensional data block, or an n-dimensional data block. [claimed a four-dimensional data block, or an n-dimensional data block in pg. 7: Sec. IV(A): A 4D-tile (illustrated in Figure 3a,b) is a subset of the input volume…; matrix in pg. 9 Left Col: … the NSTs using a series of STREAM MAC operations (Y is the output gradient of each layer which is propagated backward to the input X, and T stands for matrix transpose)..; vector, in pg. 10: Right Col last para: … the NSTs implement the additional required functions such as argmax and vector multiply...; 3D input volume depicted in Fig. 3(a) for producing 4D convolutions in Fig. 3b]
	
Regarding independent claim 19, Aza teaches a method, implemented by a processing device, for performing operations in a neural network [in pg. 1 Right Col: …Convolutional neural networks (ConvNets) are known as the SoA ML algorithms specialized at BIC, loosely inspired by the organization of the human brain [4]. ConvNets process raw data directly, combining the classical models of feature extraction and classification into a single algorithm.,…; and pg. 2 Right Col: : …All these emerging DL models can be future targets for our PIM proposal, yet, in this paper we focus on ConvNets for image and video…],
the claim limitations are similar to those in claim 1 limitations and are therefore rejected under the same rationale.

Claims 1-2, 4, 7, 10-11, 17, and 19-20 are rejected under 35 U.S.C. 102(a)(1) and 102(a)(2) as being anticipated by Cadambi et al. (US Pub. No. 2011/0119467, hereinafter ‘Cad’).

Regarding independent claim 1 limitations, Cad teaches a processing device, comprising:
a hardware main processing integrated circuit [host general purpose processor as depicted in Fig. 4, in 0035: … The processor 400 is connected to a general-purpose host via a communication interface such as PCI. A high-bandwidth bus connects each core 100 to an off chip instruction memory bank 402…]; 
a plurality of hardware basic processing integrated circuits that are separate from the main processing integrated circuit; and [N plurality of vector processing elements (PEs), as depicted in Fig. 4 and Fig. 1, in 0027: … FIG. 1, an exemplary design for a MAPLE processing core is shown. Each core 100 has p = N·M processing elements (PEs) 108. The PEs 108 are organized as M processing chains 104, having N PEs 108 each.…]
and a plurality of hardware branch processing circuits [a plurality of M processing chains, as depicted in Fig. 1, in 0027: Referring now to the drawings in which like numerals represent the same or similar elements and initially to FIG. 1, an exemplary design for a MAPLE processing core is shown. Each core 100 has p= N ·M processing elements (PEs) 108. The PEs 108 are organized as M processing chains 104, having N PEs 108 each…],
wherein each of the plurality of hardware branch processing circuits connects the main processing integrated circuit to a distinct subset of the plurality of hardware basic processing integrated circuits, [each processing chain IC connects the host to the subset of N PEs, as depicted in Fig. 1 and Fig. 4, in 0027: Referring now to the drawings in which like numerals represent the same or similar elements and initially to FIG. 1, an exemplary design for a MAPLE processing core is shown. Each core 100 has p= N ·M processing elements (PEs) 108. The PEs 108 are organized as M processing chains 104, having N PEs 108 each…; and in 0035: Referring now to FIG. 4, an overall MAPLE accelerator 400 is shown, comprising C processing cores 100. The processor 400 is connected to a general-purpose host via a communication interface such as PCI. A high-bandwidth bus connects each core 100 to an off chip instruction memory bank 402…];
each distinct subset comprising multiple hardware basic processing integrated circuits directly connected to the corresponding hardware branch processing circuit [each subset of N PE basic circuits is directly connected to a corresponding branch processing chain circuit, as depicted in Fig. 1, in 0027: … The PEs 108 are organized as M processing chains 104, having N PEs 108 each…];
wherein: the hardware main processing integrated circuit is configured to transmit data to the plurality of hardware branch processing circuits [Fig. 1 and Fig. 4 depict how the processing ICs are configured to transmit data as noted in the above limitations, in 0035: … The processor 400 is connected to a general-purpose host via a communication interface such as PCI. A high-bandwidth bus connects each core 100 to an off chip instruction memory bank 402…; and in claim 10: wherein the processing cores further comprise an input store that receives input data from a host and passes said input data to the processing chains.];
each of the plurality of hardware branch processing circuits is configured to forward the data transmitted by the hardware main processing integrated circuit to the distinct subset of the plurality of hardware basic processing integrated circuits connected thereto; [the host is configured to input data to the branch circuits, as discussed above and as depicted in Fig. 1 and Fig. 4; and the processing chain branch circuits are configured to forward the transmitted data to the set of PE basic units, as depicted in Fig. 1 and Fig. 4, in 0027: …Each chain 104 has a bi-directional, nearest neighbor interconnect between the PEs 108 along which inputs are propagated in one direction and outputs in the other. The first PE 108-N-1 in every chain accepts inputs from an input local store 102. In an alternative embodiment, each chain 104 has a separate input buffer 102, such that a stall in one chain would not affect the other chains.]
each of the plurality of hardware basic processing integrated circuits is configured to: receive a first set of data forwarded from the connected hardware branch processing unit, wherein different hardware basic processing integrated circuits receive different first set of data [the claimed ICs are configured, as depicted in Fig. 4 and Fig. 1 and discussed in the previous limitations, to receive data sets as input data, in 0028: … Each PE 108 takes two vector operands as inputs, one from its local store 106 and the other streaming from the input buffer 102.; where the first input, which is different for each of the PE basic processing ICs, is the claimed first set of data, as depicted in Fig. 7, in 0061: Referring now to FIG. 7, one method of parallelization is shown, where each column of matrix B (the claimed first set of data) is loaded into the local stores of PEs (0,0) and (1,0) (i.e., the first PEs in chains 0 and 1). The image rows are streamed in one by one and broadcast to the two chains, resulting in PE (0,0) computing column 0, and PE (1,0) computing column 1 of the output…; or parallelized as in 0061: … Another schedule is shown in the bottom of the figure where columns 0 and 1 of matrix B (the claimed first set of data) are duplicated in 2 PE local stores 106 in each chain 104. Both rows are streamed in together; therefore all four output elements are computed simultaneously, making it twice as fast…; and as depicted in Fig. 8(b), in 0063: FIG. 8b illustrates a parallelization mode of 1, wherein each column of B is represented once in the processing elements and a single element of A is streamed at a time.…];
receive a second set of data forwarded from the connected hardware branch processing circuit Atty. Dkt. No. 10015-01-0002-US Reply to Office Action of -3- LIU et al. September 29, 2020 Application No. 16/168,778, wherein each hardware basic processing integrated circuit receives the same second set of data [the claimed ICs are configured, as depicted in Fig. 4 and Fig. 1 and discussed in the previous limitations, to receive data sets as input data, in 0028: … Each PE 108 takes two vector operands as inputs, one from its local store 106 and the other streaming from the input buffer 102.; where the second input, which is the same for all of the PE basic processing ICs, is the claimed second set of data, as depicted in Fig. 7, in 0061: Referring now to FIG. 7, one method of parallelization is shown, where each column of matrix B (the claimed first set of data) is loaded into the local stores of PEs (0,0) and (1,0) (i.e., the first PEs in chains 0 and 1). The image rows (the claimed second set of data) are streamed in one by one and broadcast to the two chains, resulting in PE (0,0) computing column 0, and PE (1,0) computing column 1 of the output…; or parallelized as in 0061: … Another schedule is shown in the bottom of the figure where columns 0 and 1 of matrix B (the claimed first set of data) are duplicated in 2 PE local stores 106 in each chain 104. Both rows are streamed in together (Matrix A, as the claimed second set of data); therefore all four output elements are computed simultaneously, making it twice as fast…; and as depicted in Fig. 8(b), in 0063: FIG. 8b illustrates a parallelization mode of 1, wherein each column of B is represented once in the processing elements and a single element of A (the claimed second set of data) is streamed at a time.…];
perform a set of operations on the first and second sets of data received by that hardware basic processing integrated circuit [performing convolutions on Matrix B and Matrix A, the claimed first and second sets of data respectively, in 0060: …This is repeated for On output images, each with the same In inputs but a different set of weights. MAPLE's support for data access patterns allows convolutions to be expressed as matrix operations. The operations amount to repeated matrix-matrix multiplications {A}x {B}={C} where {A}, {B} and {C} are sets of input, kernel and output matrices…Then each matrix-matrix operation AxB is parallelized along the columns of B: a MAPLE PE 108 computes one element in the final matrix.];
and return an operation result to the connected hardware branch processing circuit [operation results are returned for each compute phase in the chain of N PE basic circuits, in 0037: …Each chain has a COMPUTE phase that lasts for L cycles, where L is the operand vector size. The COMPUTE phase is followed by a STORE phase where the outputs of the N PEs 108 in the chain 104 are collected and stored in the memory block…];
the plurality of hardware basic processing integrated circuits perform the respective sets of operations in parallel [parallelization of operations as depicted in Fig. 7 and Fig. 8(b), as discussed in the above limitations, and in 0060: …The operations amount to repeated matrix-matrix multiplications {A}x {B}={C} where {A}, {B} and {C} are sets of input, kernel and output matrices…. Then each matrix-matrix operation AxB is parallelized along the columns of B: a MAPLE PE 108 computes one element in the final matrix.];
and the hardware main processing integrated circuit is configured to perform a set of arithmetic operations in series on the operation results forwarded from the plurality of hardware branch processing circuits [the main host is configured to coordinate operations on the operation results using the reduce module, as depicted in Fig. 1, in 0036: Each core 100 also has its own separate instruction memory bank 402 that is written by the host…; and to send instructions for the coordination of the received multiplications to compute the sums, in 0060: … The core computation in one layer is the convolution of In input images with L, kernels and their pixel-wise summation to produce one output image. …; and in 0028-0030: …A PE chain 104 sends its outputs to its respective smart memory block 110, which can perform in-memory processing such as array ranking, finding maxima and minima, and aggregation… For example, a matrix multiplication can be implemented by first distributing columns of a constant matrix to all PE local stores 106. Then the rows of a second matrix are streamed across each PE chain 104, and the result is streamed into the smart memory blocks 110... The contents of the smart memory blocks 110 can be aggregated and written to off chip storage. This implements a "reduce network" 112, by which the data from a particular location in all M smart memory blocks 110 can be operated before writing off-chip. The reduce operation may include summation (the claimed arithmetic operations in series on the operation results forwarded from the plurality of hardware branch processing circuits) or finding minima or maxima.; and in series over L cycles, as depicted in Fig. 8(b), in 0063: ...In the case of split columns, the smart memory 110 will accumulate results from the PEs processing a column before performing its reduction operation (the claimed arithmetic operations in series on the operation results forwarded from the plurality of hardware branch processing circuits)…].
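
For illustration only, the data flow mapped in the limitations above (a different column of matrix B held by each PE, the same row of matrix A broadcast to every PE, and a serial summation over returned results) may be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Cad: the function names, the list-based "local stores", and the sample matrices are all hypothetical.

```python
# Illustrative sketch only; all names here are hypothetical, not from Cad.

def distribute_broadcast_multiply(A, B):
    """Compute C = A x B with one column of B held per processing element (PE)."""
    n_inner, n_cols = len(B), len(B[0])
    # "First set of data": each PE holds a DIFFERENT column of B in its local store.
    local_stores = [[B[k][j] for k in range(n_inner)] for j in range(n_cols)]
    C = []
    for row in A:
        # "Second set of data": the SAME row of A is broadcast to every PE.
        # Each PE computes one dot product; the PEs do not depend on one another.
        C.append([sum(a * b for a, b in zip(row, col)) for col in local_stores])
    return C


def reduce_sum(results):
    """Serial summation over returned results, standing in for a reduce step."""
    total = 0
    for r in results:  # one addition after another, i.e. in series
        total += r
    return total


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = distribute_broadcast_multiply(A, B)
row_sum = reduce_sum(C[0])
```

The per-PE dot products are written as a loop here for readability; in hardware they would run concurrently, whereas the summation in `reduce_sum` is inherently sequential.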


Regarding independent claim 2 limitations, Cad teaches a processing device, for performing [in 0022: In designing a parallel accelerator for learning and classification applications, five representative workloads are considered: Supervised Semantic Indexing, Convolutional Neural Networks,…; and in 0059-0060: Convolutional neural networks (CNNs) are 2-dimensional neural networks used for pattern recognition… CNN classification uses 1D or 2D convolutions followed by arithmetic operations and sub-sampling. The core computation in one layer is the convolution of In input images with L, kernels and their pixel-wise summation to produce one output image…MAPLE's support for data access patterns allows convolutions to be expressed as matrix operations...]
using a first data set and a second data set according to an operation instruction [performing convolutions on Matrix B and Matrix A, the claimed first and second sets of data respectively, in 0060: …This is repeated for On output images, each with the same In inputs but a different set of weights. MAPLE's support for data access patterns allows convolutions to be expressed as matrix operations. The operations amount to repeated matrix-matrix multiplications {A}x {B}={C} where {A}, {B} and {C} are sets of input, kernel and output matrices…Then each matrix-matrix operation AxB is parallelized along the columns of B: a MAPLE PE 108 computes one element in the final matrix.],
the processing device comprising: a hardware main processing integrated circuit; [host general purpose processor as depicted in Fig. 4, in 0035: … The processor 400 is connected to a general-purpose host via a communication interface such as PCI. A high-bandwidth bus connects each core 100 to an off chip instruction memory bank 402…]; 
and a plurality of hardware basic processing integrated circuits; [a plurality of N vector processing elements (PEs), as depicted in Fig. 4 and Fig. 1, in 0027: … FIG. 1, an exemplary design for a MAPLE processing core is shown. Each core 100 has p= N ·M processing elements (PEs) 108. The PEs 108 are organized as M processing chains 104, having N PEs 108 each…]
wherein: the hardware main processing integrated circuit is configured to: split the first data set into a plurality of distinct basic data blocks; [the host is configured to transmit data, as the claimed data blocks split as depicted in Fig. 7 and Fig. 8, to the set of PE basic units as depicted in Fig. 4 and Fig. 1, in 0027: …Each chain 104 has a bi-directional, nearest neighbor interconnect between the PEs 108 along which inputs are propagated in one direction and outputs in the other…; and as data split for parallelization, in 0061: Referring now to FIG. 7, one method of parallelization is shown, where each column of matrix B is loaded into the local stores of PEs (0,0) and (1,0) (i.e., the first PEs in chains 0 and 1). The image rows are streamed in one by one and broadcast to the two chains, resulting in PE (0,0) computing column 0, and PE (1,0) computing column 1 of the output…; or parallelized as in 0061: … Another schedule is shown in the bottom of the figure where columns 0 and 1 of matrix B are duplicated in 2 PE local stores 106 in each chain 104. Both rows are streamed in together; therefore all four output elements are computed simultaneously, making it twice as fast…]
distribute the plurality of distinct basic data blocks to the plurality of hardware basic processing integrated circuits, wherein each of the plurality of distinct basic data blocks is distributed to one of the plurality of hardware basic processing integrated circuits and at least two hardware basic processing integrated circuits receive different basic data blocks; [where the received data blocks, which are different for the PE basic processing ICs, are the claimed first set of data, as depicted in Fig. 7, in 0061: Referring now to FIG. 7, one method of parallelization is shown, where each column of matrix B is loaded into the local stores of PEs (0,0) and (1,0) (i.e., the first PEs in chains 0 and 1) (the claimed two hardware basic processing integrated circuits receiving different basic data blocks)…; or parallelized as in 0061: … Another schedule is shown in the bottom of the figure where columns 0 and 1 of matrix B (different basic data blocks) are duplicated in 2 PE local stores 106 in each chain 104 (the claimed two hardware basic processing integrated circuits receiving different basic data blocks)…; and as depicted in Fig. 8(b), in 0063: FIG. 8b illustrates a parallelization mode of 1, wherein each column of B is represented once in the processing elements (the claimed two hardware basic processing integrated circuits receiving different basic data blocks) and a single element of A is streamed at a time.…];
identify a broadcast data block from the second data set; and broadcast the broadcast data block to the plurality of hardware basic processing integrated circuits, wherein each of the plurality of hardware basic processing integrated circuits receive the same broadcast data block; [the claimed ICs are configured, as depicted in Fig. 4 and Fig. 1 and discussed in the previous limitations, to receive identified data sets as input data blocks, in 0028: … Each PE 108 takes two vector operands as inputs, one from its local store 106 and the other streaming from the input buffer 102.; where the second input, which is the same for all of the PE basic processing ICs, is the claimed second set of data, as depicted in Fig. 7, in 0061: Referring now to FIG. 7, one method of parallelization is shown, where each column of matrix B is loaded into the local stores of PEs (0,0) and (1,0) (i.e., the first PEs in chains 0 and 1). The image rows (the claimed same broadcast data block) are streamed in one by one and broadcast to the two chains (each of the claimed plurality of hardware basic processing integrated circuits receives the same broadcast data block to process the output), resulting in PE (0,0) computing column 0, and PE (1,0) computing column 1 of the output…; or parallelized as in 0061: … Another schedule is shown in the bottom of the figure where columns 0 and 1 of matrix B are duplicated in 2 PE local stores 106 in each chain 104. Both rows are streamed in together (the claimed plurality of hardware basic processing integrated circuits receive the same broadcast data block); therefore all four output elements are computed simultaneously, making it twice as fast…; and as depicted in Fig. 8(b), in 0063: FIG. 8b illustrates a parallelization mode of 1, wherein each column of B is represented once in the processing elements and a single element of A (the claimed plurality of hardware basic processing integrated circuits receive the same broadcast data block, as depicted in Fig. 8b) is streamed at a time.…]
each of the hardware basic processing integrated circuits is configured to: receive a corresponding basic data block distributed by the hardware main processing integrated circuit and the broadcast data block broadcasted by the hardware main processing integrated circuit; [the claimed ICs are configured, as depicted in Fig. 4 and Fig. 1 and discussed in the previous limitations, to receive identified data sets as input data blocks, in 0028: … Each PE 108 takes two vector operands as inputs, one from its local store 106 and the other streaming from the input buffer 102.; where the second input, which is the same for all of the PE basic processing ICs, is the claimed second set of data, as depicted in Fig. 7, in 0061: Referring now to FIG. 7, one method of parallelization is shown, where each column of matrix B (the claimed corresponding basic data block distributed by the hardware main processing integrated circuit) is loaded into the local stores of PEs (0,0) and (1,0) (i.e., the first PEs in chains 0 and 1). The image rows (the claimed same broadcast data block) are streamed in one by one and broadcast to the two chains (the claimed broadcast data block broadcasted by the hardware main processing integrated circuit), resulting in PE (0,0) computing column 0, and PE (1,0) computing column 1 of the output…; or parallelized as in 0061: … Another schedule is shown in the bottom of the figure where columns 0 and 1 of matrix B (the claimed corresponding basic data block distributed by the hardware main processing integrated circuit) are duplicated in 2 PE local stores 106 in each chain 104. Both rows are streamed in together (the claimed broadcast data block broadcasted by the hardware main processing integrated circuit); therefore all four output elements are computed simultaneously, making it twice as fast…; and as depicted in Fig. 8(b), in 0063: FIG. 8b illustrates a parallelization mode of 1, wherein each column of B (the claimed corresponding basic data block distributed by the hardware main processing integrated circuit) is represented once in the processing elements and a single element of A (the claimed broadcast data block broadcasted by the hardware main processing integrated circuit) is streamed at a time.…]
perform an operation in the neural network on the received basic data block and the received broadcast data block; [performing convolutions on Matrix B and Matrix A, the claimed distributed and broadcasted sets of data blocks respectively, in 0060: …This is repeated for On output images, each with the same In inputs but a different set of weights. MAPLE's support for data access patterns allows convolutions to be expressed as matrix operations. The operations amount to repeated matrix-matrix multiplications {A}x {B}={C} where {A}, {B} and {C} are sets of input, kernel and output matrices…Then each matrix-matrix operation AxB is parallelized along the columns of B: a MAPLE PE 108 computes one element in the final matrix.];
and return an operation result to the hardware main processing integrated circuit; [operation results are returned to the host IC for each compute phase in the chain of N PE basic circuits, in 0037: …Each chain has a COMPUTE phase that lasts for L cycles, where L is the operand vector size. The COMPUTE phase is followed by a STORE phase where the outputs of the N PEs 108 in the chain 104 are collected and stored in the memory block…];
the plurality of hardware basic processing integrated circuits perform respective operations in parallel; [parallelization of operations as depicted in Fig. 7 and Fig. 8(b), as discussed in the above limitations, and in 0060: …The operations amount to repeated matrix-matrix multiplications {A}x {B}={C} where {A}, {B} and {C} are sets of input, kernel and output matrices…. Then each matrix-matrix operation AxB is parallelized along the columns of B: a MAPLE PE 108 computes one element in the final matrix.];
and the hardware main processing integrated circuit is configured to perform a set of [the main host is configured to coordinate operations on the operation results returned from the PEs, using the smart memory block and reduce module as depicted in Fig. 1, in 0036: Each core 100 also has its own separate instruction memory bank 402 that is written by the host…; and to send instructions for the coordination of the received multiplications to compute the sums, in 0060: … The core computation in one layer is the convolution of In input images with L, kernels and their pixel-wise summation to produce one output image. …; and in 0028-0030: …A PE chain 104 sends its outputs to its respective smart memory block 110, which can perform in-memory processing such as array ranking, finding maxima and minima, and aggregation… For example, a matrix multiplication can be implemented by first distributing columns of a constant matrix to all PE local stores 106. Then the rows of a second matrix are streamed across each PE chain 104, and the result is streamed into the smart memory blocks 110... The contents of the smart memory blocks 110 can be aggregated and written to off chip storage. This implements a "reduce network" 112, by which the data from a particular location in all M smart memory blocks 110 can be operated before writing off-chip. The reduce operation may include summation (the claimed arithmetic operations in series on the operation results forwarded from the plurality of hardware branch processing circuits) or finding minima or maxima.; and in series over L cycles, as depicted in Fig. 8(b), in 0063: ...In the case of split columns, the smart memory 110 will accumulate results from the PEs processing a column before performing its reduction operation (the claimed arithmetic operations in series on the operation results forwarded from the plurality of hardware branch processing circuits)…].

Regarding claim 4, the rejection of claim 2 is incorporated. Cad further teaches the claim limitation(s):
further comprising a plurality of hardware branch processing circuits disposed between the hardware main processing integrated circuit and the plurality of hardware basic processing integrated circuits, [each processing chain IC connects the host to the subset of N PEs, as depicted in Fig. 1 and Fig. 4, in 0027: Referring now to the drawings in which like numerals represent the same or similar elements and initially to FIG. 1, an exemplary design for a MAPLE processing core is shown. Each core 100 has p= N ·M processing elements (PEs) 108. The PEs 108 are organized as M processing chains 104, having N PEs 108 each…; and in 0035: Referring now to FIG. 4, an overall MAPLE accelerator 400 is shown, comprising C processing cores 100. The processor 400 is connected to a general-purpose host via a communication interface such as PCI. A high-bandwidth bus connects each core 100 to an off chip instruction memory bank 402…]
the plurality of hardware branch processing circuits being configured to forward data between the hardware main processing integrated circuit and the plurality of hardware basic processing integrated circuits. [the host is configured to input data to the branch circuits, as discussed above and as depicted in Fig. 1 and Fig. 4; and the processing chain branch circuits are configured to forward the transmitted data to the set of PE basic units, as depicted in Fig. 1 and Fig. 4, in 0027: …Each chain 104 has a bi-directional, nearest neighbor interconnect between the PEs 108 along which inputs are propagated in one direction and outputs in the other. The first PE 108-N-1 in every chain accepts inputs from an input local store 102. In an alternative embodiment, each chain 104 has a separate input buffer 102, such that a stall in one chain would not affect the other chains.]

Regarding claim 7, the rejection of claim 2 is incorporated. Cad further teaches the claim limitation(s):
wherein the hardware main processing integrated circuit is configured to: divide the broadcast data block into a plurality of broadcast data sub-blocks; and broadcast the plurality of broadcast data sub-blocks to the plurality of hardware basic processing integrated circuits through multiple broadcasts, wherein each broadcast transmits a same broadcast data sub-block to each of the plurality of hardware basic processing integrated circuits. [the claimed ICs are configured, as depicted in Fig. 4 and Fig. 1 and discussed in the previous limitations in claim 2, to receive identified data sets; where the data set is partitioned into input data sub-blocks as depicted in Fig. 7 and Fig. 8, in 0028: … Each PE 108 takes two vector operands as inputs, one from its local store 106 and the other streaming from the input buffer 102.; where the second input, which is the same for all of the PE basic processing ICs, is the claimed second set of data, as depicted in Fig. 7, in 0061: Referring now to FIG. 7, one method of parallelization (the claimed multiple broadcast operations) is shown, where each column of matrix B is loaded into the local stores of PEs (0,0) and (1,0) (i.e., the first PEs in chains 0 and 1). The image rows (the claimed same broadcast data sub-blocks) are streamed in one by one and broadcast to the two chains (each of the claimed plurality of hardware basic processing integrated circuits receives the same broadcast data sub-block to process the output), resulting in PE (0,0) computing column 0, and PE (1,0) computing column 1 of the output…; or parallelized, as the claimed multiple broadcasts, as in 0061: … Another schedule is shown in the bottom of the figure where columns 0 and 1 of matrix B are duplicated in 2 PE local stores 106 in each chain 104. Both rows are streamed in together (the claimed plurality of hardware basic processing integrated circuits receive the same broadcast data sub-block); therefore all four output elements are computed simultaneously, making it twice as fast…; and as depicted in Fig. 8(b), in 0063: FIG. 8b illustrates a parallelization mode of 1, wherein each column of B is represented once in the processing elements and a single element of A (the claimed plurality of hardware basic processing integrated circuits receive the same broadcast data sub-block, as depicted in Fig. 8b) is streamed at a time.…]
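
For illustration only, broadcasting a data block in several sub-blocks, where every PE receives the same sub-block on each broadcast and accumulates a partial result, may be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from Cad: the function name, the `sub_block_size` parameter, and the sample values are hypothetical.

```python
# Illustrative sketch only; all names here are hypothetical, not from Cad.

def broadcast_in_sub_blocks(row, columns, sub_block_size):
    """Broadcast `row` piece by piece; every PE sees the same piece each time.

    `columns[pe]` plays the role of the data held in the local store of PE `pe`.
    Returns one accumulated dot product per PE.
    """
    accumulators = [0] * len(columns)  # one running partial sum per PE
    for start in range(0, len(row), sub_block_size):
        sub_block = row[start:start + sub_block_size]  # identical for all PEs
        for pe, col in enumerate(columns):
            local = col[start:start + sub_block_size]  # matching local slice
            accumulators[pe] += sum(a * b for a, b in zip(sub_block, local))
    return accumulators


row = [1, 2, 3]
columns = [[1, 0, 1], [2, 2, 2]]
in_two_broadcasts = broadcast_in_sub_blocks(row, columns, sub_block_size=2)
in_one_broadcast = broadcast_in_sub_blocks(row, columns, sub_block_size=3)
```

The result does not depend on the sub-block size, since each PE simply adds up the partial dot products from successive broadcasts.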

Regarding claim 10, the rejection of claim 1 is incorporated. Cad further teaches the claim limitation(s):
wherein: the hardware main processing integrated circuit comprises at least one of a main register or a main on-chip cache circuit; [in 0025: A data processing system suitable for storing and/or executing program code may include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories…; where the system includes host memory, in 0056: In addition to the above instructions, MAPLE's API functions allow users to allocate space in the off-chip memory from the host, transfer data and programs between the host and MAPLE…]
and each of the plurality of hardware basic processing integrated circuits comprises at least one of a basic register or a basic on-chip cache circuit. [in 0025: A data processing system suitable for storing and/or executing program code may include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories…; where the system includes a PE memory store as depicted in Fig. 1, in 0028: Each PE 108 also has a private local store 106 which can be written with data from off-chip…; and in 0063: … In the case of split columns, the smart memory 110 will accumulate results from the PEs processing a column before performing its reduction operation…]

Regarding claim 11, the rejection of claim 1 is incorporated. Cad further teaches the claim limitation(s):
wherein the hardware main processing integrated circuit comprises at least one of a vector arithmetic unit circuit, an arithmetic logic unit circuit, an accumulator circuit, a matrix transpose circuit, a direct memory access circuit, or a data rearrangement circuit. [host comprising the reduce module, as claimed accumulator circuit, in 0053: …The reduce operations can be, for example, aggregations or comparisons.]
and each of the plurality of hardware basic processing integrated circuits comprises at least one of a basic register or a basic on-chip cache circuit. [in 0025: A data processing system suitable for storing and/or executing program code may include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories…; where the system includes a PE memory store as depicted in Fig. 1, in 0028: Each PE 108 also has a private local store (the claimed basic on-chip cache circuit) 106 which can be written with data from off-chip…]

Regarding claim 13, the rejection of claim 1 is incorporated. Cad further teaches the claim limitation(s):
wherein the hardware main processing integrated circuit is connected with each of the plurality of hardware branch processing circuits, and the plurality of hardware branch processing circuits are not connected to one another. [the main host processor is connected to each branch chain IC via the instruction bank, as depicted in Fig. 4; where each chain is not connected to another, as depicted in Fig. 1, in 0035: … The processor 400 is connected to a general-purpose host via a communication interface such as PCI. A high-bandwidth bus connects each core 100 to an off chip instruction memory bank 402. After processing, the each core 100 communicates with one of two off-chip memory banks 406…; and linked via the input store that connects the host to the processing chains, in claim 10: wherein the processing cores further comprise an input store that receives input data from a host and passes said input data to the processing chains.]

Regarding claim 14, the rejection of claim 1 is incorporated. Cad further teaches the claim limitation(s):
wherein the plurality of hardware branch processing circuits are connected in series and at least one of the hardware branch processing circuits is connected to the hardware main processing integrated circuit. [the chain branch circuits are connected in series for storing data in the smart memory blocks, branch by branch, in serial cycles for computing results as depicted in Fig. 5, in 0061: …The image rows are streamed in one by one and broadcast to the two chains, resulting in PE (0,0) computing column 0, and PE (1,0) computing column 1 of the output...; and in 0037-0039: As noted above, storing data in the smart memory blocks 110 can take many cycles. Referring now to FIG. 5, a stall mechanism is shown. FIG. 5 shows the M chains 104 of a MAPLE core. Each chain has a COMPUTE phase that lasts for L cycles, where L is the operand vector size. The COMPUTE phase is followed by a STORE phase where the outputs of the N PEs 108 in the chain 104 are collected and stored in the memory block. Successive chains are separated by one cycle due to pipelining (the claimed hardware branch processing circuits connected in series)... The smart memory 110 triggers a stall when it uses a variable latency store, such as that shown above in FIG. 3… The OR gate 502 in FIG. 5 broadcasts a global STALL signal generated from the individual chains. If one chain stalls, its input must be stalled-then all of the chains are stalled in order to preserve the order of processing (the claimed hardware branch processing circuit connected to the hardware main processing integrated circuit for series operations in a processing order using stall commands/instructions).]

Regarding claim 15, the rejection of claim 13 is incorporated. Cad further teaches the claim limitation(s):
wherein the plurality of hardware branch processing circuits are configured to receive the data transmitted by the hardware main processing integrated circuit directly. [Fig. 1 and Fig. 4 depict how the processing IC circuits are configured to directly transmit data as noted by the above limitations, in 0035: … The processor 400 is connected to a general-purpose host via a communication interface such as PCI. A high-bandwidth bus connects each core 100 to an off chip instruction memory bank 402…; and in claim 10: wherein the processing cores further comprise an input store that receives input data from a host and passes said input data to the processing chains (claimed plurality of hardware branch processing circuits are configured to receive the data transmitted by the hardware main processing integrated circuit directly)...]

Regarding claim 16, the rejection of claim 14 is incorporated. Cad further teaches the claim limitation(s):
wherein at least one of the plurality of hardware branch processing circuits is configured to forward the data transmitted by the hardware main processing integrated circuit to another one of the plurality of hardware branch processing circuits connected thereto. [in 0039: The OR gate 502 in FIG. 5 broadcasts a global STALL signal (claimed data transmitted by the hardware main processing integrated circuit) generated from the individual chains (claimed at least one of the plurality of hardware branch processing circuits is configured to forward the data transmitted by the hardware main processing integrated circuit to another one of the plurality of hardware branch processing circuits). This signal can be pipelined, since it only has to stall the next input vector and can therefore reach the first chain as its current COMPUTE cycle completes. The global stall is used because all of the chains process a common pipelined input that streams from the input local store...]

Regarding claim 17, the rejection of claim 1 is incorporated. Cad further teaches the claim limitation(s):
wherein the data comprises at least one of a vector, a matrix, a three-dimensional data block, a four-dimensional data block, or an n-dimensional data block. [matrix and vector, in 0028-0029: … Each PE 108 takes two vector operands as inputs, one from its local store 106 and the other streaming from the input buffer 102. [0029] For example, a matrix multiplication can be implemented by first distributing columns of a constant matrix to all PE local stores 106. Then the rows of a second matrix are streamed across each PE chain 104, and the result is streamed into the smart memory blocks 110…]

Regarding claim 18, the rejection of claim 2 is incorporated. Cad further teaches the claim limitation(s):
wherein: the broadcast data block is used as a multiplier data block and the distribution data block is used as a multiplicand data block, when the operation instruction is a multiplication instruction; [the second input, as the claimed broadcast block, as depicted in Fig. 7, in [0061]: Referring now to FIG. 7, one method of parallelization is shown, where each column of matrix B (claimed distribution data block is used as a multiplicand data block) is loaded into the local stores of PEs (0,0) and (1,0) (i.e., the first PEs in chains 0 and 1). The image rows (claimed broadcast data block is used as a multiplier data block) are streamed in one by one and broadcast to the two chains, resulting in PE (0,0) computing column 0, and PE (1,0) computing column 1 of the output… Another schedule is shown in the bottom of the figure where columns 0 and 1 of matrix B are duplicated in 2 PE local stores 106 in each chain 104. Both rows are streamed in together; therefore all four output elements are computed simultaneously, making it twice as fast…; and as depicted in Fig. 8(b), in 0063: FIG. 8b illustrates a parallelization mode of 1, wherein each column of B is represented once in the processing elements and a single element of A (the claimed plurality of hardware basic processing integrated circuits receive the same broadcast data block, as depicted in Fig. 8b) is streamed at a time…; when the instruction is for convolution operations, in 0060: … This is repeated for On output images, each with the same In inputs but a different set of weights. MAPLE's support for data access patterns allows convolutions to be expressed as matrix operations. The operations amount to repeated matrix-matrix multiplications {A}x{B}={C} where {A}, {B} and {C} are sets of input, kernel and output matrices (claimed when the operation instruction is a multiplication instruction).]
and the broadcast data block is used as an input data block and the distribution data block is used as a convolution kernel, when the operation instruction is a convolution instruction. [the second input, as the claimed broadcast block, as depicted in Fig. 7, in [0061]: Referring now to FIG. 7, one method of parallelization is shown, where each column of matrix B (claimed distribution data block) is loaded into the local stores of PEs (0,0) and (1,0) (i.e., the first PEs in chains 0 and 1). The image rows (claimed broadcast data block is used as an input data block) are streamed in one by one and broadcast to the two chains, resulting in PE (0,0) computing column 0, and PE (1,0) computing column 1 of the output… Another schedule is shown in the bottom of the figure where columns 0 and 1 of matrix B are duplicated in 2 PE local stores 106 in each chain 104. Both rows are streamed in together; therefore all four output elements are computed simultaneously, making it twice as fast…; and as depicted in Fig. 8(b), in 0063: FIG. 8b illustrates a parallelization mode of 1, wherein each column of B is represented once in the processing elements and a single element of A (the claimed plurality of hardware basic processing integrated circuits receive the same broadcast data block, as depicted in Fig. 8b) is streamed at a time…; when the instruction is for convolution operations, in 0060: … This is repeated for On output images, each with the same In inputs but a different set of weights. MAPLE's support for data access patterns allows convolutions to be expressed as matrix operations. The operations amount to repeated matrix-matrix multiplications {A}x{B}={C} where {A}, {B} (claimed distribution data block used as a convolution kernel) and {C} are sets of input, kernel and output matrices (claimed when the operation instruction is a convolution instruction).]

Regarding independent claim 19, Cad teaches a method, implemented by a processing device, for performing operations in a neural network [in pg. 1 Right Col: …Convolutional neural networks (ConvNets) are known as the SoA ML algorithms specialized at BIC, loosely inspired by the organization of the human brain [4]. ConvNets process raw data directly, combining the classical models of feature extraction and classification into a single algorithm…; and pg. 2 Right Col: …All these emerging DL models can be future targets for our PIM proposal, yet, in this paper we focus on ConvNets for image and video…],
the claim limitations are similar to the limitations of claim 1 and are therefore rejected under the same rationale.

Regarding claim 20, the rejection of claim 19 is incorporated. Cad further teaches the claim limitation(s):
further comprising: dividing the data, by the hardware main processing integrated circuit, into a distribution data block and a broadcast data block according to an operation instruction; [the instruction for convolution operations, in 0060: … This is repeated for On output images, each with the same In inputs but a different set of weights. MAPLE's support for data access patterns allows convolutions to be expressed as matrix operations. The operations amount to repeated matrix-matrix multiplications {A}x{B}={C} where {A}, {B} (claimed distribution data block used as a convolution kernel) and {C} are sets of input, kernel and output matrices (claimed distribution data block and a broadcast data block according to an operation instruction).]
splitting the distribution data block into a plurality of distinct basic data blocks; distributing each of the plurality of distinct basic data blocks to a corresponding hardware basic processing integrated circuit as the first set of data; and broadcasting the broadcast data block to each of the plurality of hardware basic processing integrated circuits as the second set of data. [splitting data into distinct blocks for processing, as depicted in Fig. 7, in [0061]: Referring now to FIG. 7, one method of parallelization (claimed splitting distribution process) is shown, where each column of matrix B (claimed distribution data block into a plurality of distinct basic data blocks) is loaded into the local stores of PEs (0,0) and (1,0) (i.e., the first PEs in chains 0 and 1). The image rows (claimed broadcast data block to each of the plurality of hardware basic processing integrated circuits as the second set of data) are streamed in one by one and broadcast to the two chains, resulting in PE (0,0) computing column 0, and PE (1,0) computing column 1 of the output… Another schedule is shown in the bottom of the figure where columns 0 and 1 of matrix B are duplicated in 2 PE local stores 106 in each chain 104. Both rows are streamed in together; therefore all four output elements are computed simultaneously, making it twice as fast…; and as depicted in Fig. 8(b), in 0063: FIG. 8b illustrates a parallelization mode of 1, wherein each column of B is represented once in the processing elements and a single element of A (the claimed plurality of hardware basic processing integrated circuits receive the same broadcast data block, as depicted in Fig. 8b) is streamed at a time…]
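
For illustration only, and not as code taken from Cad or from the application, the distribute-and-broadcast mapping relied on above can be sketched in Python. The function name `distribute_and_broadcast`, the list-of-lists matrix layout, and the notion of one "local store" per processing element are assumptions made for this example.

```python
def distribute_and_broadcast(a_rows, b):
    """Compute A x B by distributing columns of B and broadcasting rows of A.

    Hypothetical model: each processing element (PE) holds one distinct
    column of B, and every row of A is streamed to all PEs.
    """
    n_cols = len(b[0])
    # Distribution step: split B into distinct columns, one per PE.
    local_stores = [[row[j] for row in b] for j in range(n_cols)]

    output = []
    for a_row in a_rows:
        # Broadcast step: the same row of A is seen by every PE, and each PE
        # multiplies and sums it against its own stored column.
        output.append(
            [sum(x * y for x, y in zip(a_row, column)) for column in local_stores]
        )
    return output


result = distribute_and_broadcast([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result)
```

In this sketch each output column is produced by exactly one simulated PE, which mirrors the quoted passage in which PE (0,0) computes column 0 and PE (1,0) computes column 1 of the output.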

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 3, 5-6, 8-9, and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Cadambi et al. (US Pub. No. 2011/0119467, hereinafter ‘Cad’) in view of Guevorkian et al. (US Pub. No. 2010/0122070, hereinafter ‘David’).

Regarding claim 3, the rejection of claim 2 is incorporated. Cad further teaches the claim limitation(s):
wherein: the operation includes an inner-product operation and the operation result includes an inner-product operation result; and each of the plurality of hardware basic processing integrated circuits is further configured to: obtain the inner-product operation result by performing the inner-product operation between the received basic data block and the received broadcast data block; [coordination of multiplications to compute the sums, operations that include obtaining product results as convolution operations, in 0060: … The core computation in one layer is the convolution of In input images with L, kernels and their pixel-wise summation to produce one output image…; in [0061]: Referring now to FIG. 7, one method of parallelization is shown, where each column of matrix B (claimed corresponding basic data block distributed by the hardware main processing integrated circuit) is loaded into the local stores of PEs (0,0) and (1,0) (i.e., the first PEs in chains 0 and 1). The image rows (claimed received same broadcast data block) are streamed in one by one and broadcast to the two chains (claimed broadcast data block broadcasted by the hardware main processing integrated circuit), resulting in PE (0,0) computing column 0, and PE (1,0) computing column 1 of the output…; or parallelized as in 0061: … Another schedule is shown in the bottom of the figure where columns 0 and 1 of matrix B (claimed corresponding basic data block distributed by the hardware main processing integrated circuit) are duplicated in 2 PE local stores 106 in each chain 104. Both rows are streamed in together (claimed broadcast data block broadcasted by the hardware main processing integrated circuit); therefore all four output elements are computed simultaneously, making it twice as fast…; and as depicted in Fig. 8(b), in 0063: FIG. 8b illustrates a parallelization mode of 1, wherein each column of B (claimed corresponding basic data block distributed by the hardware main processing integrated circuit) is represented once in the processing elements and a single element of A (claimed broadcast data block broadcasted by the hardware main processing integrated circuit) is streamed at a time…]
and return the inner-product operation result to the hardware main processing integrated circuit. [operation results returned for each compute phase in the chain of N PEs as basic results to the host IC, in 0037: …Each chain has a COMPUTE phase that lasts for L cycles, where L is the operand vector size. The COMPUTE phase is followed by a STORE phase where the outputs of the N PEs 108 in the chain 104 are collected and stored in the memory block…; and in 0059-0060: Convolutional neural networks (CNNs) are 2-dimensional neural networks used for pattern recognition. A CNN uses small 2-D arrays of learned weights ("kernels") that are convolved with input images to produce output images... CNN classification uses 1D or 2D convolutions followed by arithmetic operations and sub-sampling. The core computation in one layer is the convolution of In input images with L, kernels and their pixel-wise summation to produce one output image. This is repeated for On output images, each with the same In inputs but a different set of weights (claimed the inner-product operation result by performing the inner-product operation between the received basic data block and the received broadcast data block). MAPLE's support for data access patterns allows convolutions to be expressed as matrix operations. The operations amount to repeated matrix-matrix multiplications {A}x{B}={C} where {A}, {B} and {C} are sets of input, kernel and output matrices (claimed returned the inner-product operation result to the hardware main processing integrated circuit).]
While Cad teaches the process for executing operations of a convolutional neural network as disclosed above, as convolution operators on the respective data blocks using the multiplexer and ADDR as depicted in Fig. 2 and Fig. 3 to produce an output data block as depicted in Fig. 7, as recited in the claim limitations above, Cad does not expressly disclose the convolution operations as inner-product operations.
David teaches the use of inner-product operations as recited in the claim limitations:
wherein: the operation includes an inner-product operation and the operation result includes an inner-product operation result; and…: obtain the inner-product operation result by performing the inner-product operation between the received basic data block and the received broadcast data block; and return the inner-product operation result… [inner product operations, in 0034: … It appears that the most common method for implementing inner product calculations is based on performing multiplications and additions or multiply-accumulate operations on traditional multipliers and adders or multiply accumulate units…; performing inner-product operations among input data sets, in 0047: … Specifically, these teachings detail a new high-performance approach for massively parallel implementation of computations. Examples of where such large matrix-vector computations may be implemented include matrix-vector product, FIR filtering, convolution, and discrete orthogonal transforms, to name a few...; and in 0036-0039: … In the data storage array there are subvector slices x(i,r,s) of a first vector x(i) which are stored in a bit-parallel word-serial manner. The processor is configured to execute an operation, on each of the stored subvector slices and in parallel on bits of said each subvector slice, that outputs a pre-calculated inner product result (claimed returned result by performing the inner-product operation between the received basic data block and the received broadcast data block) of the said bits and a second vector a.]

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to integrate the method for performing multiplications and additions or multiply-accumulate operations in a parallel processing system using multiply-accumulate units as disclosed by Cad with the method of obtaining inner-product results by performing multiplications and additions or multiply-accumulate operations on multipliers and adders or multiply-accumulate units as disclosed by David.
One of ordinary skill in the art would have been motivated to combine the teachings of Cad and David to distribute the processing of parallel inner products in an efficient manner and reduce computational complexity (David, 0047 and 0096-0097), thus improving the computational efficiency of the inner-product operations when using more than one multiply-accumulate unit.
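
As a minimal illustration of the relationship relied on in this rationale, and not as an implementation drawn from either reference, the following Python sketch shows that a multiply-accumulate loop over two vectors yields their inner product, and that a single entry of a matrix product is such an inner product. The names `inner_product_mac` and `matmul_entry` are hypothetical and exist only for this example.

```python
def inner_product_mac(x, y):
    """Inner product of two equal-length vectors via multiply-accumulate."""
    acc = 0
    for a, b in zip(x, y):
        acc += a * b  # one multiply-accumulate step per element pair
    return acc


def matmul_entry(a, b, i, j):
    """Entry (i, j) of A x B, computed as row i of A dotted with column j of B."""
    column_j = [row[j] for row in b]
    return inner_product_mac(a[i], column_j)


dot = inner_product_mac([1, 2, 3], [4, 5, 6])
entry = matmul_entry([[1, 2], [3, 4]], [[5, 6], [7, 8]], 1, 0)
print(dot, entry)
```

Under these assumptions, the matrix-matrix multiplications quoted from Cad and the inner-product calculations quoted from David reduce to the same multiply-and-sum primitive.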

Regarding claim 5, the rejection of claim 2 is incorporated. Cad further teaches the claim limitation(s):
wherein each of the plurality of hardware basic processing circuits is configured to: obtain an inner-product operation result by performing an inner-product operation between the corresponding basic data block and the broadcast data block; and obtain the operation result by performing an accumulation operation of the inner-product operation result. [coordination of multiplications to compute the sums, operations that include obtaining product results, in 0060: … The core computation in one layer is the convolution of In input images with L, kernels and their pixel-wise summation to produce one output image…; in [0061]: Referring now to FIG. 7, one method of parallelization is shown, where each column of matrix B (claimed corresponding basic data block distributed by the hardware main processing integrated circuit) is loaded into the local stores of PEs (0,0) and (1,0) (i.e., the first PEs in chains 0 and 1). The image rows (claimed received same broadcast data block) are streamed in one by one and broadcast to the two chains (claimed broadcast data block broadcasted by the hardware main processing integrated circuit), resulting in PE (0,0) computing column 0, and PE (1,0) computing column 1 of the output…; or parallelized as in 0061: … Another schedule is shown in the bottom of the figure where columns 0 and 1 of matrix B (claimed corresponding basic data block distributed by the hardware main processing integrated circuit) are duplicated in 2 PE local stores 106 in each chain 104. Both rows are streamed in together (claimed broadcast data block broadcasted by the hardware main processing integrated circuit); therefore all four output elements are computed simultaneously, making it twice as fast…; and as depicted in Fig. 8(b), in 0063: FIG. 8b illustrates a parallelization mode of 1, wherein each column of B (claimed corresponding basic data block distributed by the hardware main processing integrated circuit) is represented once in the processing elements and a single element of A (claimed broadcast data block broadcasted by the hardware main processing integrated circuit) is streamed at a time…; and operation results returned for each compute phase in the chain of N PEs as basic results to the host IC, in 0037: …Each chain has a COMPUTE phase that lasts for L cycles, where L is the operand vector size. The COMPUTE phase is followed by a STORE phase where the outputs of the N PEs 108 in the chain 104 are collected and stored in the memory block… and in 0031: Referring now to FIG. 2, the PEs 108 are shown in greater detail. Each PE 108 performs arithmetic logic unit (ALU) and multiply operations, as well as a multiply-accumulate operation in a single cycle (claimed obtained operation result by performing an accumulation operation of the inner-product operation result in a cycle)…]
While Cad teaches the process for executing operations of a convolutional neural network as disclosed above, as convolution operators on the respective data blocks using the multiplexer and ADDR as depicted in Fig. 2 and Fig. 3 to produce an output data block as depicted in Fig. 7, as recited in the claim limitations above, Cad does not expressly disclose the convolution operations as inner-product operations.
David teaches the use of inner-product operations as recited in the claim limitations:
obtain an inner-product operation result by performing an inner-product operation between the corresponding basic data block and the broadcast data block; and obtain the operation result by performing an accumulation operation of the inner-product operation result [inner product operations, in 0034: … It appears that the most common method for implementing inner product calculations is based on performing multiplications and additions or multiply-accumulate operations on traditional multipliers and adders or multiply accumulate units (performing an accumulation operation)…; performing inner-product operations among input data sets, in 0047: … Specifically, these teachings detail a new high-performance approach for massively parallel implementation of computations. Examples of where such large matrix-vector computations may be implemented include matrix-vector product, FIR filtering, convolution, and discrete orthogonal transforms, to name a few...; and in 0036-0039: … In the data storage array there are subvector slices x(i,r,s) of a first vector x(i) which are stored in a bit-parallel word-serial manner. The processor is configured to execute an operation, on each of the stored subvector slices and in parallel on bits of said each subvector slice, that outputs a pre-calculated inner product result (claimed returned result by performing the inner-product operation between the received basic data block and the received broadcast data block) of the said bits and a second vector a.; and in 0090: … The binary inner products shown at FIG. 4d are summed (claimed obtained operation result by performing an accumulation operation of the inner-product operation result) up according to equation (8) in the general case or according to equation (9) in the case of FIR filtering type of operations (moving window inner products). Different summation procedures may be applied. One specific implementation to do this summation utilizes an adder tree principle, and is illustrated at FIG. 4e.]
The Cad and David references would have been recognized by those of ordinary skill in the art as useful for applicant’s purpose in developing information processing products for parallel processing applications.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to integrate the method for performing multiplications and additions or multiply-accumulate operations in a parallel processing system using multiply-accumulate units as disclosed by Cad with the method of obtaining inner-product results by performing multiplications and additions or multiply-accumulate operations as disclosed by David.
One of ordinary skill in the art would have been motivated to combine the teachings of Cad and David to distribute the processing of parallel inner products in an efficient manner and reduce computational complexity (David, 0047 and 0096-0097), thus improving the computational efficiency of the inner-product operations when using more than one multiply-accumulate unit.

Regarding claim 6, the rejection of claim 3 is incorporated. Cad in combination with David further teaches the claim limitation(s):
wherein the hardware main processing integrated circuit is configured to: obtain an accumulated result by performing an accumulation operation of the inner-product operation result received from each of the plurality of hardware basic processing integrated circuits; and obtain an instruction result corresponding to the operation instruction by arranging the accumulated results. [coordination of multiplications to compute the sums, operations that include obtaining product results, in 0060: … The core computation in one layer is the convolution of In input images with L, kernels and their pixel-wise summation to produce one output image (claimed instruction result corresponding to the operation instruction by arranging the accumulated results)…; and operation results returned for each compute phase in the chain of N PEs as basic results to the host IC, as the claimed instruction result, in 0037: …Each chain has a COMPUTE phase that lasts for L cycles (claimed operation instruction by arranging the accumulated results into cycles for the PE units), where L is the operand vector size. The COMPUTE phase is followed by a STORE phase where the outputs of the N PEs 108 in the chain 104 are collected and stored in the memory block… and in 0031-0032: Referring now to FIG. 2, the PEs 108 are shown in greater detail. Each PE 108 performs arithmetic logic unit (ALU) and multiply operations, as well as a multiply-accumulate operation in a single cycle (claimed obtained operation result by performing an accumulation operation of the inner-product operation result in a cycle)… The PEs 108 store outputs to their smart memory block via the intra-chain interconnect and can continue processing the next vector in the next cycle. For some embodiments, the PEs 108 may also store their output in their respective local stores 106. Unless the smart memory block 110 issues a stall, a store operation takes N cycles, as the outputs from each PE 108 arrive (claimed instruction result corresponding to the operation instruction by arranging the accumulated results).]
While Cad teaches the process for executing operations of a convolutional neural network as disclosed above, as convolution operators on the respective data blocks using the multiplexer and ADDR as depicted in Fig. 2 and Fig. 3 to produce an output data block as depicted in Fig. 7, as recited in the claim limitations above, Cad does not expressly disclose the convolution operations as inner-product operations.
David teaches the use of inner-product operations as recited in the claim limitations:
obtain an accumulated result by performing an accumulation operation of the inner-product operation result received …; [inner product operations, in 0034: … It appears that the most common method for implementing inner product calculations is based on performing multiplications and additions or multiply-accumulate operations on traditional multipliers and adders or multiply accumulate units (performing an accumulation operation)…; performing inner-product operations among input data sets, in 0047: … Specifically, these teachings detail a new high-performance approach for massively parallel implementation of computations. Examples of where such large matrix-vector computations may be implemented include matrix-vector product, FIR filtering, convolution, and discrete orthogonal transforms, to name a few...; and in 0036-0039: … In the data storage array there are subvector slices x(i,r,s) of a first vector x(i) which are stored in a bit-parallel word-serial manner. The processor is configured to execute an operation, on each of the stored subvector slices and in parallel on bits of said each subvector slice, that outputs a pre-calculated inner product result (claimed returned result by performing the inner-product operation between the received basic data block and the received broadcast data block) of the said bits and a second vector a.; and in 0090: … The binary inner products shown at FIG. 4d are summed (claimed obtained operation result by performing an accumulation operation of the inner-product operation result) up according to equation (8) in the general case or according to equation (9) in the case of FIR filtering type of operations (moving window inner products). Different summation procedures may be applied. One specific implementation to do this summation utilizes an adder tree principle, and is illustrated at FIG. 4e.]
The Cad and David references would have been recognized by those of ordinary skill in the art as useful for applicant’s purpose in developing information processing products for parallel processing applications.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to integrate the method for performing multiplications and additions or multiply-accumulate operations in a parallel processing system using multiply-accumulate units as disclosed by Cad with the method of obtaining inner-product results by performing multiplications and additions or multiply-accumulate operations as disclosed by David.
One of ordinary skill in the art would have been motivated to combine the teachings of Cad and David to distribute the processing of parallel inner products in an efficient manner and reduce computational complexity (David, 0047 and 0096-0097), thus improving the computational efficiency of the inner-product operations when using more than one multiply-accumulate unit.

Regarding claim 8, the rejection of claim 2 is incorporated. Cad further teaches the claim limitation(s):
wherein each of the plurality of hardware basic processing integrated circuits is configured to: obtain an inner-product operation result by performing an inner-product operation between each broadcast data sub-block and the respective basic data block; [the host configured as in the limitations of claims 2 and 7 for coordination of multiplications to compute the sums, operations that include obtaining product results as convolution operations, in 0060: … The core computation in one layer is the convolution of In input images with L, kernels and their pixel-wise summation to produce one output image…; in [0061]: Referring now to FIG. 7, one method of parallelization is shown, where each column of matrix B (claimed respective basic data block distributed by the hardware main processing integrated circuit) is loaded into the local stores of PEs (0,0) and (1,0) (i.e., the first PEs in chains 0 and 1). The image rows (claimed received same broadcast sub-blocks as indexed in Fig. 7 and Fig. 8b) are streamed in one by one and broadcast to the two chains, resulting in PE (0,0) computing column 0, and PE (1,0) computing column 1 of the output…; or parallelized as in 0061: … Another schedule is shown in the bottom of the figure where columns 0 and 1 of matrix B are duplicated in 2 PE local stores 106 in each chain 104. Both rows are streamed in together; therefore all four output elements are computed simultaneously, making it twice as fast…; and as depicted in Fig. 8(b), in 0063: FIG. 8b illustrates a parallelization mode of 1, wherein each column of B (claimed respective basic data block) is represented once in the processing elements and a single element of A (claimed broadcast data sub-block broadcasted as depicted in Fig. 7 and Fig. 8b) is streamed at a time…; as sub-blocks in sub-sampled layers of a CNN, in 0059: … with n kernels is given by O,=tanh(bias…) where I/K1, represents the convolution operation between image I1 and kernel K1,. O, may be sub-sampled afterwards. All this constitutes one of several "layers" of a CNN…]
and return the operation sub-result to the hardware main unit processing integrated circuit. [returned operation results for each compute phase in the chain of N_PEs basic results to the host IC, in 0037: …Each chain has a COMPUTE phase that lasts for L cycles, where L is the operand vector size. The COMPUTE phase is followed by a STORE phase where the outputs of the N PEs 108 in the chain 104 are collected and stored in the memory block…; and in 0059-0060: Convolutional neural networks (CNNs) are 2-dimensional neural networks used for pattern recognition. A CNN uses small 2-D arrays of learned weights ("kernels") that are convolved with input images to produce output images... CNN classification uses 1D or 2D convolutions followed by arithmetic operations and sub-sampling. The core computation in one layer is the convolution of In input images with L, kernels and their pixel-wise summation to produce one output image. This is repeated for On output images, each with the same In inputs but a different set of weights (claimed the inner-product operation result by performing the inner-product operation between the received basic data block and the received broadcast data sub-block). MAPLE's support for data access patterns allows convolutions to be expressed as matrix operations. The operations amount to repeated matrix-matrix multiplications {A}x{B}={C} where {A}, {B} and {C} are sets of input, kernel and output matrices (claimed returned the inner-product operation result to the hardware main processing integrated circuit)]
Cad teaches the process for executing operations of a convolutional neural network as disclosed above, as convolution operations on the respective data blocks using the multiplexer and ADDR as depicted in Fig. 2 and Fig. 3 to produce an output data block as depicted in Fig. 7, as recited in the claim limitation above.
However, Cad does not expressly disclose the convolution operations as inner-product operations.
David teaches the use of inner-product operations as disclosed by the claim limitations:
obtain an inner-product operation result by performing an inner-product operation … obtain an operation sub-result by performing an accumulation operation of the inner-product operation result; and return the inner-product operation result… [inner product operations, in 0034: … It appears that the most common method for implementing inner product calculations is based on performing multiplications and additions or multiply-accumulate operations on traditional multipliers and adders or multiply accumulate units…; performing inner product operations among input data sets, in 0047: … Specifically, these teachings detail a new high-performance approach for massively parallel implementation of computations. Examples of where such large matrix-vector computations may be implemented include matrix-vector product, FIR filtering, convolution, and discrete orthogonal transforms, to name a few...; and in 0036-0039: … In the data storage array there are subvector slices x(i,r,s) of a first vector x(i) which are stored in a bit-parallel word-serial manner. The processor is configured to execute an operation, on each of the stored subvector slices and in parallel on bits of said each subvector slice, that outputs a pre-calculated inner product result (claimed returned result by performing the inner-product operation between data blocks) of the said bits and a second vector a.; and as sub-blocks in sub-sampled layers of a CNN, in 0059: … with n kernels is given by O,=tan h(bias…) where I/K1, represents the convolution operation between image I1 and kernel K1,. O, may be sub-sampled afterwards. All this constitutes one of several "layers" of a CNN…]
The Cad and David references would have been recognized by those of ordinary skill in the art as useful for applicant’s purpose in developing information processing products for parallel processing applications.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to integrate the method for performing multiplications and additions or multiply-accumulate operations in a parallel processing system using multiply-accumulate units as disclosed by Cad with the method of obtaining inner product results by performing multiplications and additions or multiply-accumulate operations on multipliers and adders or multiply-accumulate units as disclosed by David.
One of ordinary skill in the art would have been motivated to combine the teachings of Cad and David to distribute the processing of parallel inner products in an efficient manner and reduce computational complexity (David, 0047 and 0096-0097), thus improving the computational efficiency of the inner-product operations when using more than one multiply-accumulate unit.


Regarding claim 9, the rejection of claim 8 is incorporated. Cad in combination with David further teaches the claim limitation(s):
wherein each of the plurality of hardware basic processing integrated circuits is configured to: obtain n processing sub-results by multiplexing each of the broadcast data sub-blocks n times and perform inner-product operations between the broadcast data sub-blocks and n basic data blocks; obtain n operation sub-results by performing accumulation operations of the n processing sub-results respectively; [the host configured as in claim 2 and 7 limitations for coordination of multiplications to compute the sums, operations that include obtaining product results as convolution operations, in 0031: … Each PE 108 performs arithmetic logic unit (ALU) and multiply operations, as well as a multiple-accumulate operation in a single cycle…; and in 0060: … The core computation in one layer is the convolution of In input images with L, kernels and their pixel-wise summation to produce one output image (claimed n operation sub-results by performing accumulation operations as depicted in Fig. 1)…; in 0061: Referring now to FIG. 7, one method of parallelization is shown, where each column of matrix B (claimed respective basic data block distributed by the hardware main processing integrated circuit) is loaded into the local stores of PEs (0,0) and (1,0) (i.e., the first PEs in chains 0 and 1). The image rows (claimed received same broadcast sub-blocks as indexed in Fig. 7 and Fig. 8b) are streamed in one by one and broadcast to the two chains, resulting in PE (0,0) computing column 0, and PE (1,0) computing column 1 of the output…; and as sub-blocks in sub-sampled layers of a CNN, in 0059: … with n kernels is given by O,=tan h(bias…) where I/K1, represents the convolution operation between image I1 and kernel K1,. O, may be sub-sampled afterwards. All this constitutes one of several "layers" of a CNN…]
and return the n operation sub-results to the hardware main unit processing integrated circuit, wherein n is an integer greater than or equal to 2. [returned operation results for each compute phase in the chain of N_PEs basic results to the host IC, where N is greater than or equal to 2 as depicted in Fig. 1, in 0037: …Each chain has a COMPUTE phase that lasts for L cycles, where L is the operand vector size. The COMPUTE phase is followed by a STORE phase where the outputs of the N PEs 108 in the chain 104 are collected and stored in the memory block…; and in 0059-0060: Convolutional neural networks (CNNs) are 2-dimensional neural networks used for pattern recognition. A CNN uses small 2-D arrays of learned weights ("kernels") that are convolved with input images to produce output images... CNN classification uses 1D or 2D convolutions followed by arithmetic operations and sub-sampling. The core computation in one layer is the convolution of In input images with L, kernels and their pixel-wise summation to produce one output image… The operations amount to repeated matrix-matrix multiplications {A}x{B}={C} where {A}, {B} and {C} are sets of input, kernel and output matrices (claimed returned the inner-product operation result to the hardware main processing integrated circuit)… as depicted in Fig. 7, in 0061: Referring now to FIG. 7, one method of parallelization is shown, where each column of matrix B (the claimed first set of data) is loaded into the local stores of PEs (0,0) and (1,0) (i.e., the first PEs in chains 0 and 1). The image rows (the claimed second set of data) are streamed in one by one and broadcast to the two chains, resulting in PE (0,0) computing column 0, and PE (1,0) computing column 1 of the output. Another schedule is shown in the bottom of the figure where columns 0 and 1 of matrix B are duplicated in 2 PE local stores 106 in each chain 104…; and executing workloads with N greater than or equal to 2, in 0048: To execute such workloads on MAPLE, host-to-MAPLE communication may be overlapped with MAPLE's execution. Specifically, the C processor-memory channels may be partitioned into two groups of C/2 channels each. Then A and B are divided into m and n chunks: A={A1, A2, ..., Am} and B={B1, B2, ..., Bn}, thereby splitting the entire operation AB into mn smaller operations.]
Cad teaches the process for executing operations of a convolutional neural network as disclosed above, as convolution operations on the respective data blocks using the multiplexer and ADDR as depicted in Fig. 2 and Fig. 3 to produce an output data block as depicted in Fig. 7, as recited in the claim limitation above.
However, Cad does not expressly disclose the convolution operations as inner-product operations.
David teaches the use of inner-product operations as disclosed by the claim limitations:
obtain n processing sub-results by multiplexing each of the broadcast data sub-blocks n times and perform inner-product operations between the broadcast data sub-blocks [inner product operations on data sub-blocks as depicted in Fig. 4d, in 0034: … It appears that the most common method for implementing inner product calculations is based on performing multiplications and additions or multiply-accumulate operations on traditional multipliers and adders or multiply accumulate units…; performing inner product operations among input data sets, in 0047: … Specifically, these teachings detail a new high-performance approach for massively parallel implementation of computations. Examples of where such large matrix-vector computations may be implemented include matrix-vector product, FIR filtering, convolution, and discrete orthogonal transforms, to name a few...; and in 0036-0039: … In the data storage array there are subvector slices x(i,r,s) of a first vector x(i) which are stored in a bit-parallel word-serial manner. The processor is configured to execute an operation, on each of the stored subvector slices and in parallel on bits of said each subvector slice, that outputs a pre-calculated inner product result (claimed returned result by performing the inner-product operation between data sub-blocks) of the said bits and a second vector a; where n is greater than or equal to 2, in 0011: For example, in order to pairwise add N pairs of m-bit integers (N being less than or equal to the number of CAM rows), the following algorithm may be used. Assume the corresponding pairs are written in CAM memory, one pair in a row manner and occupy bits 0 to 2m-1. Also assume that outputs (the pairwise sums) must be written in the same rows as the corresponding input pairs but in the bit slices 2m through 3m. One possible algorithm that pairwise adds all the N pairs in parallel could be as follows...]

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to integrate the method for performing multiplications and additions or multiply-accumulate operations in a parallel processing system using multiply-accumulate units as disclosed by Cad with the method of obtaining inner product results by performing multiplications and additions or multiply-accumulate operations on multipliers and adders or multiply-accumulate units as disclosed by David.
One of ordinary skill in the art would have been motivated to combine the teachings of Cad and David to distribute the processing of parallel inner products in an efficient manner and reduce computational complexity (David, 0047 and 0096-0097), thus improving the computational efficiency of the inner-product operations when using more than one multiply-accumulate unit.

Regarding claim 12, the rejection of claim 1 is incorporated. Cad further teaches the claim limitation(s):
wherein each of the hardware basic processing integrated circuits further comprises at least one of an inner-product arithmetic unit circuit or an accumulator circuit. [in 0031: … Each PE 108 performs arithmetic logic unit (ALU) and multiply operations, as well as a multiple-accumulate operation (claimed accumulator circuit and inner-product arithmetic unit) in a single cycle…]
While Cad does not expressly disclose that inner product calculations are based on performing multiplications and additions or multiply-accumulate operations on traditional multipliers and adders or multiply-accumulate units, David does disclose, in 0034: … It appears that the most common method for implementing inner product calculations is based on performing multiplications and additions or multiply-accumulate operations on traditional multipliers and adders or multiply accumulate units…
The Cad and David references would have been recognized by those of ordinary skill in the art as useful for applicant’s purpose in developing information processing products for parallel processing applications.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to integrate the method for performing multiplications and additions or multiply-accumulate operations in a parallel processing system using multiply-accumulate units as disclosed by Cad with the method of obtaining inner product results by performing multiplications and additions or multiply-accumulate operations on multipliers and adders or multiply-accumulate units as disclosed by David.
One of ordinary skill in the art would have been motivated to combine the teachings of Cad and David to distribute the processing of parallel inner products in an efficient manner and reduce computational complexity (David, 0047 and 0096-0097), thus improving the computational efficiency of the inner-product operations when using more than one multiply-accumulate unit.

Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.

The prior art made of record and not relied upon that is considered pertinent to applicant's disclosure is listed below:
Callahan, II et al. (US Pub. No. 2011/0314256): teaches that a kernel is a function to be performed in a data-parallel task.
Lu et al. (US Pat. No. 10,073,816): teaches processing convolution operations using processor engines as the basic elements in a parallel computing environment using a main and branch circuit hierarchy as depicted in Fig. 6, and broadcasting and distributing data in a hierarchical structure in a parallel computing environment.
Nekuii et al. (US Pub. No. 2017/0344880): teaches processing convolution operations using tensor engines as the basic elements in a parallel computing environment using a host main circuit and deep learning branch processors that are used to execute basic processing functions on IC circuits.
Turner et al. (US Pat. No. 5,956,703): teaches processing successive layers of a neural network in a parallel computing environment using a master circuit and slave branch processors that are used to execute basic processing functions on IC circuits.
Zou et al. (US Pat. No. 9,607,355): teaches the three-unit structural hierarchy of having a main processor on a CPU co-processing with GPU devices to execute threads as basic units in a parallel processing environment.
Reyes et al. (NPL: “Prediction of progesterone receptor inhibition by high-performance neural network algorithm”): teaches a GPU with a hierarchical architecture, as depicted in Fig. 10 of the reference.
Karam et al. (NPL: “Memory-Centric Reconfigurable Accelerator for Classification and Machine Learning Applications”): teaches the use of processing elements clustered to process operations in a hierarchical form using an I/O interface to the main unit for controlling the read/write and logic controls, where the input images are distributed to each PE along with the filter coefficients to perform the convolution operations in parallel, in Sec. 4.3 and Sec. 3.4.
Tapiador-Morales et al. (NPL: “Comprehensive Evaluation of OpenCL-based Convolutional Neural Network Accelerators in Xilinx and Altera FPGAs”, hereinafter ‘Tap’): teaches a parallel computing platform that consists of a host computer connected to several devices, as depicted in Fig. 3 of the reference.
Teng et al. (US Patent Application Publication No. 2019/0114534): Teng teaches the computing system for processing machine learning algorithms using hardware accelerators.
Du et al. (US Patent Application Publication No. 2019/0087716): Du teaches the use of multi-core processing module units to parallelize operations of neural network processing using an accelerator.
Xie et al (US Patent Application Publication No. 2018/0157969): Xie teaches the computation of convolutional neural network.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to OLUWATOSIN ALABI whose telephone number is (571)272-0516.  The examiner can normally be reached on Monday-Friday, 8:00am-5:00pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ann Lo can be reached on (571) 272-9767.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a 






/O.O.A./Examiner, Art Unit 2126   
/ANN J LO/Supervisory Patent Examiner, Art Unit 2126