DETAILED ACTION
This action is in response to the communications filed on 12/02/2020, in which claims 1, 3-4, 6-11, 13-17, and 19-20 have been amended and claims 2, 5, and 12 have been cancelled. Claims 1, 3-4, 6-11, and 13-20 remain pending.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority
Applicant’s claim for the benefit of prior-filed U.S. Application No. 16/168,778, filed on October 23, 2018, which is a continuation application of International Application No. PCT/CN2017/099991, filed August 31, 2017, is acknowledged.

Information Disclosure Statement
The information disclosure statements (IDSs) submitted on 10/24/2019, 03/09/2020, and 05/19/2020 have been considered by the examiner. Only the non-patent documents provided in English have been considered, as noted in the annotated IDS documents. For the foreign patent documents, only the English abstracts have been considered because the full documents were not available in English.
Drawings
The drawings were received on 06/09/2015.  These drawings are acceptable.




Specification
The substitute specification filed 12/09/2020 has been reviewed and entered.

Response to Arguments
Applicant's arguments filed 12/02/2020 have been fully considered.

Regarding the objection to the specification, applicant has submitted the appropriate changes and the objection made in the previous action has been withdrawn.

Regarding the Double Patenting Rejection, the applicant amended the claim language and the rejection made in the previous office action has been withdrawn.

Regarding the rejection of claims under 35 USC § 112(b), the applicant has removed the problematic terms and the rejection made in the previous office action has been withdrawn.

Regarding the rejection of claims under 35 USC § 103, the applicant’s arguments have been fully considered. 
Applicant’s arguments with respect to claims have been considered but are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument. New art has been cited to address amended claim limitations.


Claim Rejections - 35 USC § 112- New Matter
Claims 1, 3-4, 6-11, and 13-20 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA ), first paragraph, as failing to comply with the written description requirement. The claim(s) contains subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA  35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention. 
Regarding claim 1, the claim recites the limitation “rearranging, by the processing device, the input data to move data in the third dimension of each three dimensional sample in the input data from one of the two middle layers to the innermost layer, wherein the data moved to the innermost layer are arranged consecutively along the third dimension of that three-dimensional sample”, which contains subject matter not described in the original specification. Specifically, the specification as filed 12/24/2019, with the original disclosure, describes the use of multi-dimensional structures that can be rearranged when performing convolutional operations using 4-dimensional data structures, in [0005]-[0007] and in [001007]-[00111]; however, those passages do not describe the movement process recited in the amended claim limitation. The specification is silent regarding a movement process as described by the newly amended claim limitation, and the applicant has not identified the particular paragraphs that provide support for it; therefore, the claim limitation contains new matter not described in the original specification.
Regarding claim 11, the claim recites similar content to claim 1 and is therefore rejected under the same rationale.
Regarding claims 3-4 and 6-10, which depend on claim 1, the claim limitations do not resolve the deficiency noted above; the claims are therefore rejected on the same basis.
Regarding claims 13-20, which depend on claim 11, the claim limitations do not resolve the deficiency noted above; the claims are therefore rejected on the same basis.
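
For reference only, the rearrangement recited in claims 1 and 11 can be read as a change of data layout in which the third (depth) dimension becomes the fastest-varying index. The following is a minimal sketch of that one reading, assuming a hypothetical sample/depth/height/width ordering of the four-dimensional block and hypothetical helper names (move_depth_innermost, flatten); it is not drawn from the specification or from any cited reference.

```python
def move_depth_innermost(block):
    """Reorder block[n][c][h][w] into out[n][h][w][c].

    Assumed (hypothetical) input layout: n = sample index (outermost),
    c = depth, h and w = the two window dimensions (w innermost).
    """
    n_samples = len(block)
    depth = len(block[0])
    height = len(block[0][0])
    width = len(block[0][0][0])
    return [
        [
            [
                [block[n][c][h][w] for c in range(depth)]
                for w in range(width)
            ]
            for h in range(height)
        ]
        for n in range(n_samples)
    ]


def flatten(nested):
    """Flatten a nested list in storage order (outermost index first)."""
    if not isinstance(nested, list):
        return [nested]
    flat = []
    for item in nested:
        flat.extend(flatten(item))
    return flat


# One sample, depth 2, height 2, width 2; each value encodes c*100 + h*10 + w.
block = [[[[0, 1], [10, 11]], [[100, 101], [110, 111]]]]
rearranged = move_depth_innermost(block)
print(flatten(rearranged))  # [0, 100, 1, 101, 10, 110, 11, 111]
```

In the flattened output, the two depth values for each (h, w) position sit next to each other, which is what "arranged consecutively along the third dimension" would mean under this reading.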


Claim Rejections - 35 USC § 112- Indefiniteness 
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claim 6 is rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor, or for pre-AIA  the applicant regards as the invention.

Regarding claim 6, the claim recites the limitation “the second plurality of basic blocks”. There is insufficient antecedent basis for this limitation in the claim.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

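Claims 1, 3-4, 6, 8-11, and 13-20 are rejected under 35 U.S.C. 103 as being unpatentable over Parashar et al. (NPL: “Scnn: An accelerator for compressed-sparse convolutional neural networks”, hereinafter ‘Parash’) in view of Mairal (NPL: “End-to-End Kernel Learning with Supervised Convolutional Kernel Networks”).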

Regarding independent claim 1 limitations, Parash teaches a convolution operation method performed by a processing device:
receiving, by the processing device, input data and weight data, wherein: (Parash: receiving weight data and the input data as image data represented as activation volumes of the CNN as depicted in Figure 2: CNN computations and parameter inputs including weight and input data, in Sec. 2: 1st para.: … During inference, a new image (in the case of image recognition) is presented to the network, which classifies into the training categories by computing each of the layers in the network, in succession. The intermediate data between the layers are called activations, and the output activations of one layer becomes the input activations of the next layer.

    [Figure: media_image1.png]
), 
the input data represent (Parash teaches input activation data as depicted in Fig. 2 as the representation of input data, in Sec. 2: 1st para.: Convolutional Neural Network algorithms (CNNs) are essentially a cascaded set of pattern recognition filters that need to be trained [23]. A CNN consists of a series of layers,... During inference, a new image (in the case of image recognition) is presented to the network, which classifies into the training categories by computing each of the layers in the network, in succession. The intermediate data between the layers are called activations, and the output activations of one layer becomes the input activations of the next layer.) a first plurality of three-dimensional samples in a neural network, each three-dimensional sample including first and second dimensions defining a sliding operation window for performing the convolution operation method and a third dimension defining a depth of the sliding window, wherein the input data are arranged as a four-dimensional data block including an innermost layer, two middle layers, and an outermost layer wherein the innermost layer corresponds to the first dimension, the two middle layers correspond to the second and third dimensions, and the outermost layer correspond to a number of the first plurality of three-dimensional samples; and (See Fig. 2; Parash teaches receiving the input activation and weights for processing convolutions using sliding operations, in pg. 30: Right Col. 1st full para.: The core operation in a CNN convolutional layer is a 2-dimensional sliding-window convolution of an R × S element filter over a W ×H element input activation plane to produce a W ×H element output activation plane. The data can include multiple (C) input activation planes, which are referred to as input channels. A distinct filter is applied to each input activation channel, and the filter outputs for each of the C channels are accumulated together element-wise into a single output activation plane or output channel… Figure 2 shows these parameters applied to the computation of a single CNN layer.)
the weight data represent a second plurality of three-dimensional convolution kernels (Parash teaches in Fig.2 a plurality of 3-D weights), each three-dimensional convolution kernel including first and second dimensions defining a two-dimensional weight matrix  size of each two-dimensional weight matrix defines a size of the sliding operation window; (See Fig. 2; Parash teaches receiving the input activation and weights for processing convolutions using sliding operations, in pg. 30: Right Col. 1st full. Para: The core operation in a CNN convolutional layer is a 2-dimensional sliding-window convolution of an R × S element filter over a W ×H element input activation plane to produce a W ×H element output activation plane. The data can include multiple (C) input activation planes, which are referred to as input channels. A distinct filter is applied to each input activation channel, and the filter outputs for each of the C channels are accumulated together element-wise into a single output activation plane or output channel… Figure 2 shows these parameters applied to the computation of a single CNN layer.)
rearranging, by the processing device, the input data to move data in the third dimension of each three-dimensional sample in the input data from one of the two middle layers to the innermost layer, wherein the data moved to the innermost layer are arranged consecutively along the third dimension of that three-dimensional sample; and (Parash teaches rearranging the input to compute an output sub-volume representation of the input data as depicted in Fig. 2, in Sec. 3.1: 4th full para: …Blocking the weights and partial sums in the output channel (K) dimension can increase reuse of these data structures and improve energy efficiency. We therefore factor the K output channels into K/Kc output-channel groups of size Kc, and only store weights and outputs for a single output-channel group at a time inside the weight and accumulator buffers. Thus the sub-volumes that are housed in buffers at the computation unit are: Weights: C ×Kc ×R × S, Inputs: C ×W ×H, and Partial Sums: Kc ×W ×H…)
performing, by the processing device, a convolution operation (Parash: computing partial sums for creating sub-volumes) between the rearranged input data and the weight data by performing multiple partial convolution operations between corresponding sliding operation windows from the rearranged input data and three-dimensional convolution kernels from the weight data in parallel. (Parash teaches convolution operations corresponding to sliding operation windows from the rearranged input data as multiplied computations in parallel computation units for accumulating partial sums from the rearranged input data and weight data, in Sec. 3.1: … First, consider the operation of a scalar processing element (PE) with a single multiply-accumulate unit. We employ an input-stationary (IS) computation order in which an input activation is held stationary at the computation units as it is multiplied by all of the filter weights needed to make all of its contributions to each of the K output channels (a K ×R × S sub-volume). Thus each input activation will contribute to a volume of K ×W ×H output activations…; and in Sec. 4.1,…Input weights and activations. Each PE’s state machine operates on the weight and input activations in the order defined by the PT-IS-CP-sparse dataflow to produce a output-channel group of Kc ×Wt ×Ht partial sums inside the accumulation buffers. First, a vector F of compressed weights and a vector I of compressed input activations are fetched from their respective buffers. These vectors are distributed into the F×I multiplier array which computes a form of the Cartesian product of the vectors, i.e, every input activation is multiplied by every weight to form a partial sum.)
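
The sliding-window computation quoted from Parash above (an R × S filter applied to each of C input planes, with the per-channel results accumulated element-wise into one output plane) can be sketched as follows. This is an illustrative sketch only: the function name is hypothetical, and it assumes no padding, so the output plane is smaller than the W × H plane described in the quoted passage.

```python
def conv2d_multichannel(inputs, filters):
    """Accumulate per-channel sliding-window products into one output plane.

    inputs[c][y][x]  : C input activation planes
    filters[c][r][s] : one R x S filter per input channel
    Assumption: no padding, stride 1, so only fully covered windows are used.
    """
    channels = len(inputs)
    height, width = len(inputs[0]), len(inputs[0][0])
    r_size, s_size = len(filters[0]), len(filters[0][0])
    out_h, out_w = height - r_size + 1, width - s_size + 1
    out = [[0] * out_w for _ in range(out_h)]
    for c in range(channels):
        for y in range(out_h):
            for x in range(out_w):
                for r in range(r_size):
                    for s in range(s_size):
                        out[y][x] += inputs[c][y + r][x + s] * filters[c][r][s]
    return out


inputs = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],  # channel 0
    [[1, 1, 1], [1, 1, 1], [1, 1, 1]],  # channel 1
]
filters = [
    [[1, 0], [0, 1]],  # filter for channel 0
    [[1, 1], [1, 1]],  # filter for channel 1
]
print(conv2d_multichannel(inputs, filters))  # [[10, 12], [16, 18]]
```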

Mairal does teach Cartesian products comprising inner-product operations/calculations [in pg. 4, Sec. Idea 4: …
    [Figure: media_image2.png]
…]
The Parash and Mairal references would have been recognized by those of ordinary skill in the art as useful for applicant’s purpose of developing a method for executing convolution operations using multi-dimensional representations.
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to combine the teachings of Parash and Mairal. Mairal teaches performing neural network multi-dimensional operations using kernel representations in the Cartesian product space. Parash teaches a method for distributing data for processing convolutional product operations using multiple dimensional data structures of the neural network data sets using the Cartesian product.
One of ordinary skill in the art would have been motivated to encode the data associated with convolutional operations using multi-dimensional data structures in the Cartesian product space comprising the inner product, as disclosed in Mairal, to improve the efficiency of the method for distributing data for processing convolutional product operations using multiple dimensional data structures, as disclosed by Parash, in order to provide multi-layer constructions for representing image data in convolutional neural networks (Mairal, pg. 4, Sec. Idea 4) and to help support learning using data representation methods that control the learning capacity and reduce over-fitting (Mairal, Sec. 1: nd para.), thereby providing more efficient processing of the multilayer data construction (Mairal, pg. 4, Sec. Idea 4).

Regarding claim 3, the rejection of claim 1 is incorporated and Parash in combination with Mairal further teaches the convolution operation method of claim 1,
wherein the processing device includes a main processing circuit (Parash teaches a main processing circuit on-chip for executing the SCNN architecture and functions, in Pg. 39 Sec. 7: last para.: … By compressing the activations, SCNN can typically keep all of the activations on-chip, without requiring a more complicated fused algorithm.) and a plurality of basic processing circuits (Parash teaches a plurality of processing elements), and wherein performing the convolution operation between the rearranged input data (Parash input-activation volume) and the weight data further includes: dividing, by the main processing circuit, the weight data into a plurality of basic data blocks; (Parash teaches dividing weight data into blocks along the K dimension, as depicted in Fig. 2, for processing by the main SCNN architecture circuit, which broadcasts them to the PE circuits, in Pg. 31: Sec. 3.1 …Inter-PE parallelism. … We employ a spatial tiling strategy to spread the work across an array of PEs so that each PE can operate independently. The W×H element activation plane is partitioned into smaller Wt ×Ht element planar tiles (PT) that are distributed across the PEs. Each tile extends fully into the input-channel dimension C, resulting in an input-activation volume of C×Wt ×Ht assigned to each PE. Weights are broadcast to the PEs, and each PE operates on its own subset of the input and output activation space…)
distributing, by the main processing circuit, the plurality of basic data blocks to the plurality of basic processing circuits (Parash broadcasting [i.e. distributing] different weight sub-blocks to each PE, in pg. 31: Sec. 3.1: Parash Inter-PE parallelism; Fig. 2 depicts the weights as filters distributed to each Kc input volume, in pg. 31: Sec. 3.1), wherein each of the plurality of basic data blocks is distributed to one of the plurality of the basic processing circuits and at least two basic processing circuits receive different basic data blocks; and (Parash teaches dividing weight data into blocks along the K dimension, as depicted in Fig. 2, for processing by the main SCNN architecture circuit, which broadcasts them to the PE circuits, in Pg. 31: Sec. 3.1 … Thus the sub-volumes that are housed in buffers at the computation unit are: Weights: C ×Kc ×R × S, Inputs: C ×W ×H … Each iteration of this outer loop will require the weight buffer to be refilled and the accumulator buffer to be drained and cleared, while the contents of the input buffer will be fully reused because the same input activations are used across all output channels. … We employ a spatial tiling strategy to spread the work across an array of PEs so that each PE can operate independently. The W×H element activation plane is partitioned into smaller Wt ×Ht element planar tiles (PT) that are distributed across the PEs. Each tile extends fully into the input-channel dimension C, resulting in an input-activation volume of C×Wt ×Ht assigned to each PE. Weights are broadcast to the PEs, and each PE operates on its own subset of the input and output activation space…)
broadcasting, by the main processing circuit, at least a portion of the rearranged input data (input-activation volume/sub-blocks, depicted in Fig. 2) to the plurality of basic processing circuits (Parash broadcasting [i.e. distributing] input-activation sub-blocks to each PE), wherein each of the plurality of basic processing circuits receives the same portion of the rearranged input data (Parash input-activation volume of C×Wt ×Ht assigned to each PE). (Parash teaches dividing weight data into blocks along the K dimension, as depicted in Fig. 2, for processing by the main SCNN architecture circuit, which broadcasts them to the PE circuits, in Pg. 31: Sec. 3.1 …Thus the sub-volumes that are housed in buffers at the computation unit are: Weights: C ×Kc ×R × S, Inputs: C ×W ×H … Each iteration of this outer loop will require the weight buffer to be refilled and the accumulator buffer to be drained and cleared, while the contents of the input buffer will be fully reused because the same input activations are used across all output channels.…)
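
The distribute/broadcast pattern recited in claim 3 (different basic data blocks sent to different basic processing circuits, with the same portion of the input sent to all of them) can be sketched as below. This is a simplified illustration of the claim language only; the function names, the dictionary fields, and the block size parameter are hypothetical and do not model the SCNN hardware of Parash.

```python
def split_into_blocks(kernels, block_size):
    """Divide the list of kernels into consecutive blocks of block_size."""
    return [kernels[i:i + block_size] for i in range(0, len(kernels), block_size)]


def assign_work(kernels, block_size, input_portion):
    """Give each (hypothetical) basic processing circuit its own weight block
    and the same broadcast input portion."""
    blocks = split_into_blocks(kernels, block_size)
    return [
        {"circuit": index, "weights": block, "inputs": input_portion}
        for index, block in enumerate(blocks)
    ]


work = assign_work(["k0", "k1", "k2", "k3"], 2, [1, 2, 3])
for item in work:
    print(item["circuit"], item["weights"], item["inputs"])
# 0 ['k0', 'k1'] [1, 2, 3]
# 1 ['k2', 'k3'] [1, 2, 3]
```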

Regarding claim 4, the rejection of claim 3 is incorporated and Parash in combination with Mairal further teaches the convolution operation method of claim 3,
wherein performing the convolution operation between the rearranged input data and the weight data further includes: performing, by each of the plurality of basic processing circuits (The W×H element activation plane is partitioned into smaller Wt ×Ht element planar tiles (PT) that are distributed across the PEs.), operations on the portion of the rearranged input data broadcast to that basic processing circuit and one or more basic data blocks distributed to that basic processing circuit to obtain an operation result (Partial Sums: Kc ×W ×H; A vector of F filter-weights fetched from the weight buffer and a vector of I inputs fetched from the input activation buffer are delivered to an array of F×I multipliers to compute a full Cartesian product (CP) of output partial-sums [respective operational results].); providing, by the plurality of basic processing circuits the respective operation results to the main processing circuit (Input weights and activations. Each PE’s state machine operates on the weight and input activations in the order defined by the PT-IS-CP-sparse dataflow to produce a output-channel group of Kc ×Wt ×Ht partial sums inside the accumulation buffers [provided operational partial sums to the main processor accumulator], in pg. 33: Sec. 4.2.); and calculating, by the main processing circuit, a convolution operation result according to the operation results provided by the plurality of basic processing circuits. (Parash: The PE array is driven by a layer sequencer that orchestrates the movement of weights and activations and is connected to a DRAM controller that can broadcast weights to the PEs and stream activations to/from the PEs. SCNN can use an arbitrated bus as the global network to facilitate the weight broadcasts, the point-to-point delivery of input activations (IA) from DRAM, and the return of output activations (OA) back to DRAM [a convolution operation result according to the operation results provided by the plurality of basic processing circuits]..., in pg. 33: Sec. 4.1)
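
The F×I multiplier array quoted from Parash above, in which every fetched weight is multiplied by every fetched input activation, can be sketched as a plain Cartesian product. The function name is hypothetical, and the sketch deliberately omits the compression and output-coordinate indexing that the reference also describes.

```python
def cartesian_products(weights, activations):
    """Multiply every weight by every input activation (F x I products).

    Returns a list with one row per weight; row f holds the products of
    weights[f] with each activation in order.
    """
    return [[w * a for a in activations] for w in weights]


print(cartesian_products([2, 3], [1, 4, 5]))  # [[2, 8, 10], [3, 12, 15]]
```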


Regarding claim 6, the rejection of claim 3 is incorporated and Parash in combination with Mairal further teaches the convolution operation method of claim 3,
wherein dividing the weight data into a plurality of basic data blocks further includes (Each tile extends fully into the input-channel dimension C, resulting in an input-activation volume of C×Wt ×Ht assigned to each PE. Weights are broadcast [by the main processing circuit] to the PEs,): dividing, by the main processing circuit, the weight data into the second plurality of basic data blocks each comprising a three-dimensional convolution kernel. (dividing the weight data into kernel/tile processing data blocks as depicted in Fig. 2; in Pg. 31: Sec. 3.1 … Thus the sub-volumes that are housed in buffers at the computation unit are: Weights: C ×Kc ×R × S, Inputs: C ×W ×H … Each iteration of this outer loop will require the weight buffer to be refilled )
	
	

wherein broadcasting at least a portion of the rearranged input data to the plurality of basic processing circuits further includes: sliding, by the main processing circuit, the sliding operation window in the rearranged input data, wherein the size of the sliding operation window is equal to a size of each basic data block; (in pg. 30: Right Col. 1st full para.: The core operation in a CNN convolutional layer is a 2-dimensional sliding-window convolution of an R × S element filter over a W ×H element input activation plane to produce a W ×H element output activation plane [wherein the 2-D size of the sliding operation window is equal to a size of each basic data block]. The data can include multiple (C) input activation planes, which are referred to as input channels. A distinct filter is applied to each input activation channel, and the filter outputs for each of the C channels are accumulated together element-wise into a single output activation plane or output channel... Figure 2 shows these parameters applied to the computation of a single CNN layer.) and extracting, by the main processing circuit, data within the sliding operation window at each sliding position as the portion of the rearranged input data for broadcasting to the plurality of basic processing circuits. (in pg. 30: Right Col. … A CNN’s dataflow defines how the loops are ordered, partitioned, and parallelized [7]…; and depicted in Fig. 3 and Fig. 4 as the data extraction process over the dimensions of the sliding window:

    [Figure: media_image3.png]

    [Figure: media_image4.png]


 )

Regarding claim 9, the rejection of claim 4 is incorporated and Parash in combination with Mairal further teaches the convolution operation method of claim 4,
wherein performing the operations on the portion of the rearranged input data broadcast to that basic processing circuit and one or more basic data blocks distributed to that basic processing circuit to obtain an operation result further includes: performing, by each of the plurality of basic processing circuits, multiplication on element values of the portion of the rearranged input data and element values at corresponding positions of the one or more basic data blocks to obtain a plurality of multiplication results; (in pg. 30 Sec. 3: While the inner core of the dataflow in SCNN is based on a spatial Cartesian product, the complete dataflow requires a deep nested loop structure, mapped both spatially and temporally across multiple processing elements.; to perform multiplication operations to compute the partial sums by broadcasting the rearranged activation input data, in pg. 33: Sec. 4.2: … To process the first CNN layer, the layer sequencer streams a portion of the input image into the IARAM of each PE and broadcasts the compressed-sparse weights into the weight buffer of each PE… Each PE’s state machine operates on the weight and input activations in the order defined by the PT-IS-CP-sparse dataflow to produce a output-channel group of Kc ×Wt ×Ht partial sums inside the accumulation buffers… )
and providing, by each of the basic processing circuits, the plurality of multiplication results to the main processing circuit; and calculating, by the main processing circuit, the convolution operation result includes: accumulating, by the main processing circuit, the plurality of multiplication results provided by each of the basic processing circuits to obtain a convolution result for each basic processing circuit; and sorting, by the main processing circuit, a plurality of convolution results to obtain the convolution operation result. (Accumulating the partial sums provided by the PE basic circuits to sort and compute the output activation returned to the main processing circuit DRAM, in Pg. 33: Left Col. Sec 4.1 and 4.2: …The PE array is driven by a layer sequencer that orchestrates the movement of weights and activations and is connected to a DRAM controller that can broadcast weights to the PEs and stream activations to/from the PEs. SCNN can use an arbitrated bus as the global network to facilitate the weight broadcasts, the point-to-point delivery of input activations (IA) from DRAM, and the return of output activations (OA) back to DRAM...; and the sorting processing by indexing, in Sec. 4.2: Input weights and activations: Each PE’s state machine operates on the weight and input activations in the order defined by the PT-IS-CP-sparse dataflow to produce a output-channel group of Kc ×Wt ×Ht partial sums inside the accumulation buffers.)


wherein: performing the operations on the portion of the rearranged input data broadcast to that basic processing circuit and one or more basic data blocks distributed to that basic processing circuit to obtain an operation result further includes: performing, by each of the plurality of basic processing circuits, multiplication on element values of the portion of the rearranged input data and element values at corresponding positions of the one or more basic data blocks to obtain a plurality of multiplication results; (positions depicted in Figures 2-3; in pg. 30 Sec. 3: While the inner core of the dataflow in SCNN is based on a spatial Cartesian product, the complete dataflow requires a deep nested loop structure, mapped both spatially and temporally across multiple processing elements.; to perform multiplication operations to compute the partial sums by broadcasting the rearranged activation input data, in pg. 33: Sec. 4.2: … To process the first CNN layer, the layer sequencer streams a portion of the input image into the IARAM of each PE and broadcasts the compressed-sparse weights into the weight buffer of each PE… Each PE’s state machine operates on the weight and input activations in the order defined by the PT-IS-CP-sparse dataflow to produce a output-channel group of Kc ×Wt ×Ht partial sums inside the accumulation buffers… )
accumulating, by each of the basic processing circuits, the plurality of multiplication results to obtain a convolution result; and providing, by each of the basic processing circuits, the convolution result to the main processing circuit; and calculating, by the main processing circuit, the convolution operation result includes: sorting, by the main processing circuit, plurality of convolution results to obtain the convolution operation result. (Accumulating the partial sums provided by the PE basic circuits to sort and compute the output activation returned to the main processing circuit DRAM to compute the output activation result at a CNN layer, in Pg. 33: Left Col. Sec 4.1 and 4.2: …The PE array is driven by a layer sequencer that orchestrates the movement of weights and activations and is connected to a DRAM controller that can broadcast weights to the PEs and stream activations to/from the PEs. SCNN can use an arbitrated bus as the global network to facilitate the weight broadcasts, the point-to-point delivery of input activations (IA) from DRAM, and the return of output activations (OA) back to DRAM...; and the sorting processing by indexing, in Sec. 4.2: Input weights and activations: Each PE’s state machine operates on the weight and input activations in the order defined by the PT-IS-CP-sparse dataflow to produce a output-channel group of Kc ×Wt ×Ht partial sums inside the accumulation buffers.)

Regarding independent claim 11 limitations, the limitations are similar to the limitations in claim 1 and are therefore rejected under the same rationale.

Regarding claim 13, the rejection of claim 11 is incorporated; the limitations are similar to the limitations in claim 3 and are therefore rejected under the same rationale.

Regarding claim 14, the rejection of claim 13 is incorporated; the limitations are similar to the limitations in claim 4 and are therefore rejected under the same rationale.

Regarding claim 15, the rejection of claim 13 is incorporated; the limitations are similar to the limitations in claim 6 and are therefore rejected under the same rationale.

Regarding claims 16 and 17, the limitations are similar to the limitations in claims 9 and 10, respectively, and are therefore rejected under the same rationale.

Regarding claim 18, the rejection of claim 11 is incorporated and Parash in combination with Mairal further teaches the processing device of claim 11,
further comprising branch processing circuits configured to connect the main processing circuit to a plurality of basic processing circuits, and transmit data among the main processing circuit and the plurality of basic processing circuits. (in pg. 28, Left Col., last two paras. & Right Col., 1st full para.: depicted in Figs. 5 and 6 as a branch circuit cross-layer DRAM network for communicating data from the main circuit to buffers for processing by the PE basic circuits: … As with any CNN accelerator, SCNN must accumulate the partial products generated by the multipliers…. First, maintaining the weights and activations in a compressed form throughout the pipeline reduces energy-hungry data staging and transmission costs. Second, the entire volume of activations of larger CNNs can remain in on-die buffers between layers, entirely eliminating expensive cross-layer DRAM references for a large number of networks… We also implemented an SCNN PE in synthesizable System C and compiled the design into gates using a combination of commercial high-level synthesis (HLS) tools and a traditional Verilog compiler…

    [Figure: media_image5.png]

    [Figure: media_image6.png]


 )

Regarding claim 19, the rejection of claim 11 is incorporated and Parash in combination with Mairal further teaches the processing device of claim 11,
wherein the main processing circuit includes at least one of a vector arithmetic unit circuit, an arithmetic logic unit (ALU) circuit (in Sec. 4.4: ALU for conducting arithmetic operations: SCNN compresses weights and activations to reduce both arithmetic operations and data movement…), an accumulator circuit, a matrix transposition circuit, a direct memory access (DMA) circuit or a data rearrangement circuit (in Sec. 4.4: data movement and reduction circuit: SCNN compresses weights and activations to reduce both arithmetic operations and data movement…). 

Regarding claim 20, the rejection of claim 11 is incorporated and Parash in combination with Mairal further teaches the processing device of claim 11,
wherein each of the plurality of basic processing circuits includes an inner-product arithmetic unit circuit (in pg. 33, Sec. 4.2: … These vectors are distributed into the F×I multiplier array which computes a form of the Cartesian product of the vectors, i.e., every input activation is multiplied by every weight to form a partial sum...) or an accumulator circuit (in pg. 33, Sec. 4.2: … The F×I products are delivered to an array of A accumulator banks, indexed by the output coordinates; and depicted in Fig. 6


[Figure: media_image6.png]

) 
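The cited passage describes every nonzero input activation being multiplied by every nonzero weight, with each product delivered to an accumulator indexed by its output coordinate. A minimal one-dimensional sketch of that dataflow is given below for illustration. It is a simplification and not Parashar's implementation: the function name, the `(value, index)` pair encoding, the `out_len` bound, and the use of a Python dictionary in place of accumulator banks are all assumptions.

```python
from collections import defaultdict


def cartesian_product_accumulate(weights, activations, out_len):
    """Multiply every nonzero weight by every nonzero activation and
    accumulate each product at its output coordinate.

    weights:     list of (value, r) pairs, r = filter offset (assumed encoding)
    activations: list of (value, x) pairs, x = input position (assumed encoding)
    out_len:     number of valid output positions (assumed parameter)
    """
    banks = defaultdict(float)  # stands in for the accumulator banks
    for w_val, r in weights:
        for a_val, x in activations:
            out = x - r  # output coordinate this partial product belongs to
            if 0 <= out < out_len:
                banks[out] += w_val * a_val
    return dict(banks)
```

For example, with a dense input `[1, 4, 0, 5]` stored as its three nonzero entries and a two-tap filter `[2, 3]`, the call `cartesian_product_accumulate([(2.0, 0), (3.0, 1)], [(1.0, 0), (4.0, 1), (5.0, 3)], 3)` yields the same three outputs as a dense sliding-window computation, while skipping the zero activation entirely.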


Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over Parashar et al. (NPL: “Scnn: An accelerator for compressed-sparse convolutional neural networks”, hereinafter ‘Parash’) in view of Mairal (NPL: “End-to-End Kernel Learning with Supervised Convolutional Kernel Networks”) and in further view of Lan et al. (NPL: “DLPlib: A Library for Deep Learning Processor”, hereinafter ‘Lan’).

Regarding claim 7, the rejection of claim 3 is incorporated and Parash in combination with Mairal further teaches the convolution operation method of claim 3,
wherein dividing the weight data into a plurality of basic data blocks further includes (Each tile extends fully into the input-channel dimension C [the first quantity as depicted in Fig. 2], resulting in an input-activation volume of C × Wt × Ht assigned to each PE. Weights are broadcast [by the main processing circuit] to the PEs; dividing, R × S as a two-dimensional weight matrix), wherein the second quantity is equal to a multiplication product of a number of the second plurality of three-dimensional convolution kernels and the first quantity (the second quantity is equal to C × Kc). (Thus the sub-volumes that are housed in buffers at the computation unit are: Weights: C × Kc × R × S, Inputs: C × W × H … Each iteration of this outer loop will require the weight buffer to be refilled).
While Parash depicts the use of matrix/vector operations to process the multi-dimensional kernel depicted in Fig. 2, and in Fig. 7 as a 2-D weight matrix, in pg. 33, Parash does not expressly teach the 2-D weight representation as matrix computations.
Lan does teach performing operations on tensor data structures as vector/matrix operations, in Abstract: It contains two major data structures, tensor and filter, and a set of operators including basic neural network primitives and matrix/vector operations; and in pg. 287, Sec. Intro, last para.: Operators include memory operators and computational operators. The former is in charge of copying and allocating memories, and the latter includes a set of deep learning primitives (e.g., convolution, pooling) and common basic matrix/vector operations (e.g., matrix multiplication). We build the library on an architecture similar to the recently proposed accelerator, Cambricon-X.
The Parash, Mairal, and Lan references would have been recognized by those of ordinary skill in the art as useful for applicant’s purpose in parallel architecture for executing deep neural networks operations.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Parash and Lan. Lan teaches performing neural network multi-dimensional operations as tensor structures performing vector and matrix operations.
One of ordinary skill in the art would have had motivation to apply the use of basic vector/matrix operations with the process for performing convolutional operations using multi-dimensional data structures, in order to make the data structures easier to optimize without compromising generality and to speed up processing times (Lan, Abstract and pg. 287, Right Col., 1st partial para.).
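As an illustration of how a convolution over multi-dimensional kernels can be expressed with basic matrix/vector operations, the sketch below flattens each three-dimensional kernel (C × R × S) into one row of a 2-D weight matrix and computes the convolution as a matrix product against flattened input patches. This is a generic, commonly used reformulation offered only as an example; it is not code from Parashar, Mairal, or Lan, and the function names, the stride of 1, and the absence of padding are assumptions.

```python
def flatten_kernels(kernels):
    # kernels[k][c][r][s] -> K rows, each of length C*R*S (2-D weight matrix).
    return [[w for channel in kernel for row in channel for w in row]
            for kernel in kernels]


def extract_patches(inputs, R, S):
    # inputs[c][h][w] -> one flattened C*R*S patch per output position
    # (assumed stride 1, no padding), in the same c, r, s order as the weights.
    C, H, W = len(inputs), len(inputs[0]), len(inputs[0][0])
    return [[inputs[c][y + r][x + s]
             for c in range(C) for r in range(R) for s in range(S)]
            for y in range(H - R + 1) for x in range(W - S + 1)]


def conv_as_matmul(kernels, inputs):
    # Convolution expressed as (K x CRS) weight matrix times (CRS x P) patches.
    R, S = len(kernels[0][0]), len(kernels[0][0][0])
    weight_rows = flatten_kernels(kernels)
    patches = extract_patches(inputs, R, S)
    return [[sum(w * p for w, p in zip(row, patch)) for patch in patches]
            for row in weight_rows]
```

Each output row corresponds to one kernel and each column to one output position, so the entire layer reduces to inner products between rows of the weight matrix and input patches.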


Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure and is listed below:

Dally et al. (US Pub. No. 2018/0046900): teaches the SCNN processing system in the Parashar reference used to compute convolution operations.


Chien et al. (NPL: “Tensor-factorized neural networks”): teaches tensors used to perform convolution operations in an artificial neural network deep architecture.


Wang, M., Liu, B., & Foroosh, H. (2017). (NPL: “Factorized convolutional neural network”): teaches the use of circuitry for processing neural network convolutions using a 3-D tensor with a fourth dimension that captures the number of blocks for processing the sliding window in the 3-D space, see Fig. 2.

Huang et al. (NPL: “Characterizing types of convolution in deep convolutional recurrent neural networks for robust speech emotion recognition”): teaches types of convolutional operations including pooling functions and pooling operations.

Suda et al. (NPL: “Throughput-Optimized OpenCL-based FPGA Accelerator for Large-Scale Convolutional Neural Networks”): teaches a method for CNN accelerator optimization using parallel computing resources and an OpenCL implementation of matrix multiplication operations using MAC; and interfacing IPs as branch circuits for communication operations between the host and accelerator board.

Chellapilla et al. (NPL: “High Performance Convolutional Neural Networks for Document Processing”): teaches M as the number of inputs associated with computing the matrix multiplications resulting in the feature map kernels.

Pande et al. (NPL: “Matrix Convolution using Parallel Programming”): teaches parallel processing of the matrix convolution operations using convolution filters to perform multiplication operations.

Tsai et al. (NPL: “Performance-Portable Autotuning of OpenCL Kernels for Convolutional Layers of Deep Neural Networks”): teaches the processing data type as 3-dimensional objects with a notation for the number of 3-D objects as the N and K values, as depicted in Fig. 2.

Wasserman et al. (US Patent No. 7,737,994): teaches the processing of large kernel convolutions using graphics accelerators to optimize the computation time using parallel computing.

Goyal et al. (US Pub. No. 2017/0316312): teaches deep learning processor for processing matrix-matrix multiplication operations for a convolutional network. 

Cohen et al. (US Pub. No. 2019/0102671): teaches a method for performing an inner product as a convolution operation in a parallel computing environment.


Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ann Lo, can be reached at (571) 272-9767. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/O.O.A./Examiner, Art Unit 2126  
/ANN J LO/Supervisory Patent Examiner, Art Unit 2126