DETAILED ACTION

Notice of Pre-AIA or AIA Status
1.	The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Status of Claims
2.	The following claim(s) is/are pending in this Office action: 1-20. 
3.	Claim(s) 1-20 are rejected.  This rejection is NON-FINAL.

Information Disclosure Statement
4.	The information disclosure statement (IDS) submitted on 11/20/2020 is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.

Specification
5.	The disclosure is objected to because of the following informalities:
(a)	¶ [0077] (as published in USPG Pub 2021/0019591): This paragraph includes a minor informality. The examiner suggests amending ¶ [0077] to recite “Similar to the above technique that avoids multiple fetching of stationary operand(s), the number of times the streaming operand is to be fetched can be reduced for the case where the GEMM is larger than AcclX but input feature count is less than AcclY×‘r’. In this case, the reuse count of the streaming operand can be increased by up to ‘r’ times.”

Appropriate correction is required.

Claim Objections
6.	Claims 1-2, 7, and 11 stand objected to because of the following informalities:  
(a)	Claims 1 and 11:
(1)	The claimed limitation “assigning, by the processor, a first group of PEs in the PE array, to a first one of the subarrays” contains a grammatical error – the comma between “array” and “to” should be removed.  The examiner suggests amending this limitation to recite “assigning, by the processor, a first group of PEs in the PE array[[,]] to a first one of the subarrays”.
 (b)	Claim 7:
 (1)	The limitation “generating, by multiplier and accumulator (MAC) circuitry …” contains a clerical informality of missing an article for “multiplier and accumulator (MAC) circuitry”. The examiner suggests amending the above limitation to recite “generating, by a multiplier and accumulator (MAC) circuitry …”.
Appropriate correction is required.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

7.	The claims identified below are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
  (a)	Claims 3 and 13:
(1) The two recitations of “a size” and two recitations of “the size” are indefinite because it is unclear whether these two “sizes” are the same size or different sizes.  For the purpose of examination, these two sizes are interpreted as different sizes as recited in claim 3.  Thus, distinguishing modifiers are required for these two recitations of “size”.
(2) The two recitations of “positions” are indefinite because it is unclear whether these two “positions” are the same positions or different positions.  For the purpose 
(3) The limitations “the first array in the first dimension” and “the PE array in the first dimension” are indefinite because it is unclear whether the PE array in the first dimension is identical to or different from the first array in the first dimension.  For the purpose of examination, the first array in the first dimension is interpreted as different from the PE array in the first dimension.
(4) The limitation “in response to determining that the size of the first array in the first dimension is greater than the size of the PE array in the first dimension” is indefinite because it is unclear why the size of the first array is different from (greater than) the size of the PE array if the limitation “the dimension” has the same antecedent basis.  
 (b)	Claims 8 and 18:
(1)	The limitation “the first number of computed dot products” lacks proper antecedent basis. The examiner suggests amending the limitation to recite “a first number of computed dot products”.
(c)	Claims 9 and 19:
(1)	The limitation “the second number of computed dot products” lacks proper antecedent basis. The examiner suggests amending the limitation to recite “a second number of computed dot products”.

(1) The two recitations of “input data streams” are indefinite as it is unclear whether these two recitations of “input data streams” are identical or different.  For the purpose of examination, these two recitations are interpreted as the same input data streams.
Clarification is required. 


Claim Rejections - 35 USC § 101
8.	35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1- rejected under 35 U.S.C. 101 because the claimed invention is directed to a judicial exception without significantly more. 

Step 1: claims 1-10 are directed to the statutory class of processes; and claims 11-20 are directed to the statutory class of machines.
 
Step 2A – Prong One: This part of the eligibility analysis evaluates whether the claim recites a judicial exception. As explained in MPEP 2106.04(II) and the October 2019 Update, a claim “recites” a judicial exception when the judicial exception is “set forth” or “described” in the claim.
With respect to claim 1, claim 1 recites the abstract idea as shown in the following bolded limitations. The non-bolded limitations denote additional elements that are analyzed under Step 2A Prong Two and Step 2B below.  Claim 11 also recites identical and/or substantially similar limitations as claim 1 and is rejected in the same manner, the same art and reasoning applying. 
receiving, by a processor, input data to generate a plurality of outputs for a layer of a neural network, the plurality of outputs being arranged in a first array;
comparing, by the processor, dimensions of the first array with dimensions of a processing unit (PE) array comprising a plurality of PEs;
partitioning, by the processor according to a result of the comparing, the first array into subarrays each having dimensions less than or equal to the dimensions of the PE array;
assigning, by the processor, a first group of PEs in the PE array, to a first one of the subarrays; and
generating, by each PE of the first group of PEs assigned to the first one of the subarrays, a corresponding output of the plurality of outputs using a portion of the input data. 
(a)       Step 2A Prong One:
(1)      Regarding the limitation comparing, by the processor, dimensions of the first array with dimensions of a processing unit (PE) array comprising a plurality of PEs: (mathematical principle / relationship: The examiner notes that this limitation merely recites a simple mathematical principle of comparing two numbers (e.g., dimensions) of two arrays / matrices and is thus directed to an abstract idea and fails Step 2A Prong One. See MPEP § 2106.04-(I).)  
(2)      Regarding the limitation partitioning, by the processor according to a result of the comparing, the first array into subarrays each having dimensions less than or equal to the dimensions of the PE array: (mathematical principle and/or algorithms:  The examiner notes that this limitation merely recites a mathematical principle / relationship of dividing an array/matrix into multiple portions and is thus directed to an abstract idea and fails Step 2A Prong One. See MPEP § 2106.04-I.) 
(3)      Regarding the limitation assigning, by the processor, a first group of PEs in the PE array, to a first one of the subarrays: (mathematical principle and/or algorithms:  The examiner notes that this limitation merely recites the basic mathematical algorithm/principle of correlating elements in one array/matrix with the corresponding elements in another array/matrix and performing basic assignment based on the correlation and is thus directed to an abstract idea and fails Step 2A Prong One. See MPEP § 2106.04-I.)
(4)      Regarding the limitation generating, by each PE of the first group of PEs assigned to the first one of the subarrays, a corresponding output of the plurality of outputs using a portion of the input data: (mathematical principle and/or algorithms:  The examiner notes that this limitation merely recites the basic mathematical algorithms/principle of computing an output based on received input and is thus directed to an abstract idea and fails Step 2A Prong One. See MPEP § 2106.04-I.)

(b)      Step 2A Prong Two:
          Claim 1, when analyzed individually, recites the additional element of “receiving, by a processor, input data to generate a plurality of outputs for a layer of a neural network, the plurality of outputs being arranged in a first array”: The examiner notes that this limitation merely constitutes data gathering which has been found to constitute insignificant extra-solution activity that fails to integrate the claimed judicial exception into a practical application.
Regarding the limitations “comparing, by the processor, dimensions of the first array with dimensions of a processing unit (PE) array comprising a plurality of PEs”, “partitioning, by the processor according to a result of the comparing, the first array into subarrays each having dimensions less than or equal to the dimensions of the PE array”, and “assigning, by the processor, a first group of PEs in the PE array, to a first one of the subarrays”: (The examiner notes that these three limitations, when analyzed individually, merely apply the claimed, respective abstract idea (mathematical algorithms / principle) to a processor that is recited at a high level of generality. This has been held to be insufficient to integrate the claimed judicial exception into a practical application.  See MPEP § 2106.05(f)(3) citing Internet Patents Corporation v. Active Network, Inc., 790 F.3d 1343, 1348, 115 USPQ2d 1414, 1418 (Fed. Cir. 2015).)  
Regarding the limitation generating, by each PE of the first group of PEs assigned to the first one of the subarrays, a corresponding output of the plurality of outputs using a portion of the input data:  (The examiner notes that this limitation, when analyzed individually, merely applies the claimed abstract idea (mathematical algorithms / principle) to a processing element that is also recited at a high level of generality. This has been held to be insignificant post-solution activity that is insufficient to integrate the claimed judicial exception into a practical application.  See MPEP § 2106.05(f)(3).)


(c)      Step 2B:
          Claim 1, when analyzed as an ordered combination, recites the additional elements of “receiving, by a processor, input data to generate a plurality of outputs for a layer of a neural network, the plurality of outputs being arranged in a first array”: The examiner notes that these additional elements, when analyzed as an ordered combination, merely constitute data gathering which has been held as well-understood, routine, and conventional activity of receiving or sending data over a network. See MPEP 2106.05(d) citing buySAFE Inc. v. Google, Inc., 765 F.3d 1350, 1354, 112 USPQ2d 1093, 1095-96 (Fed. Cir. 2014).  Moreover, the limitation merely applies the claimed judicial exception to a generic computer processor (“apply it”) that is recited at a high level of generality that has been held to fail to amount to significantly more than the claimed judicial exception. See MPEP 2106.05(f) citing Alice Corp. v. CLS Bank, 573 U.S. 208, 221, 110 USPQ2d 1976, 1982-83 (2014).
	Claim 1 further recites the limitations “comparing, by the processor, dimensions of the first array with dimensions of a processing unit (PE) array comprising a plurality of PEs”, “partitioning, by the processor according to a result of the comparing, the first array into subarrays each having dimensions less than or equal to the dimensions of the PE array”, and “assigning, by the processor, a first group of PEs in the PE array, to a first one of the subarrays”: (The examiner notes that these three limitations, when analyzed as an ordered combination, merely recite a series of mathematical algorithms and apply the respective claimed judicial exceptions (e.g., mathematical algorithms / principle) to a generic computer processor (“apply it”) that is recited at a high level of generality that has been held to fail to amount to significantly more than the claimed judicial exception. See MPEP 2106.05(f) citing Alice Corp. v. CLS Bank, 573 U.S. 208, 221, 110 USPQ2d 1976, 1982-83 (2014).)
	Lastly, claim 1 also recites the limitation generating, by each PE of the first group of PEs assigned to the first one of the subarrays, a corresponding output of the plurality of outputs using a portion of the input data: (The examiner notes that this limitation, when analyzed as an ordered combination, merely recites an insignificant post-solution activity of generating an output by applying the claimed judicial exceptions using generic computer components or functions (e.g., “processor” and “processing element”) that are recited at a high level of generality. This insignificant post-solution activity has been found to be insufficient to amount to significantly more than the claimed judicial exception to satisfy Step 2B.  See MPEP § 2106.05(g).)  
          Therefore, the examiner notes that claim 1 recites the aforementioned judicial exceptions without additional elements that amount to significantly more than the aforementioned judicial exception.  Claim 1 is thus rejected under 35 U.S.C. § 101 for at least the foregoing reasons.
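By way of illustration only, the comparing, partitioning, and assigning limitations of claim 1 discussed above can be pictured with the following minimal Python sketch. It is not drawn from the application or from the cited art; the function names `partition_into_subarrays` and `assign_pe_group` and the tile-by-tile policy are hypothetical assumptions chosen to make the steps concrete.

```python
def partition_into_subarrays(first_shape, pe_shape):
    """Split a (rows, cols) output array into tiles no larger than the PE array.

    Hypothetical helper for illustration; the tiling policy is an assumption.
    Returns a list of (row_start, col_start, rows, cols) tuples.
    """
    rows, cols = first_shape
    pe_rows, pe_cols = pe_shape
    # Comparing step: if the output array already fits, one subarray suffices.
    if rows <= pe_rows and cols <= pe_cols:
        return [(0, 0, rows, cols)]
    # Partitioning step: walk the output array in PE-array-sized strides.
    tiles = []
    for r in range(0, rows, pe_rows):
        for c in range(0, cols, pe_cols):
            tiles.append((r, c, min(pe_rows, rows - r), min(pe_cols, cols - c)))
    return tiles


def assign_pe_group(tile):
    """Assigning step: list the (row, col) PE positions that would hold one tile."""
    _, _, rows, cols = tile
    return [(i, j) for i in range(rows) for j in range(cols)]


# A 6x10 output array mapped onto a 4x4 PE array (assumed sizes).
tiles = partition_into_subarrays((6, 10), (4, 4))
first_group = assign_pe_group(tiles[0])
```

In this sketch every tile is no larger than the PE array in either dimension, and the tiles together cover the output array exactly once.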
 

the PE array is a two-dimensional M×N array wherein each of M and N is an integer greater than 1, and
the partitioning of the first array comprises partitioning the first array into one or more of an M×N array, an M×N/2 array, an M/2×N array, or an M/2×N/2 array.
Step 2A Prong One:
          Regarding the limitation the PE array is a two-dimensional M×N array wherein each of M and N is an integer greater than 1: (mathematical algorithm and mental process: the examiner notes that this limitation is merely directed to a mathematical algorithm that describes the dimensions of an array/matrix. See MPEP § 2106.04(a)(1). In addition, the examiner notes that this limitation is also directed to a mental process that can be performed by a human analog who observes or evaluates the dimensions of a matrix.  See MPEP § 2106.04(a).) 
          Regarding the limitation the partitioning of the first array comprises partitioning the first array into one or more of an M×N array, an M×N/2 array, an M/2×N array, or an M/2×N/2 array: (mathematical algorithm and mental process: the examiner notes that this limitation is also directed to a mathematical algorithm that partitions an array/matrix. See MPEP § 2106.04(a)(1). In addition, the examiner notes that this limitation is also directed to a mental process that can be performed by a human analog who partitions a matrix by halving one or both dimensions of a matrix.  See MPEP § 2106.04(a).)
	Therefore, the examiner asserts that claim 2 merely recites an abstract idea of a mathematical algorithm / principle and/or a mental process and thus fails Step 2A Prong One. 

Step 2A Prong Two & Step 2B:
The examiner notes that claim 2 does not recite any additional elements, much less additional elements that integrate a judicial exception into a practical application or amount to significantly more than the claimed judicial exception. 
Therefore, claim 2 fails both Step 2A Prong Two and Step 2B and is thus directed to patent ineligible subject matter. 

With respect to claim 3, claim 3 recites the judicial exception as shown in the following bolded limitations. Claim 13 also recites identical and/or substantially similar limitations as claim 3 and is rejected in the same manner, the same art and reasoning applying. 
determining whether a size of the first array in a first dimension is greater than a size of the PE array in the first dimension; and
in response to determining that the size of the first array in the first dimension is greater than the size of the PE array in the first dimension:
partitioning the first array into the first one and a second one of the subarrays; and
assigning the first subarray to the first group of PEs in the PE array, and the second subarray to a second group of PEs in the PE array, wherein positions of the first group of PEs in a second dimension different from the first dimension are different from positions of the second group of PEs in the second dimension.

Step 2A Prong One:
          Regarding the limitation determining whether a size of the first array in a first dimension is greater than a size of the PE array in the first dimension: (mathematical algorithm and mental process: the examiner notes that this limitation is merely directed to a mathematical algorithm/concept that compares two numbers (e.g., dimensions of two arrays) with each other to find the greater or smaller of the numbers. See MPEP § 2106.04(a)(1). In addition, the examiner notes that this limitation is also directed to a mental process that can be performed by a human analog who evaluates the dimensions of two matrices or arrays to determine which dimension is larger.  See MPEP § 2106.04(a).)
          Regarding the limitation in response to determining that the size of the first array in the first dimension is greater than the size of the PE array in the first dimension: partitioning the first array into the first one and a second one of the subarrays: (mathematical algorithm and mental process: the examiner notes that this limitation is also directed to a mathematical algorithm that partitions an array/matrix. See MPEP § 2106.04(a)(1). In addition, the examiner notes that this limitation is also directed to a mental process that can be performed by a human analog who, after mentally determining which array/matrix is larger, partitions the larger matrix or array into multiple sub-arrays/matrices.  See MPEP § 2106.04(a).)
          Regarding the limitation in response to determining that the size of the first array in the first dimension is greater than the size of the PE array in the first dimension: assigning the first subarray to the first group of PEs in the PE array, and the second subarray to a second group of PEs in the PE array, wherein positions of the first group of PEs in a second dimension different from the first dimension are different from positions of the second group of PEs in the second dimension: (mathematical algorithm and mental process: the examiner notes that this limitation is again directed to a mathematical algorithm that correlates one element (e.g., the claimed first subarray) with another element (e.g., the recited “group of PEs”). See MPEP § 2106.04(a)(1). In addition, the examiner notes that this limitation is also directed to a mental process that can be performed by a human analog who sends data to a particular solution process (e.g., “group of PEs”) for processing.  See MPEP § 2106.04(a).)
	Therefore, the examiner asserts that claim 3 merely recites an abstract idea of a mathematical algorithm / principle and/or a mental process and thus fails Step 2A Prong One. 

Step 2A Prong Two & Step 2B:
The examiner notes that claim 3 does not recite any additional elements, much less additional elements that integrate a judicial exception into a practical application or amount to significantly more than the claimed judicial exception. 


With respect to claim 4, claim 4 recites the judicial exception as shown in the following bolded limitations. Claim 14 also recites identical and/or substantially similar limitations as claim 4 and is rejected in the same manner, the same art and reasoning applying. 
identifying a common portion of the input data to be used by both the first and second groups of PEs; and
shifting the common portion of the input data into the first and second groups of PEs.
Step 2A Prong One:
          Regarding the limitation identifying a common portion of the input data to be used by both the first and second groups of PEs: (mental process: the examiner notes that this limitation is directed to a mental process that can be performed by a human analog who observes the inputs for two or more solution processes and identifies the input that is needed for these two or more solution processes.  See MPEP § 2106.04(a).) 
          Regarding the limitation shifting the common portion of the input data into the first and second groups of PEs: (mental process: the examiner notes that this limitation is also directed to a mental process that can be performed by a human analog who, after mentally determining which input is needed for the two or more solution processes, provides the input to these two or more solution processes that may also be another mental process for the human analog to carry out.  See MPEP § 2106.04(a).)
	Therefore, the examiner asserts that claim 4 merely recites an abstract idea of a mathematical algorithm / principle and/or a mental process and thus fails Step 2A Prong One. 

Step 2A Prong Two & Step 2B:
The examiner notes that claim 4 does not recite any additional elements, much less additional elements that integrate a judicial exception into a practical application or amount to significantly more than the claimed judicial exception. 
Therefore, claim 4 fails both Step 2A Prong Two and Step 2B and is thus directed to patent ineligible subject matter. 

With respect to claim 5, claim 5 recites the judicial exception as shown in the following bolded limitations. Claim 15 also recites identical and/or substantially similar limitations as claim 5 and is rejected in the same manner, the same art and reasoning applying. 
identifying a first portion of the input data to be used by the first group of PEs, and a second portion of the input data to be used by the second group of PEs;
shifting the first portion of the input data into the first group of PEs; and
shifting the second portion of the input data into the second group of PEs.
Step 2A Prong One:
          Regarding the limitation identifying a first portion of the input data to be used by the first group of PEs, and a second portion of the input data to be used by the second group of PEs: (mental process: the examiner notes that this limitation is directed to a mental process that can be performed by a human analog who observes and/or evaluates two portions of inputs respectively used by two solution processes.  See MPEP § 2106.04(a).) 
          Regarding the limitation shifting the first portion of the input data into the first group of PEs; and shifting the second portion of the input data into the second group of PEs: (mental process: the examiner notes that this limitation is also directed to a mental process that can be performed by a human analog who, after mentally observing and/or evaluating respective inputs for the aforementioned two solution processes, provides or uses a first portion of the observed input in one of the two solution processes and another portion of the observed input in the other solution process.  See MPEP § 2106.04(a).)
	Therefore, the examiner asserts that claim 5 merely recites an abstract idea of a mathematical algorithm / principle and/or a mental process and thus fails Step 2A Prong One. 

Step 2A Prong Two & Step 2B:
The examiner notes that claim 5 does not recite any additional elements, much less additional elements that integrate a judicial exception into a practical application or amount to significantly more than the claimed judicial exception. 


With respect to claim 6, claim 6 recites the judicial exception as shown in the following bolded limitations. Claim 16 also recites identical and/or substantially similar limitations as claim 6 and is rejected in the same manner, the same art and reasoning applying. 
wherein the plurality of outputs are outputs of convolution operations for the layer of the neural network.
Step 2A Prong One:
          Regarding the limitation wherein the plurality of outputs are outputs of convolution operations for the layer of the neural network: (The examiner notes that this limitation merely recites an additional element that is analyzed in Step 2A Prong Two.)
	 

Step 2A Prong Two:
Regarding the additional element wherein the plurality of outputs are outputs of convolution operations for the layer of the neural network: (The examiner notes that this additional element fails to integrate the claimed judicial exception into a practical application. More specifically, this additional element merely generally links the use of a judicial exception to a particular technological environment or field of use (e.g., convolution operations of a layer in a neural network).  Claim 6, being directed to a judicial exception, cannot be made eligible “simply by having the applicant acquiesce to limiting the reach of the patent for the formula to a particular technological use.” See MPEP § 2106.05(h). Therefore, claim 6 fails to integrate a judicial exception into a practical application to satisfy Step 2A Prong Two.) 

Step 2B: 
	Regarding the additional element wherein the plurality of outputs are outputs of convolution operations for the layer of the neural network: (The examiner notes that this additional element also fails to amount to significantly more than the claimed judicial exception. As discussed above with respect to integration of the judicial exception into a practical application, this additional element merely generally links the use of a judicial exception to a particular technological environment or field of use.  Therefore, limitations that amount to merely indicating a field of use or technological environment in which to apply a judicial exception do not amount to significantly more than the exception itself. See MPEP § 2106.05(h).) 
As such, the above additional element in claim 6 fails to amount to significantly more than the claimed judicial exception to satisfy Step 2B and is thus directed to patent ineligible subject matter. 


Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any 

9.	Claim(s) 1 and 11 is/are rejected under 35 U.S.C. 103 as being unpatentable over Song et al., HyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array (January 7, 2019) (hereinafter Song).
With respect to claim 1, Song teaches: 
A method comprising: receiving, by a processor, input data to generate a plurality of outputs for a layer of a neural network, the plurality of outputs being arranged in a first array; (Song at ¶ 3, § I, p. 1: “To achieve high performance and energy efficiency, hardware acceleration of DNNs is intensively studied both in academia [8–90] and industry [91–101]. In particular, several major companies developed 1) DNN accelerators, e.g., Google TPU [91,92], and neuro-processors, e.g., IBMTrueNorth [93–95]; 2) corresponding standards, architectures, and platforms [96, 98–100].” ¶ 1, § 2.1, p. 2: “The inference of deep neural networks is a forward progress of input data (typically images) from the first layer to the last layer.” ¶ § 3.1.1, p. 4: “For a convolutional layer l, we use Fl to represent feature maps of this layer”; and “The size of the feature map slice is [ Hl × Wl × Cl]. Thus, Fl is of size B ×[ Hl × Wl × Cl ]. The kernel Wl has a size of [ K ×K × Cl ] × Cl + 1, where K is the height/width of kernels and Cl+1 is the number of channels of next layer, Layer l +1. f (·) is an activation function, performing element-wise non-linear operations. We use ⊗ to denote convolutions. The inference (forward propagation) can be represented as, Fl+1 = f (Fl ⊗ Wl)     (1).” Last paragraph, left-hand column, p. 2: “we propose a solution HYPAR to determine layer-wise parallelism for deep neural network training with an array of DNN accelerators.”
The examiner first notes that Song’s Google TPU, IBM TrueNorth, etc. render the claimed processor obvious.  The examiner further notes that the input data that is used by Song’s HyPar to produce the feature map tensor(s) (Fl and/or Fl+1) teaches receiving input data by a processor, and that Song’s feature map tensor(s) (Fl and/or Fl+1) teach a plurality of outputs arranged in an array with dimensions of, for example, [Hl × Wl × Cl] for Fl.)

comparing, by the processor, dimensions of the first array with dimensions of a processing unit (PE) array comprising a plurality of PEs; (Song, p. 4, § 3.1, ¶ 1: “Assume we have two accelerators, the batch size is B = 32. Let us consider a fully-connected layer, where the number of input and output neurons are 70 and 100, respectively. Thus, the feature map Fl has a size of 32 ×70, the kernel (weight matrix) has a size of 70 ×100 and Fl+1 has a size of 32 ×100.” ¶ 3, § 3.1.1, p. 4: “For the one holding the rectangles with shadow lines, it performs a multiplication with a size of [16 ×70] → [70 ×100] ⇒ [16 ×100].” ¶ 1, § 3.1.2, p. 4: “In model parallelism, the kernel is partitioned, and feature maps are partitioned accordingly. In forward, each accelerator performs computation for the matrices Fl →Wl ⇒Fl+1 with sizes of [32 ×35] → [35 ×100] ⇒ [32 ×100].” FIG. 4(a)-(d): Caption on § 4, p. 7: “Figure 4: Overall view of (a) an HMC-based accelerator, (b) a row stationary processing unit, (c) an array of sixteen accelerators in H tree, and (d) the accelerator array in torus.”

[Figure omitted: Song, FIG. 4(a)-(d)]

The examiner first notes that Song’s array of accelerators as shown in FIG. 4(a), (b), (c), and/or (d) teaches a PE array.  The examiner further notes that one or more neurons (e.g., a portion of a model in Song’s model parallelism) in an accelerator, an accelerator in Song’s multi-accelerator architecture, or a portion of an accelerator having multiple neurons and holding a corresponding portion of the model teaches a processing element, and that Song’s PE array thus comprises a plurality of PEs. The examiner also notes that Song’s dimensions of the output feature tensor(s) (e.g., the feature map tensor Fl+1) teach dimensions of the first array, and that Song’s kernel size (e.g., the original [70 x 100] dimensions with 70 input neurons and 100 output neurons or the halved [35 x 100] dimensions) teaches dimensions of the PE array. 
The examiner further notes that Song’s § 3.1 cited above teaches the example where given the output feature map Fl+1 having certain dimensions, comparing dimensions ([32 x 100]) of the output feature map Fl+1 and dimensions of the input feature map Fl ([32 x 70]) to determine whether and how to implement data parallelism (by using the input [16 x 70] as required by the output dimensions 32 x 100 and the number of input neurons) and model parallelism (e.g., by using the input of 32 x 35 as required by the output dimensions of 32 x 100 and one-half of input neurons) as described in §§ 3.1.1 and 3.1.2, supra.  Therefore, Song teaches the above limitation.)
 
partitioning, by the processor according to a result of the comparing, the first array into subarrays each having dimensions less than or equal to the dimensions of the PE array; (Song, p. 4, § 3, ¶ 1: “Assume we have two accelerators, the batch size is B = 32. Let us consider a fully-connected layer, where the number of input and output neurons are 70 and 100, respectively. Thus, the feature map Fl has a size of 32 ×70, the kernel (weight matrix) has a size of 70 ×100 and Fl+1 has a size of 32 ×100.”
FIG. 1(a) and § 3.1.1, p. 4 “Data Parallelism”:

(reproduction of Song FIG. 1(a) omitted)

p. 4, § 3.1.1, ¶ 3: “For the one holding the rectangles with shadow lines, it performs a multiplication with a size of [16 ×70] → [70 ×100] ⇒ [16 ×100].”  The examiner first notes that Song’s accelerator or the two-accelerator architecture for model parallelism teaches a PE array.  The examiner also notes that Song teaches model parallelism and data parallelism by comparing dimensions of the PE array with dimensions of the first array (see Song, supra).  The examiner further notes that the cited figure and passages teach that Song partitions the first array originally having dimensions of [32 x 100] into two [16 x 100] subarrays such that Song’s data parallelism described in § 3.1.1 may use two PE arrays, each of which processes the respective input data. Further, the respective input data has dimensions of [16 x 70] due to halving the input data for partitioning the first array.  Moreover, a corresponding kernel having the dimensions of [70 x 100] is determined so that the product produces [16 x 100] (e.g., [16 x 70] x [70 x 100] = [16 x 100]) as partitioned.  The examiner also notes that in Song’s data parallelism, the PE array uses 70 input neurons (hence [16 x 70] for the input) and 100 output neurons (hence [70 x 100] for the kernel) and hence has the dimensions of [70 x 100], which are larger than those of the partitioned output subarray of [16 x 100].  Each subarray thus has dimensions less than or equal to the dimensions of the PE array. Therefore, the examiner asserts that Song teaches the above limitation.)
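For illustration only, the shape arithmetic relied on above can be sketched in Python. The function names (matmul_shape, data_parallel_shapes, fits) are hypothetical and do not appear in Song; the sketch assumes the batch divides evenly across the accelerators and adopts the mapping stated above, under which the [70 x 100] kernel size is read as the dimensions of the PE array.

```python
# Illustrative sketch only; names are hypothetical and not taken from Song.

def matmul_shape(a, b):
    """Return the shape of A x B for 2-D shapes a = (m, k) and b = (k, n)."""
    if a[1] != b[0]:
        raise ValueError(f"inner dimensions differ: {a} x {b}")
    return (a[0], b[1])


def data_parallel_shapes(batch, n_in, n_out, n_acc):
    """Split the batch dimension evenly across n_acc accelerators.

    Each accelerator keeps the full kernel and receives a slice of the
    input feature map, as in the data parallelism example of Song § 3.1.1.
    """
    if batch % n_acc != 0:
        raise ValueError("this sketch assumes the batch divides evenly")
    rows = batch // n_acc
    feature_slice = (rows, n_in)   # e.g. [16 x 70]
    kernel = (n_in, n_out)         # full kernel, e.g. [70 x 100]
    return [matmul_shape(feature_slice, kernel) for _ in range(n_acc)]


def fits(sub, pe):
    """True if every dimension of sub is <= the matching dimension of pe."""
    return all(s <= p for s, p in zip(sub, pe))


subarrays = data_parallel_shapes(batch=32, n_in=70, n_out=100, n_acc=2)
print(subarrays)                                    # [(16, 100), (16, 100)]
print(all(fits(s, (70, 100)) for s in subarrays))   # True
```

The last line checks only the "less than or equal to" comparison under the mapping stated in this rejection; it is not a statement of how Song itself sizes its accelerators.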
 
assigning, by the processor, a first group of PEs in the PE array, to a first one of the subarrays; and (Song, p. 4, § 3.1.1, ¶ 3: “In forward, each accelerator performs the computation in Equation 1”; and “For the one holding the rectangles with shadow lines, it performs a multiplication with a size of [16 ×70] → [70 ×100] ⇒ [16 ×100].” The examiner notes that halving the feature map tensor (Fl) from the original dimensions [32 x 70] into two inputs each having the dimensions of [16 x 70] and respectively processing these two [16 x 70] inputs with Song’s two accelerators to generate two [16 x 100] outputs teaches assigning a first group of PEs (e.g., one of the two accelerators) to a first one of the subarrays (e.g., one-half of the feature map tensor Fl+1).)

generating, by each PE of the first group of PEs assigned to the first one of the subarrays, a corresponding output of the plurality of outputs using a portion of the input data. (Song at Eq. 1, p. 2: Fl+1 = f (Fl ⊗ Wl)      (1).  ¶ 3, § 3.1.1, p. 4 cited immediately above: “In forward, each accelerator performs the computation in Equation 1”; and “For the one holding the rectangles with shadow lines, it performs a multiplication with a size of [16 ×70] → [70 ×100] ⇒ [16 ×100].” ¶ 1, § 3.1.2, p. 4 cited immediately above: “In model parallelism, the kernel is partitioned, and feature maps are partitioned accordingly. In forward, each accelerator performs computation for the matrices Fl →Wl ⇒Fl+1 with sizes of [32 ×35] → [35 ×100] ⇒ [32 ×100].” The examiner first notes that one or more neurons, an accelerator in Song’s multi-accelerator architecture, or a portion of an accelerator teaches a PE that receives the corresponding input from the input feature map Fl.  The examiner further notes that Song’s multi-accelerator architecture or a portion thereof (e.g., a processing unit as shown in FIG. 4(b), an accelerator having multiple neurons as shown in FIG. 4(c), etc.) teaches a PE array which thus comprises a plurality of PEs. The examiner thus notes that Song teaches that each PE in a first group is assigned to a first subarray such as the subarray that processes the input having a size of [16 x 70].  The examiner further notes that each processing unit in Song’s accelerator performs the matrix or tensorial product (⊗) of a row in the feature map tensor (Fl) and a column in the kernel or weight (Wl) according to Eq. (1) to generate and populate an output element into a corresponding portion (e.g., the first one of the subarrays) in the output array (e.g., the output feature map Fl+1) and thus teaches the above limitation.)

With respect to claim 11, Song teaches: 
A device comprising: 
a processor; and (Song, p. 1, § I, ¶ 3: “To achieve high performance and energy efficiency, hardware acceleration of DNNs is intensively studied both in academia [8–90] and industry [91–101]. In particular, several major companies developed 1) DNN accelerators, e.g., Google TPU [91,92], and neuro-processors, e.g., IBM TrueNorth [93–95]; 2) corresponding standards, architectures, and platforms [96, 98–100].” The examiner first notes that Song’s Google TPU, IBM TrueNorth, etc. render the claimed processor obvious.)
a processing unit (PE) array comprising a plurality of PEs, wherein the processor is configured to: (Song, p. 1, Abstract, ¶ 2: “In this paper, we propose a solution HYPAR to determine layer-wise parallelism for deep neural network training with an array of DNN accelerators.” P. 4, § 3.1, ¶ 1: “Assume we have two accelerators, the batch size is B = 32. Let us consider a fully-connected layer, where the number of input and output neurons are 70 and 100, respectively. Thus, the feature map Fl has a size of 32 ×70, the kernel (weight matrix) has a size of 70 ×100 and Fl+1 has a size of 32 ×100.” ¶ 3, § 3.1.1, p. 4: “For the one holding the rectangles with shadow lines, it performs a multiplication with a size of [16 ×70] → [70 ×100] ⇒ [16 ×100].” ¶ 1, § 3.1.2, p. 4: “In model parallelism, the kernel is partitioned, and feature maps are partitioned accordingly. In forward, each accelerator performs computation for the matrices Fl →Wl ⇒Fl+1 with sizes of [32 ×35] → [35 ×100] ⇒ [32 ×100].” FIG. 4(a)-(d): Caption on § 4, p. 7: “Figure 4: Overall view of (a) an HMC-based accelerator, (b) a row stationary processing unit, (c) an array of sixteen accelerators in H tree, and (d) the accelerator array in torus.”

(reproduction of Song FIG. 4(a)-(d) omitted)

The examiner first notes that Song’s array of accelerators as shown in FIG. 4(a), (b), (c), and/or (d) teaches a PE array.  The examiner further notes that one or more neurons (e.g., one or more neurons pertaining to a portion of a model in Song’s model parallelism) in an accelerator, an accelerator in Song’s multi-accelerator architecture, or a portion of an accelerator (e.g., via partitioning) having multiple neurons and a kernel and holding a corresponding portion of the model for Song’s model parallelism teaches a processing element, and that Song’s PE array thus comprises a plurality of PEs as recited.)
receive input data to generate a plurality of outputs for a layer of a neural network, the plurality of outputs being arranged in a first array; (Song, p. 2, § 2.1, ¶ 1: “The inference of deep neural networks is a forward progress of input data (typically images) from the first layer to the last layer.” ¶ § 3.1.1, p. 4: “For a convolutional layer l, we use Fl to represent feature maps of this layer”; and “The size of the feature map slice is [ Hl × Wl × Cl]. Thus, Fl is of size B ×[ Hl × Wl × Cl ]. The kernel Wl has a size of [ K ×K × Cl ] × Cl + 1, where K is the height/width of kernels and Cl+1 is the number of channels of next layer, Layer l +1. f (·) is an activation function, performing element-wise non-linear operations. We use ⊗ to denote convolutions. The inference (forward propagation) can be represented as, Fl+1 = f (Fl ⊗ Wl)     (1).” Last paragraph, left-hand column, p. 2: “we propose a solution HYPAR to determine layer-wise parallelism for deep neural network training with an array of DNN accelerators.”
The examiner notes that the input data that is used by Song’s HyPar to produce the feature map tensor(s) (Fl and/or Fl+1) teaches receiving input data, and that Song’s feature map tensor(s) (Fl and/or Fl+1) teach a plurality of outputs arranged in an array with dimensions of, for example, [ Hl × Wl × Cl ] for a feature map slice of Fl.)
compare dimensions of the first array with dimensions of the PE array; (Song, p. 4, § 3, ¶ 1: “Assume we have two accelerators, the batch size is B = 32. Let us consider a fully-connected layer, where the number of input and output neurons are 70 and 100, respectively. Thus, the feature map Fl has a size of 32 ×70, the kernel (weight matrix) has a size of 70 ×100 and Fl+1 has a size of 32 ×100.” ¶ 3, § 3.1.1, p. 4: “For the one holding the rectangles with shadow lines, it performs a multiplication with a size of [16 ×70] → [70 ×100] ⇒ [16 ×100].” ¶ 1, § 3.1.2, p. 4: “In model parallelism, the kernel is partitioned, and feature maps are partitioned accordingly. In forward, each accelerator performs computation for the matrices Fl →Wl ⇒Fl+1 with sizes of [32 ×35] → [35 ×100] ⇒ [32 ×100].” FIG. 4(a)-(d): Caption on § 4, p. 7: “Figure 4: Overall view of (a) an HMC-based accelerator, (b) a row stationary processing unit, (c) an array of sixteen accelerators in H tree, and (d) the accelerator array in torus.”

(reproduction of Song FIG. 4(a)-(d) omitted)

The examiner first notes that Song’s array of accelerators as shown in FIG. 4(a), (b), (c), and/or (d) teaches a PE array.  The examiner also notes that Song’s one or more neurons, an accelerator having multiple input and output neurons and a kernel for convolution, a portion of an accelerator, or Song’s two-accelerator architecture teaches a PE, and that Song’s PE array thus comprises a plurality of PEs.  The examiner further notes that Song’s dimensions of the output feature tensor(s) (e.g., the output feature map tensor Fl+1) teach dimensions of the first array, and that Song’s kernel size (e.g., the original [70 x 100] dimensions or the halved [35 x 100] dimensions) teaches dimensions of the PE array.
 The examiner further notes that Song’s § 3.1 cited above teaches the example where given the output feature map Fl+1 having certain dimensions, comparing dimensions ([32 x 100]) of the output feature map Fl+1 and dimensions of the input feature map Fl ([32 x 70]) to determine whether and how to implement data parallelism (e.g., by using the input [16 x 70] as required by the output dimensions 32 x 100 and the number of input neurons) and model parallelism (e.g., by using the input of 32 x 35 as required by the output dimensions of 32 x 100 and one-half of input neurons) as described in §§ 3.1.1 and 3.1.2, supra.  Therefore, Song teaches the above limitation.)
partition, according to a result of the comparing, the first array into subarrays each having dimensions less than or equal to the dimensions of the PE array; and (Song, p. 4, § 3, ¶ 1: “Assume we have two accelerators, the batch size is B = 32. Let us consider a fully-connected layer, where the number of input and output neurons are 70 and 100, respectively. Thus, the feature map Fl has a size of 32 ×70, the kernel (weight matrix) has a size of 70 ×100 and Fl+1 has a size of 32 ×100.”
FIG. 1(a) and § 3.1.1, p. 4 “Data Parallelism”:

(reproduction of Song FIG. 1(a) omitted)

p. 4, § 3.1.1, ¶ 3: “For the one holding the rectangles with shadow lines, it performs a multiplication with a size of [16 ×70] → [70 ×100] ⇒ [16 ×100].”  
The examiner first notes that Song’s accelerator or the two accelerators connected for parallelism teach a PE array. The examiner also notes that Song teaches model parallelism and data parallelism by comparing dimensions of the PE array with dimensions of the first array (see Song, supra).  The examiner further notes that the cited figure and passages teach that Song partitions the first array originally having dimensions of [32 x 100] into two [16 x 100] subarrays such that Song’s data parallelism described in § 3.1.1 may use two PE arrays, each of which processes the respective input data. Further, the respective input data has dimensions of [16 x 70] due to halving the input data for partitioning the first array.  Moreover, a corresponding kernel having the dimensions of [70 x 100] is determined so that the product produces [16 x 100] (e.g., [16 x 70] x [70 x 100] = [16 x 100]) as partitioned.  The examiner also notes that in Song’s data parallelism, the PE array uses 70 input neurons (hence [16 x 70] for the input) and 100 output neurons (hence [70 x 100] for the kernel) and hence has the dimensions of [70 x 100], which are larger than those of the partitioned output subarray of [16 x 100].  Each subarray thus has dimensions less than or equal to the dimensions of the PE array. Therefore, the examiner asserts that Song teaches the above limitation.)
assign a first group of PEs in the PE array, to a first one of the subarrays, and (Song at p. 4, § 3.1.1, ¶ 3: “In forward, each accelerator performs the computation in Equation 1”; and “For the one holding the rectangles with shadow lines, it performs a multiplication with a size of [16 ×70] → [70 ×100] ⇒ [16 ×100].” The examiner notes that halving the feature map tensor (Fl) from the original dimensions [32 x 70] into two inputs each having the dimensions of [16 x 70] and respectively processing these two [16 x 70] inputs with Song’s two accelerators to generate two [16 x 100] outputs and to populate the respective portions of the output array (e.g., the “first array”) teaches assigning a first group of PEs (e.g., one of the two accelerators) to a first one of the subarrays (e.g., one-half of the feature map tensor Fl+1).)
wherein each PE of the first group of PEs assigned to the first one of the subarrays, is configured to generate a corresponding output of the plurality of outputs using a portion of the input data. (Song at Eq. 1, p. 2: Fl+1 = f (Fl ⊗ Wl)      (1).  ¶ 3, § 3.1.1, p. 4 cited immediately above: “In forward, each accelerator performs the computation in Equation 1”; and “For the one holding the rectangles with shadow lines, it performs a multiplication with a size of [16 ×70] → [70 ×100] ⇒ [16 ×100].” ¶ 1, § 3.1.2, p. 4 cited immediately above: “In model parallelism, the kernel is partitioned, and feature maps are partitioned accordingly. In forward, each accelerator performs computation for the matrices Fl →Wl ⇒Fl+1 with sizes of [32 ×35] → [35 ×100] ⇒ [32 ×100].”
The examiner first notes that one or more neurons, an accelerator in Song’s multi-accelerator architecture, or a portion of an accelerator teaches a PE that receives the corresponding input from the input feature map Fl.  The examiner further notes that Song’s multi-accelerator architecture or a portion thereof (e.g., a processing unit as shown in FIG. 4(b), an accelerator having multiple neurons as shown in FIG. 4(c), etc.) teaches a PE array which thus comprises a plurality of PEs. The examiner thus notes that Song teaches that each PE in a first group is assigned to a first subarray such as the subarray that processes the input having a size of [16 x 70].  The examiner further notes that each PE in Song’s accelerator performs its respective operation for the matrix or tensorial product (⊗) of a row in the input feature map tensor Fl and a column in the kernel or weight (Wl) according to Eq. (1) to generate and populate an output element into a corresponding portion (e.g., the first one of the subarrays) of the output array (e.g., the output feature map Fl+1) and thus teaches the above limitation.)

Claim(s) 2-8, 10, 12-18, and 20 is/are rejected under 35 U.S.C. 103 as being unpatentable over Song et al., HyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array (January 7, 2019) (hereinafter Song) in view of Lu et al., FlexFlow: A Flexible Dataflow Accelerator Architecture for Convolutional Neural Networks (2017) (hereinafter Lu).
 
With respect to claim 2, Song teaches the method according to claim 1 but does not appear to explicitly teach:
wherein: the PE array is a two-dimensional M×N array wherein each of M and N is an integer greater than 1, and 
the partitioning of the first array comprises partitioning the first array into one or more of an M×N array, an M×N/2 array, an M/2×N array, or an M/2×N/2 array.
Lu does, however, teach: 
wherein: the PE array is a two-dimensional M×N array wherein each of M and N is an integer greater than 1, and (Lu at ¶ 1, § 3.2, p. 555: “Parallelism. It reaps neuron parallelism (NP), two neuron related loops r and c are unrolled (Tr = 3, Tc = 3), as Figure 5(b1) shown.” ¶ 1, § 3.3, p. 556: “Parallelism. It reaps feature map parallelism (FP), and two feature map related loops m and n are unrolled (Tm = 4, Tn = 2)”. FIG. 5(b)(2) and 5(c)(2): 

(reproduction of Lu FIG. 5(b)(2) and FIG. 5(c)(2) omitted)

The examiner notes that each of Lu’s Tr x Tc PE array illustrated in FIG. 5(b)(2) where Tr = 3 and Tc = 3 as well as the Tm x Tn PE array illustrated in FIG. 5(c)(2) where Tm = 4 and Tn = 2 teaches an MxN array where each of M and N is an integer greater than 1 as claimed.)
the partitioning of the first array comprises partitioning the first array into one or more of an M×N array, an M×N/2 array, an M/2×N array, or an M/2×N/2 array. (Lu at ¶ 3, § 2.2, p. 554: “Feature map Parallelism (FP), the feature map related loops m and n are unrolled with factors < Tm, Tn>. Tm output feature maps and Tn input feature maps are processed at a time.” ¶ 4, § 4.3, p. 559: “Moreover, the parallelism logically divides the PE array into Tm x Tn groups, and each group includes (Ti x Tj) x (Tr x Tc) PEs.”  § 5, p. 560: “constraints (1)”:

(reproduction of Lu Constraints (1) omitted)

where Tm and Tn denote the tiling unrolling factors that enable Lu’s neural network to process Tm output neurons and Tn input neurons in a single clock cycle; N denotes the number of input feature maps; M denotes the number of output feature maps; K denotes the size of the kernel; and K’ denotes the kernel size of next CONV layer. ¶ 2, § 4.2, p. 558: “The complementary effect is better explained with an example, as shown in Figure 8. For C1 layer, we use high SP (Tj = 4) to complement low FP (Tn = 1) occupying the intra-row PEs. And medium NP (Tc = 2) and FP (Tm = 2) are combined to occupy all PE rows. Overall the unrolling mode for C1 should be < Tm = 2, Tr = 1, Tc = 2, Tn = 1, Ti = 1, Tj = 4 >. Similarly, for C2 layer, the unrolling mode was configured to <Tm = 2, Tr = 1, Tc = 2, Tn = 2, Ti = 1, Tj = 2>. By doing so, the PEs for both C1 and C2 are fully utilized.”  § 4.3, last paragraph, p. 559: “complementary parallelism principle logically divides the PE array into Tm × Tn groups, and each group includes (Ti × Tj ) × (Tr × Tc ) PEs.”
The examiner notes that Tm and Tn  respectively indicate that Tm number of output feature maps and Tn number of input feature maps are processed at a time.  See Lu at ¶ 3, § 2.2, p. 554. That is, an input feature map in Song is divided into Tn number of input sub-maps each of which is processed by a processing element (PE) array (e.g., a processing array or accelerator or a subarray thereof in Song’s architecture) to fully utilize the processing elements. Therefore, Lu divides a PE array into Tm × Tn groups because each group of processing elements processes one input feature map at a time to generate its respective output for a total of Tm × Tn outputs from the Tm × Tn PE subarrays.  Moreover, Lu teaches that Tm and Tn may have the value of 1, 2, 16, etc.  See Lu at second paragraph, left-hand column, § 4.2, p. 558.  Therefore, Tn = 1 and Tm = 1 teaches dividing the first array into M/1 x N/1 and hence M x N for the first array; Tn = 1 and Tm = 2 teaches dividing the first array into M/2 x N/1 similar to Song’s data parallelism and hence M/2 x N for the first array; Tn = 2 and Tm = 1 teaches dividing the first array into M/1 x N/2 and hence M x N/2 for the first array; and Tn = 2 and Tm = 2 teaches dividing the first array into M/2 x N/2 and hence M/2 x N/2 for the first array.  Therefore, Lu teaches the above limitation.)
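For illustration only, the mapping from the unrolling factors (Tm, Tn) to the four claimed subarray shapes can be enumerated in Python. The function name partition_shape and the values M = 8 and N = 6 are hypothetical choices made for this sketch and are not taken from Lu or Song; the sketch assumes each factor divides the corresponding dimension evenly.

```python
# Illustrative sketch only; partition_shape, M = 8, and N = 6 are
# hypothetical and are not taken from Lu or Song.

def partition_shape(M, N, Tm, Tn):
    """Return the (M/Tm) x (N/Tn) subarray shape for unrolling factors Tm, Tn."""
    if M % Tm != 0 or N % Tn != 0:
        raise ValueError("this sketch assumes the factors divide evenly")
    return (M // Tm, N // Tn)


M, N = 8, 6  # assumed dimensions, each an integer greater than 1
shapes = {
    (Tm, Tn): partition_shape(M, N, Tm, Tn)
    for Tm in (1, 2)
    for Tn in (1, 2)
}

for (Tm, Tn), shape in shapes.items():
    print(f"Tm={Tm}, Tn={Tn} -> {shape[0]} x {shape[1]}")
# Tm=1, Tn=1 -> 8 x 6   (M x N)
# Tm=1, Tn=2 -> 8 x 3   (M x N/2)
# Tm=2, Tn=1 -> 4 x 6   (M/2 x N)
# Tm=2, Tn=2 -> 4 x 3   (M/2 x N/2)
```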
Song and Lu are analogous art because both pertain to parallelism for neural networks with an array of neural network accelerators.  
It would have been obvious for a person of ordinary skill in the art prior to the effective filing date to combine Song’s “HyPar” or hybrid parallelism (see Song, supra) with Lu’s  partitioning a PE array having the dimensions greater than 1 x 1 (see Lu, supra).  The modification not only allows explicit computation of computing resource utilization under any feasible unrolling/partitioning factors through Lu’s FlexFlow partitioning mechanism (Lu at ¶ 2, § 5, p. 560: “Given a CONV layer and a convolutional unit, we can get the computing resource utilization under any feasible unrolling factors. We use PE cycle to portray the resource utilization. The number of PE cycles for computation to the total number of PE cycles ratio, features the computing resource utilization. For simplicity, we compute PE row utilization (Ur) and PE column utilization (Uc) of the convolutional unit, which are calculated by Equation 2 and Equation 3. The total utilization (Ut) can be calculated by Ur × Uc.”) but also provides superior computing resource utilization (Lu at ¶ 1, § 6.2.2, p. 561: “Computing resource utilization of each baseline is shown in Figure 15. FlexFlow obtains over 80% resource utilization across all workloads. Retaining flexible dataflows which reaping mixture of parallelisms directly contribute to the superior resource utilization.”) 
 
With respect to claim 3, Song modified by Lu teaches the method according to claim 1, and Song further teaches: 
assigning the first subarray to the first group of PEs in the PE array, and the second subarray to a second group of PEs in the PE array, (Song at ¶ 1, § 3, p. 4: “Assume we have two accelerators, the batch size is B = 32. Let us consider a fully-connected layer, where the number of input and output neurons are 70 and 100, respectively. Thus, the feature map Fl has a size of 32 ×70, the kernel (weight matrix) has a size of 70 ×100 and Fl+1 has a size of 32 ×100.” ¶ 1, § 3.1.2, p. 4: “In model parallelism, the kernel is partitioned, and feature maps are partitioned accordingly. In forward, each accelerator performs computation for the matrices Fl →Wl ⇒Fl+1 with sizes of [32 ×35] → [35 ×100] ⇒ [32 ×100].”  FIG. 1, p. 4: 

(reproduction of Song FIG. 1 omitted)

The examiner notes that Song’s teaching that each portion of an accelerator (where the kernel is partitioned from [70 x 100] to two different instances of [35 x 100] as shown in FIG. 1(b)) performs the above computation (convolution) for Fl with the input having the size of [32x35] (from the input of [32 x 70] for a batch size of 32 and 70 input neurons) and the corresponding kernel (e.g., kernel having the size of [35 x 100] in § 3.1.2 and FIG. 1(b)) to generate the convolution output (e.g., the output having the dimensions of [32 x 100] in § 3.1.2 and FIG. 1(b)) while the remaining portion (e.g., the remaining input neurons, the corresponding kernel, and the convolution of the input feature map and the corresponding kernel) of the accelerator may similarly process another input feature map to generate another output teaches that each accelerator having multiple processing units or elements is assigned a respective sub-array.)
wherein positions of the first group of PEs in a second dimension different from the first dimension are different from positions of the second group of PEs in the second dimension. (Song at ¶ 2, § 5, p. 7: “For the PUs, as shown in Figure 4 (b), we implement a row stationary design as [15]. In such design, weight rows (green) are shared by processing engines horizontally, feature map rows (blue) are shared by processing engines diagonally, and partial sum rows (red) are accumulated vertically.” ¶ 2, § 6, p. 8: “The PUs used in the evaluation have an Eyeriss-like [15] row stationary architecture, and each processing unit has 168 (12 × 14) processing engines”; and FIG. 4(b) Row stationary PU:

(reproduction of Song FIG. 4(b) omitted)

The examiner notes that each of Song’s two accelerators has multiple processing engines, and that each of Song’s two accelerators and each processing engine are located at different positions.)
Song does not appear to explicitly teach: 
determining whether a size of the first array in a first dimension is greater than a size of the PE array in the first dimension; and 
in response to determining that the size of the first array in the first dimension is greater than the size of the PE array in the first dimension: 
partitioning the first array into the first one and a second one of the subarrays; and 

Lu does, however, teach: 
(Lu at ¶ 3, § 2.2, p. 554: “Feature map Parallelism (FP), the feature map related loops m and n are unrolled with factors < Tm, Tn>. Tm output feature maps and Tn input feature maps are processed at a time.” ¶ 4, § 4.3, p. 559: “Moreover, the parallelism logically divides the PE array into Tm x Tn groups, and each group includes (Ti x Tj) x (Tr x Tc) PEs.”  ¶ 1, § 5, p. 560:  “In this section, we describe how to determine the parallel type and degree for each CONV layer, i.e. to determine the unrolling factors < Tm, Tn, Tr, Tc, Ti, Tj >. Given a CONV layer, assume that the number of output feature maps is M, the number of input feature maps is N, the size of one output feature map is S, the size of one kernel is K, the kernel size of next CONV layer is K’, and the pooling window size of next POOL is P”; and “If mapping the CONV layer to a convolutional unit with D×D PEs, the space of all feasible factors should meet the following constraints.”

(reproduction of Lu Constraints (1) omitted)

¶ 2, § 4.2, p. 558: “The complementary effect is better explained with an example, as shown in Figure 8. For C1 layer, we use high SP (Tj = 4) to complement low FP (Tn = 1) occupying the intra-row PEs. And medium NP (Tc = 2) and FP (Tm = 2) are combined to occupy all PE rows. Overall the unrolling mode for C1 should be < Tm = 2, Tr = 1, Tc = 2, Tn = 1, Ti = 1, Tj = 4 >. Similarly, for C2 layer, the unrolling mode was configured to <Tm = 2, Tr = 1, Tc = 2, Tn = 2, Ti = 1, Tj = 2>. By doing so, the PEs for both C1 and C2 are fully utilized.”  § 4.3, last paragraph, p. 559: “complementary parallelism principle logically divides the PE array into Tm × Tn groups, and each group includes (Ti × Tj ) × (Tr × Tc ) PEs.” FIG. 13, p. 560: 

(reproduction of Lu FIG. 13 omitted)

The examiner notes that when Lu determines that the size of the first array is greater than the size of the PE array in a dimension (e.g., Tm for the first array (having the size Tm x Tn) in FIG. 13(e) > D for the PE array (having the size of D x D) before partitioning in FIG. 13(c)), Tm x Tr x Tc <= D in Constraints (1) is violated because Tm > D yet both Tr and Tc are integers that are at least 1. In this case, Lu teaches partitioning the first array (and hence the PE array) into Tm x Tn groups each having (Ti x Tj) x (Tr x Tc) PEs as shown in FIG. 13(e) and described in ¶ 2, § 4.2, p. 558 so that Tn x Ti x Tj <= D and Tm x Tr x Tc <= D in Constraints (1) are both satisfied.  Therefore, the examiner asserts that Lu teaches the above limitation.)
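For illustration only, the two inequalities from Constraints (1) that are relied on above can be checked in Python against the C1 and C2 unrolling modes quoted from Lu § 4.2. The function names and the value D = 4 are assumptions made for this sketch; Lu's complete constraint set is not reproduced here, and only the two inequalities quoted in this Office action are tested.

```python
# Illustrative sketch only; function names and D = 4 are assumptions.
# Only the two inequalities quoted above are checked, not Lu's full
# Constraints (1).

def satisfies_quoted_constraints(t, D):
    """Check Tm*Tr*Tc <= D and Tn*Ti*Tj <= D for a D x D PE array."""
    row_factor = t["Tm"] * t["Tr"] * t["Tc"]        # occupies the PE rows
    intra_row_factor = t["Tn"] * t["Ti"] * t["Tj"]  # occupies PEs within a row
    return row_factor <= D and intra_row_factor <= D


def grouping(t):
    """Return (number of groups, PEs per group) per Lu § 4.3:
    Tm x Tn groups, each with (Ti x Tj) x (Tr x Tc) PEs."""
    groups = t["Tm"] * t["Tn"]
    pes_per_group = (t["Ti"] * t["Tj"]) * (t["Tr"] * t["Tc"])
    return groups, pes_per_group


# Unrolling modes quoted from Lu § 4.2 for layers C1 and C2.
C1 = dict(Tm=2, Tr=1, Tc=2, Tn=1, Ti=1, Tj=4)
C2 = dict(Tm=2, Tr=1, Tc=2, Tn=2, Ti=1, Tj=2)
D = 4  # assumed PE array dimension (D x D = 16 PEs)

print(satisfies_quoted_constraints(C1, D), grouping(C1))  # True (2, 8)
print(satisfies_quoted_constraints(C2, D), grouping(C2))  # True (4, 4)

# With Tm > D, the first inequality fails even when Tr = Tc = 1.
too_large = dict(Tm=5, Tr=1, Tc=1, Tn=1, Ti=1, Tj=1)
print(satisfies_quoted_constraints(too_large, D))         # False
```

Under the assumed D = 4, both quoted modes account for 16 PEs in total (2 groups of 8 and 4 groups of 4), which is consistent with the quoted statement that the PEs for C1 and C2 are fully utilized.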
 
in response to determining that the size of the first array in the first dimension is greater than the size of the PE array in the first dimension: (Lu at ¶ 3, § 2.2, p. 554, ¶¶ 2-4, § 4.2, ¶ 4, § 4.3, p. 559, ¶ 1, § 5, p. 560, § 5, p. 560: “constraints (1)”.  ¶ 2, § 4.2, p. 558, supra teaches that Lu determines whether a size of the first array in a first dimension is greater than or equal to a size of the PE array in the first dimension “to determine the parallel type and degree for each CONV layer”.  See also FIG. 13 (reproduction of figure omitted). 
The examiner notes that Constraints (1) require the size of a first array in a first dimension (e.g., Tn x Ti x Tj for the vertical dimension of the output array as shown in FIG. 13(c)) is smaller than or equal to a dimension (e.g., “D”) of the PE array. The examiner further notes that Lu’s determining the unrolling factors pertaining to the input neurons (e.g., Tn x Ti x Tj as taught in ¶ 4, § 4.3 cited above) and output neurons (e.g., Tm x Tr x Tc as taught in ¶ 4, § 4.3 cited above) teaches comparing whether the size of the first array in a first dimension (Tn x Ti x Tj as shown in FIG. 13(e)) is greater than or equal to the size of the PE array (“D”) in the first dimension because Lu determines such unrolling factors (which pertains to, for example, the vertical dimension of the output array in FIG. 13(e)) to satisfy Constraints (1) pertaining to the size (“D x D”) of the PE array.  The examiner also notes that Lu’s determining the parallel type and degree by determining the aforementioned unrolling factors teaches determining parallelism in response to determining that the size of the first array in the first dimension is greater than the size of the PE array in the first dimension as recited. )
 
partitioning the first array into the first one and a second one of the subarrays; and (Lu at ¶ 4, § 4.3, p. 559: “Moreover, the parallelism logically divides the PE array into Tm x Tn groups, and each group includes (Ti x Tj) x (Tr x Tc) PEs.”  § 5, p. 560: “Constraints (1)”.  FIG. 13 (reproduction of figure omitted).  The examiner notes that, when a size of the first array in a first dimension (e.g., the total input neuron (Tn x Ti x Tj) as shown in FIG. 13(e) having a total number of rows of (Tn x Ti x Tj) that corresponds to the total number of input neurons) is greater than a dimension (D) of the PE array, Lu satisfies Constraints (1) by setting the unrolling factor Tn (and/or Ti and Tj) to be at most N or at most D (because the smallest integer for any of the unrolling factors is “1”), depending on whether D is greater than or smaller than N.  Either way (D being larger or smaller than N), the input feature map will be partitioned into more than one sub-array because the total number of input neurons (Tn x Ti x Tj) is greater than one dimension of the PE array. Because each group having (Ti x Tj) processing elements for the input neurons and (Tr x Tc) for the output neurons (see ¶ 4, § 4.3, p. 559 cited above) generates and populates a corresponding output subarray (e.g., as shown in FIG. 13(e)), Lu thus teaches partitioning the first array into a first one and a second one of the subarrays for a first group of PEs and a second group of PEs, respectively.)
Song and Lu are analogous art because both pertain to parallelism for neural networks with an array of neural network accelerators.  
It would have been obvious for a person of ordinary skill in the art prior to the effective filing date to combine Song’s “HyPar” or hybrid parallelism (see Song, supra) with Lu’s partitioning of the first array into two separate subarrays when the size of the first array is determined to be greater than or equal to the size of the PE array in one dimension (see Lu, supra). The modification not only explicitly computes the resource utilization under any feasible unrolling factors with Lu’s FlexFlow partitioning mechanism (Lu at ¶ 2, § 5, p. 560: “Given a CONV layer and a convolutional unit, we can get the computing resource utilization under any feasible unrolling factors. We use PE cycle to portray the resource utilization. The number of PE cycles for computation to the total number of PE cycles ratio, features the computing resource utilization. For simplicity, we compute PE row utilization (Ur) and PE column utilization (Uc) of the convolutional unit, which are calculated by Equation 2 and Equation 3. The total utilization (Ut) can be calculated by Ur × Uc.”) but also provides superior computing resource utilization with the partitioning mechanism based on the aforementioned unrolling factors (Lu at ¶ 1, § 6.2.2, p. 561: “Computing resource utilization of each baseline is shown in Figure 15. FlexFlow obtains over 80% resource utilization across all workloads. Retaining flexible dataflows which reaping mixture of parallelisms directly contribute to the superior resource utilization.”).
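For illustration only, and without characterizing Lu’s actual implementation or the claimed method, the partitioning discussed above can be sketched as follows. The sketch assumes a simple contiguous split of the rows of the first array into subarrays whose first dimension does not exceed the PE array dimension “D”; the function name partition_rows and the example sizes (24 rows, D = 16) are hypothetical and do not appear in Lu or Song.

```python
def partition_rows(num_rows: int, pe_dim: int) -> list[range]:
    """Split row indices 0..num_rows-1 into contiguous chunks of at most pe_dim rows.

    Hypothetical helper: each returned range stands for one subarray that
    fits within a PE array whose first dimension is pe_dim.
    """
    if num_rows <= 0 or pe_dim <= 0:
        raise ValueError("num_rows and pe_dim must be positive")
    return [
        range(start, min(start + pe_dim, num_rows))
        for start in range(0, num_rows, pe_dim)
    ]


# Assumed example: a first array with 24 rows and a PE array with D = 16.
subarrays = partition_rows(num_rows=24, pe_dim=16)
for index, rows in enumerate(subarrays):
    print(f"subarray {index}: rows {rows.start}..{rows.stop - 1}")
```

Under these assumed sizes the first array exceeds D in the first dimension, so more than one subarray results; when the array already fits (e.g., 16 rows with D = 16), a single subarray is returned.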
 
With respect to claim 4, Song modified by Lu teaches the method according to claim 3, and Lu further teaches: 
identifying a common portion of the input data to be used by both the first and second groups of PEs; and (Lu, ¶ 2, § 4.3, p. 558: “Neurons $I_{(0,0)}^{(0)}$, $I_{(0,1)}^{(0)}$, $I_{(0,2)}^{(0)}$, $I_{(0,3)}^{(0)}$ are assigned to PE(0,0), PE(0,1), PE(0,2), and PE(0,3), respectively. Neurons $I_{(0,1)}^{(0)}$, $I_{(0,2)}^{(0)}$, $I_{(0,3)}^{(0)}$, $I_{(0,4)}^{(0)}$ are assigned to PE(1,0), PE(1,1), PE(1,2), and PE(1,3), respectively. Obviously, the pairs of PE(0,1) and PE(1,0), PE(0,2) and PE(1,1), PE(0,3) and PE(1,2) receive identical neurons. It is not easy to utilize the data overlapping since PEs in each pair are not located in the same column or the same row.” ¶ 2, § 4.3, p. 558: “Relax alignment (RA) can exploit the non-aligned neurons by reordering the synapses”; and “Then, at clock t, we can compute output neuron $O_{(0,0)}^{(0)}$ by accessing neurons $I_{(0,4)}^{(0)}$, $I_{(0,1)}^{(0)}$, $I_{(0,2)}^{(0)}$, $I_{(0,3)}^{(0)}$ and synapses $K_{(0,3)}^{(0,0)}$, $K_{(0,0)}^{(0,0)}$, $K_{(0,1)}^{(0,0)}$, $K_{(0,2)}^{(0,0)}$ in the second PE row. In this way, the neurons $I_{(0,1)}^{(0)}$, $I_{(0,2)}^{(0)}$, $I_{(0,3)}^{(0)}$ overlapped across the two PEs are exploited by reordering the synapses. To simplify the interconnections, the synapses of whole kernel are replicated to each PE, which also benefits the reusability of synapses.”
The examiner notes that Lu teaches that the pairs of PE(0,1) and PE(1,0), PE(0,2) and PE(1,1), PE(0,3) and PE(1,2) receive identical neurons.  Therefore, for a pair of PEs (e.g., PE(0,1) and PE(1,0)), the input data for these neurons that are present in both PE(0,1) and PE(1,0) teaches a common portion of the input data.  Lu’s identifying the aforementioned common portion of the input data to be used by PE(0,1) and PE(1,0) to compute the output neuron thus teaches the above limitation.)
 
shifting the common portion of the input data into the first and second groups of PEs. (Lu at ¶ 2, § 4.3, p. 558: “Neurons $I_{(0,0)}^{(0)}$, $I_{(0,1)}^{(0)}$, $I_{(0,2)}^{(0)}$, $I_{(0,3)}^{(0)}$ are assigned to PE(0,0), PE(0,1), PE(0,2), and PE(0,3), respectively. Neurons $I_{(0,1)}^{(0)}$, $I_{(0,2)}^{(0)}$, $I_{(0,3)}^{(0)}$, $I_{(0,4)}^{(0)}$ are assigned to PE(1,0), PE(1,1), PE(1,2), and PE(1,3), respectively. Obviously, the pairs of PE(0,1) and PE(1,0), PE(0,2) and PE(1,1), PE(0,3) and PE(1,2) receive identical neurons. It is not easy to utilize the data overlapping since PEs in each pair are not located in the same column or the same row.” ¶ 2, § 4.3, p. 558: “Relax alignment (RA) can exploit the non-aligned neurons by reordering the synapses”; and “Then, at clock t, we can compute output neuron $O_{(0,0)}^{(0)}$ by accessing neurons $I_{(0,4)}^{(0)}$, $I_{(0,1)}^{(0)}$, $I_{(0,2)}^{(0)}$, $I_{(0,3)}^{(0)}$ and synapses $K_{(0,3)}^{(0,0)}$, $K_{(0,0)}^{(0,0)}$, $K_{(0,1)}^{(0,0)}$, $K_{(0,2)}^{(0,0)}$ in the second PE row. In this way, the neurons $I_{(0,1)}^{(0)}$, $I_{(0,2)}^{(0)}$, $I_{(0,3)}^{(0)}$ overlapped across the two PEs are exploited by reordering the synapses. To simplify the interconnections, the synapses of whole kernel are replicated to each PE, which also benefits the reusability of synapses.”
The examiner notes that the input data for the identical neurons in the aforementioned pair of PEs teaches the common portion of input data, and that Lu’s accessing these identical neurons for data to compute the output neuron teaches the above limitation.)
Song and Lu are analogous art because both pertain to parallelism for neural networks with an array of neural network accelerators.  
It would have been obvious for a person of ordinary skill in the art prior to the effective filing date to combine Song’s “HyPar” or hybrid parallelism (Song, ¶ 2, Abstract, supra) with Lu’s identifying and shifting common data into multiple groups of PEs (see Lu, supra). The modification not only solves the problems of conventional approaches that exhibit significantly overlapping yet non-aligned input neurons and kernels for output neurons (Lu, ¶ 2, § 4.3, p. 558: “Based on complementary parallelism mechanism, different PE rows are allocated to output neurons at the same or adjacent locations of multiple output feature maps, the input neurons and kernels for these output neurons are significantly overlapped to each other. As Figure 8(a2) shown, input neurons partially and kernels totally are overlapped between the first and the second PE rows. However, the overlapping is not aligned horizontally or vertically.”) but also simplifies neural network interconnections and optimizes data transfer in neural networks (Lu, ¶ 1, § 4.3, p. 558: “To simplify the interconnection, the key is to minimize the volume of data repeatedly transmitted. We use horizontal (kernels) and vertical (neurons) common data buses (CDB) to broadcast data (neuron and kernel lines in Figure 6). CDBs are realized as simple, pipelined, data-only buses that do not dictate overhead for address decoding, or complex control, hence scalable and easy to route in layout. Accordingly, we proposed two data transfer optimizations to CDB: Relax Alignment (RA) and Relax Synchronization (RS).”).
 
With respect to claim 5, Song modified by Lu teaches the method according to claim 3, and Song further teaches: 
identifying a first portion of the input data to be used by the first group of PEs, and a second portion of the input data to be used by the second group of PEs; (Song, p. 4, § 3.1.1, ¶ 1: “In data parallelism, a batch of data is partitioned into two parts, while the kernels (weight matrix) are duplicated. Each accelerator holds one part of the partitioned data and a complete copy of the kernel.” p. 4, § 3.1.2, ¶ 1: “In model parallelism, the kernel is partitioned, and feature maps are partitioned accordingly.” and Figure 1:

[Song, Figure 1: image media_image8.png, reproduction omitted]

The examiner notes that Song’s first accelerator that processes input data with a partitioned kernel in its two-accelerator architecture teaches a first group of PEs, and the second accelerator that processes the input data with another partitioned kernel (e.g., the remaining portion of the kernel that is partitioned in model parallelism) teaches a second group of PEs.  In addition or in the alternative, Song’s first accelerator that holds the first part of the batch of data and second accelerator holding the second part of the batch of data respectively teach a first group of PEs and a second group of PEs.  The examiner further notes that in Song’s two-accelerator example, the data from two input neurons and/or two of the eight synapses distributed to a first PE of four different PEs teaches a first portion of the input data to be used by the first group of PEs, and the data from the two input neurons and two different synapses distributed to a second PE teaches a second portion of the input data to be used by the second group of PEs.)
shifting the first portion of the input data into the first group of PEs; and (Song, p. 4, § 3.1, ¶ 1: “Assume we have two accelerators, the batch size is B = 32. Let us consider a fully-connected layer, where the number of input and output neurons are 70 and 100, respectively. Thus, the feature map Fl has a size of 32 ×70, the kernel (weight matrix) has a size of 70 ×100 and Fl+1 has a size of 32 ×100.”; p. 4, § 3.1.1, ¶ 3: “In forward, each accelerator performs the computation in Equation 1. Because f (.) is an element-wise operation, which only requires local data in the accelerator itself but does not require remote data from the other accelerator, we focus on the multiplication and represent Equation 1 as Fl → Wl ⇒ Fl+1. For the one holding the rectangles with shadow lines, it performs a multiplication with a size of [16 × 70] → [70 × 100] ⇒ [16 × 100].”; p. 4, FIG. 1(a) “data parallelism”: 

[Song, FIG. 1(a) “data parallelism”: image media_image9.png, reproduction omitted]

The examiner first notes that the claimed limitation, “shifting a first portion of the input data into the first group of PEs”, under its broadest reasonable interpretation, covers providing the first portion of the input data to the first group of PEs. The examiner also notes that an accelerator in Song’s two-accelerator architecture teaches a first group of PEs.  The examiner further notes that Song’s splitting the input feature map Fl of size 32 × 70 into two halves of size 16 × 70 each, and subsequently sending one of the two halves to one of the two accelerators as shown in FIG. 1(a) and described in the cited passages above, teaches shifting a first portion of data into a first group of PEs.)
shifting the second portion of the input data into the second group of PEs. (Song, p. 4, § 3.1, ¶ 1: “Assume we have two accelerators, the batch size is B = 32. Let us consider a fully-connected layer, where the number of input and output neurons are 70 and 100, respectively. Thus, the feature map Fl has a size of 32 ×70, the kernel (weight matrix) has a size of 70 ×100 and Fl+1 has a size of 32 ×100.”; p. 4, § 3.1.1, ¶ 3: “In forward, each accelerator performs the computation in Equation 1. Because f (.) is an element-wise operation, which only requires local data in the accelerator itself but does not require remote data from the other accelerator, we focus on the multiplication and represent Equation 1 as Fl → Wl ⇒ Fl+1. For the one holding the rectangles with shadow lines, it performs a multiplication with a size of [16 × 70] → [70 × 100] ⇒ [16 × 100].”; p. 4, FIG. 1(a) “data parallelism”, supra.
The examiner first notes that the claimed limitation, “shifting a second portion of the input data into the second group of PEs”, under its broadest reasonable interpretation, covers providing the second portion of the input data to the second group of PEs. The examiner also notes that the other accelerator (other than the accelerator interpreted as the first group of PEs above) in Song’s two-accelerator architecture teaches a second group of PEs.  The examiner further notes that Song’s sending the other of the two halves to the other of the two accelerators, as shown in FIG. 1(a) and described in the cited passages above, teaches shifting a second portion of data into a second group of PEs.)
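For illustration only, the data-parallel partitioning described in the cited Song passages (a 32 × 70 feature map split into two 16 × 70 halves, each multiplied by a full copy of the 70 × 100 kernel) can be sketched as follows. This is a minimal sketch and not Song's implementation: the helper names `matmul` and `shape` and the matrix contents are assumptions introduced here, and only the sizes come from Song.

```python
# Illustrative sketch only; helper names and data values are hypothetical
# and do not appear in Song. Only the matrix sizes are taken from Song, § 3.1.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    inner = len(b)
    cols = len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]

def shape(m):
    """Return (rows, columns) of a matrix given as a list of rows."""
    return (len(m), len(m[0]))

B, N_IN, N_OUT = 32, 70, 100  # batch size, input neurons, output neurons

# Arbitrary deterministic values; the actual data is not specified in Song.
feature_map = [[(r + c) % 5 for c in range(N_IN)] for r in range(B)]   # Fl: 32 x 70
kernel = [[(r * c) % 3 for c in range(N_OUT)] for r in range(N_IN)]    # Wl: 70 x 100

# Data parallelism: the batch is split in two and the kernel is duplicated.
first_portion = feature_map[: B // 2]    # 16 x 70, provided to the first accelerator
second_portion = feature_map[B // 2:]    # 16 x 70, provided to the second accelerator

out_first = matmul(first_portion, kernel)    # 16 x 100
out_second = matmul(second_portion, kernel)  # 16 x 100

# Stacking the two partial outputs reproduces the unpartitioned Fl+1 (32 x 100).
full_output = matmul(feature_map, kernel)
```

In this sketch each accelerator needs only its own half of the batch, which is the sense in which each portion of the input data is provided to a different group of PEs.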
  
With respect to claim 6, Song modified by Lu teaches the method according to claim 1, and Song further teaches: 
wherein the plurality of outputs are outputs of convolution operations for the layer of the neural network. (Song at § 3.1.1, p. 4: “For a convolutional layer l, we use Fl to represent feature maps of this layer” and “[w]e use ⊗ to denote convolutions. The inference (forward propagation) can be represented as, Fl+1 = f (Fl ⊗ Wl) (1).” The examiner notes that Song’s feature map tensor(s) (e.g., Fl) teaches a plurality of outputs that are outputs of “a convolutional layer l” as taught above.)
 
 
With respect to claim 12, it is substantially similar to claim 2 and is rejected in the same manner, the same art and reasoning applying. 
 
With respect to claim 13, it is substantially similar to claim 3 and is rejected in the same manner, the same art and reasoning applying. 
 
With respect to claim 14, it is substantially similar to claim 4 and is rejected in the same manner, the same art and reasoning applying. 
 

 
With respect to claim 16, it is substantially similar to claim 6 and is rejected in the same manner, the same art and reasoning applying. 
 
 
10.	Claims 7-10 and 17-20 stand rejected under 35 U.S.C. 103 as being unpatentable over Song et al., HyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array (January 7, 2019) (hereinafter Song) in view of Lu et al., FlexFlow: A Flexible Dataﬂow Accelerator Architecture for Convolutional Neural Networks (2017) (hereinafter Lu) and further in view of Shawahna et al., FPGA-Based Accelerators of Deep Learning Networks for Learning and Classification: A Review (Dec. 28, 2018) (hereinafter Shawahna).

With respect to claim 7, Song modified by Lu teaches the method according to claim 6, and Song further teaches: 
wherein: the input data comprises a first plurality of input values and a second plurality of input values, and (Song, ¶ 1, § 2.1, p. 2: “The inference of deep neural networks is a forward progress of input data (typically images) from the first layer to the last layer.” ¶ 1, § 2.1, p. 2: “The size of the feature map slice is [ Hl × Wl × Cl]. Thus, Fl is of size B ×[ Hl × Wl × Cl ]”; and “[t]he inference (forward propagation) can be represented as, Fl+1 = f (Fl ⊗ Wl)  (1).” ¶ 1, § 3.1, p. 4: “Assume we have two accelerators, the batch size is B = 32. Let us consider a fully-connected layer, where the number of input and output neurons are 70 and 100, respectively. Thus, the feature map Fl has a size of 32 ×70, the kernel (weight matrix) has a size of 70 ×100 and Fl+1 has a size of 32 ×100.”  ¶ 1, § 3.1.2, p. 4: “In model parallelism, the kernel is partitioned, and feature maps are partitioned accordingly. In forward, each accelerator performs computation for the matrices Fl →Wl ⇒Fl+1 with sizes of [32 ×35] → [35 ×100] ⇒ [32 ×100].”  ¶ 1, § 3.3, p. 5: “At each clock cycle, it handles Tn input feature maps and Tm output feature maps (i.e. Multiple Feature maps), one neuron of each output feature map (i.e. Single Neuron), and one single synapse (i.e. Single Synapse) of each kernel.”
The examiner notes that Song’s input feature map (Fl) and/or image data teaches a first plurality of input values, and that Song’s weight matrix teaches a second plurality of input values.  The examiner further notes that Song’s accelerator performing the matrix operation (e.g., Fl →Wl ⇒Fl+1 or Eq. (1) cited above) on the input feature map (Fl) and the weight matrix to generate the output feature map (Fl+1) teaches that both the input feature map (Fl) and the weight matrix are input data to the accelerator.  Therefore, Song teaches the above limitation.)
 
generating, by each PE of the first group of PEs, the corresponding output of the plurality of outputs, comprises: (Song, ¶ 1, § 3.1.2, p. 4: “In model parallelism, the kernel is partitioned, and feature maps are partitioned accordingly. In forward, each accelerator performs computation for the matrices Fl →Wl ⇒Fl+1 with sizes of [32 ×35] → [35 ×100] ⇒ [32 ×100].”  The examiner notes that each accelerator of the two-accelerator example in Song’s § 3 receiving the input portion Fl and generating the respective output Fl+1 teaches each PE generating a corresponding output as claimed, because each PE (e.g., neuron) performs the matrix operation (Fl →Wl ⇒ Fl+1 cited immediately above) by receiving at least one input value (e.g., an entry in the input feature map (Fl) cited above) from an input neuron and multiplying the received input value by the corresponding weight (an entry in Wl cited above) of the connection between the PE and the input neuron to compute the respective product. These respective products are then summed to generate an entry in the output feature map (Fl+1). Therefore, the examiner asserts that Song teaches the above limitation.)

receiving, by said each PE, first values of the first plurality of input values and second values of the second plurality of input values; (Song at ¶ 1, § 3.1.2, p. 4: “In model parallelism, the kernel is partitioned, and feature maps are partitioned accordingly. In forward, each accelerator performs computation for the matrices Fl →Wl ⇒Fl+1 with sizes of [32 ×35] → [35 ×100] ⇒ [32 ×100].”  The examiner notes that each accelerator of Song’s two-accelerator example in Song’s § 3 receiving the respective input portion Fl (e.g., of size [32 x 35] from the original feature map having the size of [32 x 70]) teaches a first PE receiving first values of the first plurality of input values.  The examiner further notes that each PE’s receiving the kernel (e.g., Wl above) teaches receiving second values of the second plurality of input values.)
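For illustration only, the model-parallel sizes quoted from Song ([32 × 35] → [35 × 100] ⇒ [32 × 100] per accelerator) can be sketched as follows. This is a minimal sketch under stated assumptions: the helper `matmul`, the variable names, and the data values are hypothetical, and the element-wise summation of the two partial results is shown only to confirm that the partitioned computation agrees with the unpartitioned product.

```python
# Illustrative sketch only; names and data values are hypothetical and do not
# appear in Song. Only the matrix sizes are taken from Song, § 3.1 and § 3.1.2.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    inner = len(b)
    cols = len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]

B, N_IN, N_OUT = 32, 70, 100
fmap = [[(r + 2 * c) % 7 for c in range(N_IN)] for r in range(B)]     # Fl: 32 x 70
kernel = [[(r + c) % 4 for c in range(N_OUT)] for r in range(N_IN)]   # Wl: 70 x 100

half = N_IN // 2
# Model parallelism: the kernel is split by rows, the feature map by columns.
fmap_a = [row[:half] for row in fmap]    # 32 x 35, received by the first accelerator
fmap_b = [row[half:] for row in fmap]    # 32 x 35, received by the second accelerator
kernel_a = kernel[:half]                 # 35 x 100
kernel_b = kernel[half:]                 # 35 x 100

partial_a = matmul(fmap_a, kernel_a)     # 32 x 100 partial sums
partial_b = matmul(fmap_b, kernel_b)     # 32 x 100 partial sums

# Adding the partial sums element-wise gives the same Fl+1 as the full product.
combined = [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(partial_a, partial_b)]
full = matmul(fmap, kernel)
```

Each accelerator in this sketch receives both a slice of the feature map and a slice of the kernel, so both operands act as input values to that accelerator.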

Song does not appear to explicitly teach:  

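storing, by said each PE, the first values and the second values in a first buffer memory and a second buffer memory of said each PE, respectively; and
generating a first dot product of the first values and the second values. 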

Lu does, however, teach: 
storing, by said each PE, the first values and the second values in a first buffer memory and a second buffer memory of said each PE, respectively; and (Lu, ¶ 2, § 4, p. 557: “There are four key components: a convolutional unit, a pooling unit, three on-chip buffers (two neuron buffers and one kernel buffer), and an instruction decoder.” ¶ 1, § 4.1, p. 557: “For PEs in FlexFlow (Figure 7 (a)), by contrast, operands are directly derived from on-chip buffers through vertical and horizontal buses to each PE, and buffered in randomly accessed local storages.” FIG. 7(a): 

[Lu, FIG. 7(a): image media_image10.png, reproduction omitted]

Last paragraph, left-hand column, p. 558: “Due to each mix of parallelism with a speciﬁc combination of inputs and outputs, the premise of enjoying the complementary effects is the ﬂexible dataﬂow with high data “routability”. We propose a hierarchical dataﬂow with low control overhead, as Figure 9 shown. The distribution layer can be viewed as an interconnection structure, which routes the data from on-chip buffer to PEs.” Also see FIG. 12 (reproduction omitted).  
The examiner first notes that FIG. 12(a) shows that Lu’s “kernel buffer” stores the values of the kernel (e.g., “synapse values” as taught in ¶ 2, § 2.1, p. 554). The examiner further notes that FIG. 7(a) shows that the PE performs multiplication (* in FIG. 7(a)) on the values of the kernel (“synapse values” as taught in ¶ 2, § 2.1, p. 554) from the “kernel local store” and the neuron data in the “neuron local store”. Moreover, the values of the kernel in the kernel local store are obtained “from kernel buffer”, and the input neuron values in the “neuron local store” are obtained “[f]rom neuron buffer” as explicitly shown in FIG. 7(a).  Therefore, the examiner asserts that Lu’s storing neuron input data values in an on-chip neuron buffer teaches each PE storing the first values in the first buffer memory, and that Lu’s storing the kernel values (e.g., synapse values) in the on-chip kernel buffer teaches each PE storing the second values in the second buffer memory of the PE.)
 
generating a first dot product of the first values and the second values. (Lu, FIG. 3: “$O_{(r,c)}^{(m)} \mathrel{+}= K_{(i,j)}^{(m,n)} \times I_{(r+i,\,c+j)}^{(n)}$”. The examiner notes that Lu’s FIG. 3 teaches computing a product of an element-by-element multiplication between two matrices (e.g., matrices K and I in Eq. (3)) and summation of these computed products (e.g., matrix O) and hence a dot product between the first values (e.g., values in row(s) of a feature map I) and the second values (e.g., the corresponding column kernel(s) in the kernel K shown in FIG. 3).)
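For illustration only, the accumulation recited in Lu's FIG. 3, O(r,c)(m) += K(i,j)(m,n) × I(r+i, c+j)(n), can be written as a loop nest, and each output entry can then be checked against an explicit dot product of a flattened input window and the flattened kernel. This is a minimal sketch under stated assumptions: stride 1, no padding, and square kernels are assumed here, and the function names are hypothetical rather than taken from Lu.

```python
# Illustrative sketch only; function names are hypothetical. Stride 1, no
# padding, and square kernels are assumptions made for this example.

def conv_layer(inputs, kernels):
    """inputs[n][r][c], kernels[m][n][i][j] -> outputs[m][r][c]."""
    num_out = len(kernels)
    num_in = len(inputs)
    k = len(kernels[0][0])
    rows = len(inputs[0]) - k + 1
    cols = len(inputs[0][0]) - k + 1
    out = [[[0] * cols for _ in range(rows)] for _ in range(num_out)]
    for m in range(num_out):                 # output feature maps
        for r in range(rows):                # output rows
            for c in range(cols):            # output columns
                for n in range(num_in):      # input feature maps
                    for i in range(k):       # kernel rows
                        for j in range(k):   # kernel columns
                            out[m][r][c] += kernels[m][n][i][j] * inputs[n][r + i][c + j]
    return out

def output_as_dot_product(inputs, kernels, m, r, c):
    """Compute one output entry as a dot product of two flattened vectors."""
    num_in = len(inputs)
    k = len(kernels[0][0])
    window = [inputs[n][r + i][c + j]
              for n in range(num_in) for i in range(k) for j in range(k)]
    weights = [kernels[m][n][i][j]
               for n in range(num_in) for i in range(k) for j in range(k)]
    return sum(w * x for w, x in zip(weights, window))

# One 3 x 3 input feature map and one 2 x 2 kernel (arbitrary example values).
example_inputs = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
example_kernels = [[[[1, 0], [0, 1]]]]
example_out = conv_layer(example_inputs, example_kernels)  # [[[6, 8], [12, 14]]]
```

The loop nest and the flattened dot product give the same value for every output position, which is the relationship the examiner's note relies on.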
Song and Lu are analogous art because both pertain to parallelism for neural networks with an array of neural network accelerators.  
It would have been obvious for a person of ordinary skill in the art prior to the effective filing date to combine Song’s generating a corresponding output by each PE with a first plurality of input values and a second plurality of input values (see Song, supra) with Lu’s first and second on-chip buffers that respectively store the first and second plurality of values as well as Lu’s convolution module (see Lu, supra).  The modification not only addresses the inefficient sequential shifting of data in conventional approaches with multiple clock cycles by providing randomly accessible local storage (Lu, p. 557, § 4.1, ¶ 1: “Compared with other designs, our design has more flexibility to cover multiple parallel types. For comparison, the PE of 2D-Mapping architecture is depicted in Figure 7 (b), the input neuron of this PE can only come from its left neighbor or down neighbor PEs, and consecutive neurons must belong to adjacent convolutional windows. Moreover, these neurons are shifted to left or up PEs sequentially in next few clock cycles since they are buffered in FIFOs. For PEs in FlexFlow (Figure 7 (a)), by contrast, operands are directly derived from on-chip buffers through vertical and horizontal buses to each PE, and buffered in randomly accessed local storages.”) but also tackles the highly computation-intensive requirement of convolutional layers with Lu’s feature map parallelism (FP), neuron parallelism (NP), and synapse parallelism (SP) (Lu, p. 554, § 2.1, right-hand column, ¶ 1: “It’s well-known that CONV layers are highly computation-intensive. For a typical CNN application, CONV layers take up more than 90% of the computation volume in both inference and training procedures. Fortunately, CONV computation exhibits intensive parallelism in feature map, neuron, and synapse levels (kernel level parallelism is intrinsically included in feature map and neuron levels)”; and p. 554, § 2.2, right-hand column, ¶¶ 2-5: “There are three types of parallelism according to different unrolling strategies for related loops”, “Feature map Parallelism (FP),” “Neuron Parallelism (NP),” and “Synapse Parallelism (SP)”.)

Song modified by Lu does not appear to explicitly teach:
generating, by multiplier and accumulator (MAC) circuitry of said each PE, a first dot product of the first values and the second values. 

Shawahna does, however, teach: 
generating, by multiplier and accumulator (MAC) circuitry of said each PE, a first dot product of the first values and the second values. (Shawahna at FIG. 2(a), p. 7826:

[Shawahna, FIG. 2(a): image media_image11.png, reproduction omitted]

¶ 2, left-hand column, p. 7826: “In summary, the convolution operation comprises four levels of loops; the output FMs loop (Loop-4), the loop across the input FMs (Loop-3), the loop along the dimensions of a single input FM (scan operation, Loop-2), and the kernel window size loop (multiply-and-accumulate (MAC) operation, Loop-1).” FIG. 6 (reproduction omitted) and ¶ 1, right-hand column, p. 7838: “Finally, operator-level parallelism is achieved by parallelizing the k × k MACs operations needed for convolution operation in convolutional layers or the n MACs needed for inner-product computation in fully connected layers.” The examiner first notes that Shawahna’s convolution operation in convolution layers teaches a dot product between the input x and the weight matrix W as shown in FIG. 6.  The examiner further notes that Shawahna’s inner-product computation also teaches a dot product of the first and second values.  The examiner thus asserts that Shawahna’s MAC circuitry that performs the convolution between the input (“x”) and the weight matrix (“W”) and sums the respective products as explicitly shown in FIG. 6 or performs a dot product computation as taught in Shawahna’s ¶ 1, right-hand column on p. 7838 teaches the above limitation.)
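For illustration only, the relationship between a multiply-and-accumulate (MAC) operation and a dot product noted above can be sketched as follows. This is a minimal sketch and not Shawahna's circuitry: the class `MacUnit` and the function `dot_product_with_mac` are hypothetical names introduced here, and the model performs one multiply and one add per step in software.

```python
# Illustrative sketch only; MacUnit and dot_product_with_mac are hypothetical
# names and do not appear in Shawahna.

class MacUnit:
    """Software model of a MAC unit: one multiply and one accumulate per step."""

    def __init__(self):
        self.acc = 0

    def step(self, a, b):
        """Multiply a by b and add the product to the running accumulator."""
        self.acc += a * b
        return self.acc

def dot_product_with_mac(first_values, second_values):
    """Compute a dot product by feeding value pairs to a single MAC unit."""
    if len(first_values) != len(second_values):
        raise ValueError("operand vectors must have the same length")
    mac = MacUnit()
    for a, b in zip(first_values, second_values):
        mac.step(a, b)
    return mac.acc

# Example: 1*4 + 2*5 + 3*6 = 32
example_result = dot_product_with_mac([1, 2, 3], [4, 5, 6])
```

After n MAC steps the accumulator holds the inner product of the two n-element operand vectors, which corresponds to the "n MACs needed for inner-product computation" in the quoted passage.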
Song, Lu, and Shawahna are analogous art because all three pertain to parallelism for neural networks with an array of neural network accelerators.  
It would have been obvious for a person of ordinary skill in the art prior to the effective filing date to combine Song’s “HyPar” or hybrid parallelism (Song at ¶ 2, Abstract), as modified by Lu’s “on-chip buffers” (see Lu, supra) and “CONV Operation” (see Lu, supra), with Shawahna’s “multiplier and accumulator (MAC) circuitry” and “loop unrolling” (see Shawahna, supra).  The modification with Shawahna’s MAC circuitry and loop unrolling for convolution operations allows for extremely fast and low power CNN implementations (Shawahna, ¶ 2, left-hand column, p. 7831: “Fortunately, recent digital signal processing (DSP)-oriented FPGAs include large numbers of multiply-and-accumulate (MAC) units which allow for extremely fast and low power CNN implementations.”).
 
With respect to claim 8, Song modified by Lu and Shawahna teaches the method according to claim 7, and Song further teaches: 
the first number of computed dot products are outputs of convolution operations for the layer of the neural network. (Song at ¶ 1, § 3.1, p. 4: “Assume we have two accelerators, the batch size is B = 32. Let us consider a fully-connected layer, where the number of input and output neurons are 70 and 100, respectively. Thus, the feature map Fl has a size of 32 ×70, the kernel (weight matrix) has a size of 70 ×100 and Fl+1 has a size of 32 ×100.” § 3.1.1, p. 4: “For a convolutional layer l, we use Fl to represent feature maps of this layer” and “[w]e use ⊗ to denote convolutions. The inference (forward propagation) can be represented as, Fl+1 = f (Fl ⊗ Wl) (1).” 
The examiner first notes that Song’s convolution (⊗) teaches multiplication of the first matrix (e.g., the feature map Fl) and the second matrix (e.g., the weight matrix Wl) and thus teaches an inner product of the first and second matrices for a layer. The examiner further notes that Song performs convolution (⊗) on the first matrix (e.g., feature map tensor(s) Fl) and the second matrix (e.g., weight matrix Wl) to produce the output feature map (Fl+1), and that the output feature map (Fl+1) teaches a plurality of outputs that are outputs of the convolution operation (⊗) performed on an input feature map (Fl) and the corresponding kernel tensor (Wl).)
Song does not appear to explicitly teach: 
wherein: a first number of sets of values, out of the first plurality of input values, are stored in the first buffer memory of the said each PE, and
a dot product of (i) each of the first number of sets of values stored in the first buffer memory of the said each PE and (ii) the second values stored in the second buffer memory of the said each PE, is computed by the MAC circuitry, and 

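Lu does, however, teach: 
wherein: a first number of sets of values, out of the first plurality of input values, are stored in the first buffer memory of the said each PE, and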
(Lu at ¶ 2, § 4, p. 557: “There are four key components: a convolutional unit, a pooling unit, three on-chip buffers (two neuron buffers and one kernel buffer), and an instruction decoder.” ¶ 1, § 4.1, p. 557: “For PEs in FlexFlow (Figure 7 (a)), by contrast, operands are directly derived from on-chip buffers through vertical and horizontal buses to each PE, and buffered in randomly accessed local storages.” FIG. 7(a): 

[Lu, FIG. 7(a): image media_image10.png, reproduction omitted]

Last paragraph, left-hand column, p. 558: “We propose a hierarchical dataﬂow with low control overhead, as Figure 9 shown. The distribution layer can be viewed as an interconnection structure, which routes the data from on-chip buffer to PEs.” Also see FIG. 12 (reproduction omitted).  
The examiner first notes that FIG. 12(a) shows that Lu’s “kernel buffer” stores the values of the kernel (e.g., “synapse values” as taught in ¶ 2, § 2.1, p. 554). The examiner further notes that FIG. 7(a) shows that the PE performs multiplication (* in FIG. 7(a)) on the values of the kernel (“synapse values” as taught in ¶ 2, § 2.1, p. 554) from the “kernel local store” and the neuron data in the “neuron local store”. Moreover, the values of the kernel in the kernel local store are obtained “from kernel buffer”, and the input neuron values in the “neuron local store” are obtained “[f]rom neuron buffer” as explicitly shown in FIG. 7(a).  Therefore, the examiner asserts that Lu’s storing neuron data in an on-chip neuron buffer teaches each PE storing a first number of sets of the first values in the first buffer memory of each PE where one or more neuron values for the respective input neurons for the PE illustrated in, for example, FIG. 7(a) teach a set of the first values.)
 
a dot product of (i) each of the first number of sets of values stored in the first buffer memory of the said each PE and (ii) the second values stored in the second buffer memory of the said each PE, and (Lu at FIG. 3: $O_{(r,c)}^{(m)} \mathrel{+}= K_{(i,j)}^{(m,n)} \times I_{(r+i,\,c+j)}^{(n)}$. ¶¶ 2 and 4 as well as FIGS. 7(a) and 12 cited immediately above teaches storing neuron data in neuron buffer/neuron local store and storing kernel values in kernel buffer/kernel local store. The examiner notes that Lu’s FIG. 3 teaches a dot product between a first value (e.g., value $I_{(r+i,\,c+j)}^{(n)}$ in a row of a feature map) stored in a neuron buffer memory of each PE and a second value (e.g., a corresponding column kernel $K_{(i,j)}^{(m,n)}$ in the K x K kernel shown in FIG. 3) stored in a second buffer memory of the PE.)
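For illustration only, the arrangement recited in the limitation above (a number of sets of input values held in a first buffer, one set of second values held in a second buffer, and one dot product computed per stored set) can be sketched as follows. This is a minimal sketch under stated assumptions: the class `ProcessingElement` and its methods are hypothetical and are not taken from Lu or from the application, and the buffers are modeled as plain Python lists.

```python
# Illustrative sketch only; ProcessingElement and its methods are hypothetical
# and model the claim language, not any circuit disclosed in Lu.

class ProcessingElement:
    """Software model of a PE with two local buffers."""

    def __init__(self):
        self.first_buffer = []    # sets of input (neuron) values
        self.second_buffer = []   # one set of kernel (weight) values

    def store(self, sets_of_inputs, weights):
        """Store several sets of input values and one set of weights."""
        self.first_buffer = [list(s) for s in sets_of_inputs]
        self.second_buffer = list(weights)

    def compute(self):
        """Return one dot product per stored set of input values."""
        return [sum(x * w for x, w in zip(s, self.second_buffer))
                for s in self.first_buffer]

pe = ProcessingElement()
pe.store([[1, 2], [3, 4], [5, 6]], [10, 1])
example_dot_products = pe.compute()  # [12, 34, 56]
```

Because the weights stay in the second buffer while the sets in the first buffer change, the same second values are reused for every dot product in this sketch.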
Song and Lu are analogous art because both pertain to parallelism for neural networks with an array of neural network accelerators.  
It would have been obvious for a person of ordinary skill in the art prior to the effective filing date to combine Song’s outputs of convolution operations that are a number of computed dot products (Song at ¶ 2, Abstract) with Lu’s “on-chip buffers” (Lu, supra) and “CONV Operation” (Lu, FIG. 3, supra).  The modification not only addresses the inefficient sequential shifting of data in conventional approaches with multiple clock cycles by providing randomly accessible local storage (Lu, p. 557, § 4.1, ¶ 1: “Compared with other designs, our design has more flexibility to cover multiple parallel types. For comparison, the PE of 2D-Mapping architecture is depicted in Figure 7 (b), the input neuron of this PE can only come from its left neighbor or down neighbor PEs, and consecutive neurons must belong to adjacent convolutional windows. Moreover, these neurons are shifted to left or up PEs sequentially in next few clock cycles since they are buffered in FIFOs. For PEs in FlexFlow (Figure 7 (a)), by contrast, operands are directly derived from on-chip buffers through vertical and horizontal buses to each PE, and buffered in randomly accessed local storages.”) but also tackles the highly computation-intensive requirement of convolutional layers with Lu’s feature map parallelism (FP), neuron parallelism (NP), and synapse parallelism (SP) (Lu, p. 554, § 2.1, right-hand column, ¶ 1: “It’s well-known that CONV layers are highly computation-intensive. For a typical CNN application, CONV layers take up more than 90% of the computation volume in both inference and training procedures. Fortunately, CONV computation exhibits intensive parallelism in feature map, neuron, and synapse levels (kernel level parallelism is intrinsically included in feature map and neuron levels)”; and p. 554, § 2.2, right-hand column, ¶¶ 2-5: “There are three types of parallelism according to different unrolling strategies for related loops”, “Feature map Parallelism (FP),” “Neuron Parallelism (NP),” and “Synapse Parallelism (SP)”.)

Song modified by Lu does not appear to explicitly teach:
a dot product of (i) each of the first number of sets of values and (ii) the second values, is computed by the MAC circuitry, and 

Shawahna does, however, teach: 
a dot product of (i) each of the first number of sets of values and (ii) the second values, is computed by the MAC circuitry, and (Shawahna at FIG. 2(a)

(reproduction omitted)

¶ 2, left-hand column, p. 7826: “In summary, the convolution operation comprises four levels of loops; the output FMs loop (Loop-4), the loop across the input FMs (Loop-3), the loop along the dimensions of a single input FM (scan operation, Loop-2), and the kernel window size loop (multiply-and-accumulate (MAC) operation, Loop-1).” FIG. 6 (reproduction omitted) and ¶ 1, right-hand column, p. 7838: “Finally, operator-level parallelism is achieved by parallelizing the k × k MACs operations needed for convolution operation in convolutional layers or the n MACs needed for inner-product computation in fully connected layers.” The examiner first notes that Shawahna’s convolution operation in convolution layers teaches a dot product between the input x and the weight matrix W as shown in FIG. 6.  The examiner further notes that Shawahna’s inner-product computation also teaches a dot product of the first and second values.  The examiner thus asserts that Shawahna’s MAC circuitry that performs the convolution between the input matrix and the weight matrix and sums the respective products as explicitly shown in FIG. 6 or performs a dot product as taught in Shawahna’s ¶ 1, right-hand column on p. 7838 teaches the above limitation.)
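For illustration only, and not as a characterization of any actual circuit in Shawahna, the relationship between repeated multiply-and-accumulate (MAC) steps and a dot product can be sketched as follows. The class name `MacUnit`, the function `dot_with_mac`, and the sample 3 × 3 window and kernel values are hypothetical and do not appear in the cited reference; the sketch only assumes that one MAC step computes `acc += a * b`.

```python
class MacUnit:
    """Hypothetical MAC unit: each step performs acc += a * b."""

    def __init__(self):
        self.acc = 0      # running sum of products
        self.steps = 0    # number of MAC operations issued

    def step(self, a, b):
        self.acc += a * b
        self.steps += 1
        return self.acc


def dot_with_mac(xs, ws):
    """Compute the dot product of xs and ws using one MAC step per element pair."""
    if len(xs) != len(ws):
        raise ValueError("operands must have the same length")
    mac = MacUnit()
    for x, w in zip(xs, ws):
        mac.step(x, w)
    return mac.acc, mac.steps


def flatten(matrix):
    """Flatten a k x k window or kernel into a length k*k vector."""
    return [value for row in matrix for value in row]


# Illustrative 3 x 3 input window and kernel (made-up values).
window = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]

result, n_macs = dot_with_mac(flatten(window), flatten(kernel))
```

Under these assumptions, one output value of a k × k convolution window takes k × k MAC steps (nine for the 3 × 3 example), and an n-element inner product for a fully connected layer takes n MAC steps.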
Song, Lu, and Shawahna are analogous art because all three pertain to parallelism for neural networks with an array of neural network accelerators.  
It would have been obvious for a person of ordinary skill in the art prior to the effective filing date to combine Song’s convolution operation outputs that are a number of computed dot products (see Song, supra) modified by Lu’s “on-chip buffers” (see Lu, supra) and “CONV Operation” (see Lu, supra) with Shawahna’s “multiplier and accumulator (MAC) circuitry” and “loop unrolling” (Shawahna, FIGS. 2(a) and 6 as well as ¶ 2, left-hand column, p. 7826, supra).  The modification with Shawahna’s MAC circuitry and loop unrolling for convolution operations allows for extremely fast and low power CNN implementations (Shawahna, ¶ 2, left-hand column, p. 7831: “Fortunately, recent digital signal processing (DSP)-oriented FPGAs include large numbers of multiply-and-accumulate (MAC) units which allow for extremely fast and low power CNN implementations.”) 
 
With respect to claim 9, Song modified by Lu teaches the method according to claim 7, and Song further teaches: 
the second number of computed dot products are outputs of convolution operations for the layer of the neural network. (Song at § 3.1.1, p. 4: “For a convolutional layer l, we use Fl to represent feature maps of this layer” and “[w]e use ⊗ to denote convolutions. The inference (forward propagation) can be represented as, Fl+1 = f (Fl ⊗ Wl) (1).” The examiner notes that Song’s feature map tensor(s) (e.g., Fl) teach a plurality of outputs of “a convolutional layer l” that are computed as dot products (⊗) between an input feature map (Fl) and the corresponding kernel tensor (Wl).)

Song does not appear to explicitly teach: 
wherein: a second number of sets of values, out of the plurality of second input values, are stored in the second buffer memory of the said each PE, 
a dot product of (i) the first values stored in the first buffer memory of the said each PE and (ii) each of the second number of sets of values stored in the second buffer memory of the said each PE, is computed by the MAC circuitry, and 
Lu does, however, teach: 
wherein: a second number of sets of values, out of the plurality of second input values, are stored in the second buffer memory of the said each PE, (Lu at ¶ 2, § 4, p. 557: “There are four key components: a convolutional unit, a pooling unit, three on-chip buffers (two neuron buffers and one kernel buffer), and an instruction decoder”; “A PE consists of a multiplier, an adder, a neuron local store, a kernel local store, and a controller, as Figure 7(a) shown.” ¶ 1, § 4.1, p. 557: “For PEs in FlexFlow (Figure 7 (a)), by contrast, operands are directly derived from on-chip buffers through vertical and horizontal buses to each PE, and buffered in randomly accessed local storages.” FIG. 7(a) on p. 557: 

(reproduction omitted)

Last paragraph, left-hand column, p. 558: “We propose a hierarchical dataflow with low control overhead, as Figure 9 shown. The distribution layer can be viewed as an interconnection structure, which routes the data from on-chip buffer to PEs.” 
Also see FIG. 12 (reproduction omitted).  The examiner first notes that FIG. 12(a) shows that Lu’s “kernel buffer” stores the values of the kernel (e.g., “synapse values” as taught in ¶ 2, § 2.1, p. 554). The examiner further notes that FIG. 7(a) shows that the PE performs multiplication (* in FIG. 7(a)) on the values of the kernel (“synapse values” as taught in ¶ 2, § 2.1, p. 554) from the “kernel local store” and the neuron data in the “neuron local store”. Moreover, the values of the kernel in the kernel local store are obtained “from kernel buffer”, and the input neuron values in the “neuron local store” are obtained “[f]rom neuron buffer” as explicitly shown in FIG. 7(a).  Therefore, the examiner asserts that Lu’s storing kernel data in an on-chip kernel buffer teaches each PE storing a second number of sets of values in the second buffer memory of each PE.)
 
a dot product of (i) the first values stored in the first buffer memory of the said each PE and (ii) each of the second number of sets of values stored in the second buffer memory of the said each PE, is computed, and (Lu at FIG. 3: $O^{(m)}_{(r,c)} \mathrel{+}= K^{(m,n)}_{(i,j)} \times I^{(n)}_{(r+i,\,c+j)}$.
¶¶ 2 and 4 as well as FIGS. 7(a) and 12, cited immediately above, teach storing neuron data in the neuron buffer/neuron local store and storing kernel values in the kernel buffer/kernel local store. The examiner notes that Lu’s FIG. 3 teaches a dot product between a first value (e.g., a value in a row of a feature map) stored in a neuron buffer memory of each PE and a second value (e.g., a corresponding column kernel in the K x K kernel shown in FIG. 3) stored in a second buffer memory of the PE. The examiner further notes that Lu’s FIG. 3 shows iterating through m, n, r, c, i, and j from 0 to their respective upper bounds (e.g., “m=0; m<M; m++”), and that iterating through each of the unrolling factors teaches computing a dot product of each of the second number of sets of values of each PE with the first values stored in the first buffer memory.)
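For illustration only, the accumulation O(m)(r,c) += K(m,n)(i,j) × I(n)(r+i, c+j) discussed above can be written as a plain six-level loop nest. This is a minimal sketch and not a reproduction of Lu’s FIG. 3: the function name `conv_layer`, the nested-list data layout, the loop order, unit stride, and the absence of padding are all assumptions made for this example.

```python
def conv_layer(inputs, kernels):
    """Direct convolution: out[m][r][c] += kernels[m][n][i][j] * inputs[n][r+i][c+j].

    inputs  : N input feature maps, each H x W          -> inputs[n][row][col]
    kernels : M x N kernels, each K x K                 -> kernels[m][n][i][j]
    returns : M output feature maps, each R x C         -> out[m][r][c]
    Assumes stride 1 and no padding (illustrative choices).
    """
    M = len(kernels)            # number of output feature maps
    N = len(inputs)             # number of input feature maps
    K = len(kernels[0][0])      # kernel height/width
    H, W = len(inputs[0]), len(inputs[0][0])
    R, C = H - K + 1, W - K + 1  # output feature map size

    out = [[[0] * C for _ in range(R)] for _ in range(M)]
    for m in range(M):                      # output feature maps
        for n in range(N):                  # input feature maps
            for r in range(R):              # output rows
                for c in range(C):          # output columns
                    for i in range(K):      # kernel rows
                        for j in range(K):  # kernel columns
                            out[m][r][c] += kernels[m][n][i][j] * inputs[n][r + i][c + j]
    return out


# Made-up example: one 3 x 3 input map and one 2 x 2 kernel.
example_inputs = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]
example_kernels = [[[[1, 0], [0, 1]]]]
example_out = conv_layer(example_inputs, example_kernels)
```

For fixed m, r, and c, the inner loops over n, i, and j sum element-wise products of kernel values and input values, which is the dot-product reading relied upon above.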
Song and Lu are analogous art because both pertain to parallelism for neural networks with an array of neural network accelerators.  
It would have been obvious for a person of ordinary skill in the art prior to the effective filing date to combine Song’s convolution operation outputs that are a number of computed dot products (see Song ¶ 2, Abstract, supra) with Lu’s “on-chip buffers” (Lu at §§ 4 and 4.1 as well as FIGS. 7(a) and 12(a), supra) and “CONV Operation” (Lu, FIG. 3, supra).  The modification not only addresses the inefficient sequential shifting of data in conventional approaches with multiple clock cycles by providing randomly accessible local storage (Lu, p. 557, § 4.1, ¶ 1: “Compared with other designs, our design has more flexibility to cover multiple parallel types. For comparison, the PE of 2D-Mapping architecture is depicted in Figure 7 (b), the input neuron of this PE can only come from its left neighbor or down neighbor PEs, and consecutive neurons must belong to adjacent convolutional windows. Moreover, these neurons are shifted to left or up PEs sequentially in next few clock cycles since they are buffered in FIFOs. For PEs in FlexFlow (Figure 7 (a)), by contrast, operands are directly derived from on-chip buffers through vertical and horizontal buses to each PE, and buffered in randomly accessed local storages.”) but also tackles the highly computation-intensive requirement of convolutional layers with Lu’s feature map parallelism (FP), neuron parallelism (NP), and synapse parallelism (SP) (Lu, p. 554, § 2.1, right-hand column, ¶ 1: “It’s well-known that CONV layers are highly computation-intensive. For a typical CNN application, CONV layers take up more than 90% of the computation volume in both inference and training procedures. Fortunately, CONV computation exhibits intensive parallelism in feature map, neuron, and synapse levels (kernel level parallelism is intrinsically included in feature map and neuron levels)”; and p. 554, § 2.2, right-hand column, ¶¶ 2-5: “There are three types of parallelism according to different unrolling strategies for related loops”, “Feature map Parallelism (FP),” “Neuron Parallelism (NP),” and “Synapse Parallelism (SP)”.) 

Song modified by Lu does not appear to explicitly teach: 
a dot product of (i) the first values and (ii) each of the second number of sets of values, is computed by the MAC circuitry, and 

Shawahna does, however, teach: 
a dot product of (i) the first values and (ii) each of the second number of sets of values, is computed by the MAC circuitry, and (Shawahna, p. 7826, FIG. 2(a):

(reproduction omitted)

p. 7826, left-hand column, ¶ 2: “In summary, the convolution operation comprises four levels of loops; the output FMs loop (Loop-4), the loop across the input FMs (Loop-3), the loop along the dimensions of a single input FM (scan operation, Loop-2), and the kernel window size loop (multiply-and-accumulate (MAC) operation, Loop-1).” FIG. 6 (reproduction omitted) and ¶ 1, right-hand column, p. 7838: “Finally, operator-level parallelism is achieved by parallelizing the k × k MACs operations needed for convolution operation in convolutional layers or the n MACs needed for inner-product computation in fully connected layers.” The examiner first notes that Shawahna’s convolution operation in convolution layers teaches a dot product between the input x and the weight matrix W as shown in FIG. 6.  The examiner further notes that Shawahna’s inner-product computation also teaches a dot product of the first and second values.  The examiner thus asserts that Shawahna’s MAC circuitry, which performs the convolution between the input matrix and the weight matrix and sums the respective products as explicitly shown in FIG. 6, or which performs a dot product as taught in Shawahna at p. 7838, right-hand column, ¶ 1, teaches the above limitation.)
Song, Lu, and Shawahna are analogous art because all three pertain to parallelism for neural networks with an array of neural network accelerators.  
It would have been obvious for a person of ordinary skill in the art prior to the effective filing date to combine Song’s convolution operation outputs (see Song, supra; for example, Eq. (1) cited above) modified by Lu’s “on-chip buffers” (Lu, §§ 4 and 4.1 as well as FIGS. 7(a) and 12(a), supra) and “CONV Operation” (Lu, FIG. 3, supra) with Shawahna’s “multiplier and accumulator (MAC) circuitry” and “loop unrolling” (Shawahna at FIGS. 2(a) and 6 as well as ¶ 2, left-hand column, p. 7826, supra).  One of ordinary skill in the art would have been motivated to make this combination because the modification with Shawahna’s MAC circuitry and loop unrolling for convolution operations allows for extremely fast and low power CNN implementations (Shawahna, ¶ 2, left-hand column, p. 7831: “Fortunately, recent digital signal processing (DSP)-oriented FPGAs include large numbers of multiply-and-accumulate (MAC) units which allow for extremely fast and low power CNN implementations.”) 

With respect to claim 10, Song modified by Lu teaches the method according to claim 7, and Song further teaches: 
wherein the first plurality of input values represent one of input data streams and weights for the layer of the neural network, and (Song at ¶ 1, § 2.1, p. 2: “The inference of deep neural networks is a forward progress of input data (typically images) from the first layer to the last layer. Kernels (weights) of a network are obtained through training before the inference.” ¶ 1, § 3.1, p. 4: “Assume we have two accelerators, the batch size is B = 32. Let us consider a fully-connected layer, where the number of input and output neurons are 70 and 100, respectively. Thus, the feature map Fl has a size of 32 × 70, the kernel (weight matrix) has a size of 70 × 100 and Fl+1 has a size of 32 × 100.”  ¶ 1, § 3.1.2, p. 4: “In model parallelism, the kernel is partitioned, and feature maps are partitioned accordingly. In forward, each accelerator performs computation for the matrices Fl → Wl ⇒ Fl+1 with sizes of [32 × 35] → [35 × 100] ⇒ [32 × 100].”  ¶ 1, § 3.3, p. 5: “At each clock cycle, it handles Tn input feature maps and Tm output feature maps (i.e. Multiple Feature maps), one neuron of each output feature map (i.e. Single Neuron), and one single synapse (i.e. Single Synapse) of each kernel.”
The examiner first notes that the present disclosure describes input data streams as “image data” in ¶ [0078].  The examiner further notes that Song’s input neuron data such as “images” as taught in ¶ 1, § 2.1, p. 2 cited immediately above teaches input data streams in view of ¶ [0078] of the present disclosure.  Moreover, Song’s kernel (“weight matrix” as taught in ¶ 1, § 3.1, p. 4 cited above) teaches the claimed weights.  The examiner therefore asserts that Song’s receiving the input data streams and weights for a layer at each clock cycle teaches input data streams and weights for the layer of the neural network.  The examiner further notes that, in Song’s two-accelerator example in Song’s § 3, each accelerator receives the respective input portion Fl (e.g., of size [32 x 35] from the original feature map having the size of [32 x 70]) and the corresponding kernel (e.g., Wl above), which teaches that the first plurality of input values represent either the input feature map or the kernel and thus represent one of the input data streams and weights as claimed.)
 
the second plurality of input values represent the other of input data streams and weights for the layer of the neural network. (Song at ¶ 1, § 2.1, p. 2: “The inference of deep neural networks is a forward progress of input data (typically images) from the first layer to the last layer. Kernels (weights) of a network are obtained through training before the inference.” ¶ 1, § 3.1, p. 4: “Assume we have two accelerators, the batch size is B = 32. Let us consider a fully-connected layer, where the number of input and output neurons are 70 and 100, respectively. Thus, the feature map Fl has a size of 32 × 70, the kernel (weight matrix) has a size of 70 × 100 and Fl+1 has a size of 32 × 100.”  ¶ 1, § 3.1.2, p. 4: “In model parallelism, the kernel is partitioned, and feature maps are partitioned accordingly. In forward, each accelerator performs computation for the matrices Fl → Wl ⇒ Fl+1 with sizes of [32 × 35] → [35 × 100] ⇒ [32 × 100].”  ¶ 1, § 3.3, p. 5: “At each clock cycle, it handles Tn input feature maps and Tm output feature maps (i.e. Multiple Feature maps), one neuron of each output feature map (i.e. Single Neuron), and one single synapse (i.e. Single Synapse) of each kernel.”
The examiner first notes that the present disclosure describes input data streams as “image data” in ¶ [0078].  The examiner thus notes that Song’s input neuron data such as “images” as taught in ¶ 1, § 2.1, p. 2 cited above teaches input data streams in view of ¶ [0078] of the present disclosure. Moreover, Song’s kernel (“weight matrix” as taught in ¶ 1, § 3.1, p. 4 cited above) teaches the claimed weights.  The examiner therefore asserts that Song’s receiving the input data streams and kernel (weights) at each clock cycle teaches input data streams and weights, respectively.  The examiner further notes that, in Song’s two-accelerator example in Song’s § 3, each accelerator receives the respective input portion Fl (e.g., of size [32 x 35] from the original feature map having the size of [32 x 70]) and the corresponding kernel (e.g., Wl above), which teaches that the second plurality of input values represent the other of the input data streams and weights as claimed.)
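For illustration only, the shapes in Song’s two-accelerator example quoted above ([32 × 70] feature map, [70 × 100] kernel, each accelerator handling [32 × 35] and [35 × 100]) can be checked with a short sketch. The helper `matmul`, the synthetic integer data, and in particular the element-wise summation of the two partial results are assumptions made for this example; the quoted passage gives only the matrix sizes.

```python
def matmul(a, b):
    """Plain matrix product of a (p x q) and b (q x s) given as nested lists."""
    inner = len(b)
    cols = len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


BATCH, N_IN, N_OUT = 32, 70, 100

# Synthetic integer data standing in for the feature map Fl and kernel Wl.
F = [[(i + k) % 5 for k in range(N_IN)] for i in range(BATCH)]    # 32 x 70
W = [[(k - j) % 3 for j in range(N_OUT)] for k in range(N_IN)]    # 70 x 100

# Partition the shared (input-neuron) dimension across two accelerators.
half = N_IN // 2
F0, F1 = [row[:half] for row in F], [row[half:] for row in F]     # each 32 x 35
W0, W1 = W[:half], W[half:]                                       # each 35 x 100

# Each accelerator computes a [32 x 35] by [35 x 100] product -> [32 x 100].
P0 = matmul(F0, W0)
P1 = matmul(F1, W1)

# Assumed combination step: element-wise sum of the partial results.
combined = [[x + y for x, y in zip(r0, r1)] for r0, r1 in zip(P0, P1)]
```

Under these assumptions, each accelerator holds one portion of the input values and one portion of the weights, and the summed partial results equal the unpartitioned product `matmul(F, W)`.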
 
With respect to claim 17, it is substantially similar to claim 7 and is rejected in the same manner, the same art and reasoning applying. 
 
With respect to claim 18, it is substantially similar to claim 8 and is rejected in the same manner, the same art and reasoning applying. 

With respect to claim 19, it is substantially similar to claim 9 and is rejected in the same manner, the same art and reasoning applying. 

With respect to claim 20, it is substantially similar to claim 10 and is rejected in the same manner, the same art and reasoning applying. 


Conclusion
11.	The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
(a)	Chen et al., Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks (Jan. 31-Feb. 4, 2016) teaches that existing accelerators do not support the configurability necessary to efficiently support large CNNs with different shapes and that using mobile GPUs can be expensive, and describes an accelerator that can deliver state-of-the-art accuracy with minimum energy consumption in the system (including DRAM) in real time by using two key methods: (1) an efficient dataflow and supporting hardware (spatial array, memory hierarchy, and on-chip network) that minimize data movement by exploiting data reuse and support different shapes; and (2) exploiting data statistics to minimize energy through zero skipping/gating to avoid unnecessary reads and computations, and data compression to reduce off-chip memory bandwidth, which is the most expensive data movement.
(b)	Park et al., Toward Optimal FPGA Implementation of Deep Convolutional Neural Networks for Handwritten Hangul Character Recognition (Mar. 2018) presents a field programmable gate array (FPGA)-based hardware 

12.	Any inquiry concerning this communication or earlier communications from the examiner should be directed to ERICH C. TZOU whose telephone number is (571)272-9852.  The examiner can normally be reached on Monday-Friday 7:30AM-5:00PM EST with alternative Fridays off.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ann J. Lo, can be reached at 571-272-9767.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer 



/E.C.T./Examiner, Art Unit 2126
/ANN J LO/Supervisory Patent Examiner, Art Unit 2126