DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claims 1-20 are pending and have been examined.
The present application was filed on 04/21/2018.
Information Disclosure Statement

The information disclosure statements (IDS) were submitted on 07/18/2018 and 10/08/2019.  The submissions are in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statements are being considered by the examiner.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claims 1, 8, and 15 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Jordan et al.  (US 5,949,920 A).
Regarding Claim 1,
Jordan et al. teaches a neural network processor configured to perform convolution operations on input data and N by N matrices, where N is a positive integer greater than one (col. 1, lines 44-46: “ … a  reconfigurable convolver for performing a convolution of pixels of an image is provided …” teaches a reconfigurable convolver [neural network processor] performing  convolution of pixels of an image [convolution operations on input data];  
col 1, lines 14-17 “ … Convolution is a weighted sum of pixels in the neighborhood of a source pixel. The weights are determined by a matrix of coefficients called a convolution mask or convolution kernel, which is usually square …” teaches a convolution mask comprising a matrix of coefficients [N by N matrices] used in convolution process;
col 1, lines 29-32 “ … Performing a 7 x 7 convolution on a two dimensional image requires 49 multiplies and 48 adds for each output pixel generated.  Images that are filtered with a 7 x 7 convolution mask often have 256x256 or 512x512 pixels …” teaches 7 x 7 convolution mask [N by N matrices, wherein N is a positive integer greater than one]), 
	the neural network processor comprising:
	a plurality of multiplier circuits (col 3, lines 15-22 “A block diagram of an example of a convolver circuit in accordance with the invention is shown in FIG. 1. Pixel values of a convolution window and coefficients of a convolution mask are supplied to inputs of a multiplier unit 10. The multiplier 10 includes array of multipliers for performing M x M multiplications in parallel.  In the example of FIG. 1, the multiplier unit includes 25 multipliers in a 5x5 multiplier array …” teaches a multiplier unit [a plurality of multiplier circuits]);
	a window expander circuit comprising:
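	a first logic circuit configured to store a set of data elements, corresponding to at least a subset of the input data, into a Q number of logical memories, wherein each of a P number of data elements of the set of the data elements is stored in each of the Q number of logical memories, wherein P is an integer equal to or greater than one and Q is an integer equal to or greater than N (col. 5, lines 15-23 “A circuit configuration for performing a 5 x 5 convolution of an image, using the 5x5 convolver circuit shown in FIGS. 1-3 … is shown schematically in FIG. 4 … Pixel values for the pixels of the convolution window are supplied to the multipliers by pixel buffers 80, 82, 84 and 86.  Each of the pixel buffers may hold the pixel values of one row, or line, of the image …” teaches a row of pixel values in an image stored in pixel buffers [a first logic circuit configured to store a set of data elements, corresponding to at least a subset of the input data]; 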
col.3, lines 25-29   “In a 5x5 convolution, the 5x5 multiplier unit 10 receives 25 pixel values of a convolution window and 25 corresponding coefficients of a convolution mask, and produces 25 products R0, Rl, ... R24. The summer 12 combines the products and provides a result S …” and col. 3, lines 53-54 “Memory 20 has sufficient capacity to store an intermediate result for each pixel in the image …”  teaches each of the 25 products/intermediate results  [each of a P number of data elements of the set of data elements] being stored across 25 logical memories [Q logical memories], with P = 25 [wherein P is an integer equal to or greater than one] and Q is greater than one of the dimension of the 5 x 5 convolution mask [Q is an integer equal to or greater than N]),  and
	a second logic circuit configured to receive the first set of data elements and additional data elements corresponding to the subset of the input data from the Q number of logical memories (col.3, lines 25-29   “In a 5x5 convolution, the 5x5 multiplier unit 10 receives 25 pixel values of a convolution window and 25 corresponding coefficients of a convolution mask, and produces 25 products R0, Rl, ... R24. The summer 12 combines the products and provides a result S …” and col. 3, lines 53-54 “Memory 20 has sufficient capacity to store an intermediate result for each pixel in the image …”  teaches the pixel values [first set of data elements] and 25 corresponding coefficients of a convolution mask [additional data elements] being received by the multiplier unit [second logic circuit] ) 
	and expand the at least the subset of input data until the at least the subset of the input data is expanded based on a predetermined factor selected at least to increase utilization of the plurality of multiplier circuits (col 6, lines 5-19: “It will be apparent that pixel values will not be available for all of the locations in the convolution window near the edges of the image … The lack of pixel values can be addressed in several ways.  In one approach, pixels near the edge of the image are not convolved, and the output image is smaller than the source image. This approach is less desirable for large convolution windows.  In another approach, arbitrary pixel values, such as for example, constant values, are used to fill the empty locations in the convolution window.  In still another approach, the pixel values in a row or column at the edge of the image are duplicated and are used to fill the empty locations in the convolution window”  teaches adding arbitrary pixel values via addition or duplication [expand at least the subset of input data] until the number of empty locations in the convolution window are filled [based on a predetermined factor], which triggers an increase in utilization of multiplier unit [selected at least to increase utilization of the plurality of multiplier circuits]).
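For illustration only, the two fill approaches quoted above from Jordan et al. (constant values and duplication of edge pixels) can be sketched in software as follows. Jordan et al. describes a hardware circuit; the function names `window_at` and `convolve_pixel` and the `mode` and `fill` parameters are assumptions introduced for this sketch and do not appear in the reference.

```python
def window_at(image, row, col, size, mode="duplicate", fill=0):
    """Return the size x size convolution window centered on (row, col).

    Locations that fall outside the image are filled either with a
    constant value (mode="constant") or with the nearest edge pixel
    (mode="duplicate"). Both names are hypothetical labels for the two
    approaches quoted from the reference.
    """
    if mode not in ("constant", "duplicate"):
        raise ValueError("mode must be 'constant' or 'duplicate'")
    half = size // 2
    height, width = len(image), len(image[0])
    window = []
    for r in range(row - half, row + half + 1):
        line = []
        for c in range(col - half, col + half + 1):
            if 0 <= r < height and 0 <= c < width:
                line.append(image[r][c])      # pixel value is available
            elif mode == "constant":
                line.append(fill)             # arbitrary constant value
            else:
                rr = min(max(r, 0), height - 1)
                cc = min(max(c, 0), width - 1)
                line.append(image[rr][cc])    # duplicated edge pixel
        window.append(line)
    return window


def convolve_pixel(window, mask):
    """Weighted sum of the window pixels using the mask coefficients."""
    return sum(p * w
               for pixel_row, mask_row in zip(window, mask)
               for p, w in zip(pixel_row, mask_row))
```

In this sketch every window location holds a value before the products are formed, which corresponds to the filled convolution window relied upon in the mapping above.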
Regarding Claim 8,
	Jordan et al. teaches a method in a neural network processor configured to perform convolution operations on input data and N by N matrices, where N is a positive integer greater than one (col. 1, lines 44-46: “ … a  reconfigurable convolver for performing a convolution of pixels of an image is provided …” teaches a reconfigurable convolver [neural network processor] performing  convolution of pixels of an image [convolution operations on input data];  
col 1, lines 14-17 “ … Convolution is a weighted sum of pixels in the neighborhood of a source pixel. The weights are determined by a matrix of coefficients called a convolution mask or convolution kernel, which is usually square …” teaches a convolution mask comprising a matrix of coefficients [N by N matrices] used in convolution process;
col 1, lines 29-32 “ … Performing a 7 x 7 convolution on a two dimensional image requires 49 multiplies and 48 adds for each output pixel generated.  Images that are filtered with a 7 x 7 convolution mask often have 256x256 or 512x512 pixels …” teaches 7 x 7 convolution mask [N by N matrices, wherein N is a positive integer greater than one]), 
	wherein the neural network comprises a plurality of multiply circuits (col 3, lines 15-22 “A block diagram of an example of a convolver circuit in accordance with the invention is shown in FIG. 1. Pixel values of a convolution window and coefficients of a convolution mask are supplied to inputs of a multiplier unit 10. The multiplier 10 includes array of multipliers for performing M x M multiplications in parallel.  In the example of FIG. 1, the multiplier unit includes 25 multipliers in a 5x5 multiplier array …” teaches a multiplier unit [a plurality of multiplier circuits]),
	the method comprising: 
	automatically determining whether the input data received by the neural network processor requires expansion (col 6, lines 5-7: “It will be apparent that pixel values will not be available for all of the locations in the convolution window near the edges of the image. For example, when the top row of the image is being convolved, pixel values are not available for the first two rows of the 5 x 5 convolution window …” teaches missing pixel values in the first two rows of the convolution window as an indicator of necessary expansion [whether the input data received by the neural network model requires expansion]); and
	when the input data requires expansion: (1) storing a first set of data elements, corresponding to at least a subset of the input data, into a Q number of logical memories, wherein each of a P number of data elements of the first set of the data elements is stored in each of the Q number of logical memories, wherein P is an integer equal to or greater than one and Q is an integer equal to or greater than N (col 6, lines 5-19: “It will be apparent that pixel values will not be available for all of the locations in the convolution window near the edges of the image … The lack of pixel values can be addressed in several ways.  In one approach, pixels near the edge of the image are not convolved, and the output image is smaller than the source image. This approach is less desirable for large convolution windows.  In another approach, arbitrary pixel values, such as for example, constant values, are used to fill the empty locations in the convolution window.  In still another approach, the pixel values in a row or column at the edge of the image are duplicated and are used to fill the empty locations in the convolution window”  teaches adding arbitrary pixel values via addition or duplication;
 col. 5, lines 15-23 “A circuit configuration for performing a 5 x 5 convolution of an image, using the 5x5 convolver circuit shown in FIGS. 1-3 … is shown schematically in FIG. 4 … Pixel values for the pixels of the convolution window are supplied to the multipliers by pixel buffers 80, 82, 84 and 86.  Each of the pixel buffers may hold the pixel values of one row, or line, of the image …” teaches a row of pixel values in an image stored in pixel buffers [a first logic circuit configured to store a set of data elements, corresponding to at least a subset of the input data]; 
col.3, lines 25-29   “In a 5x5 convolution, the 5x5 multiplier unit 10 receives 25 pixel values of a convolution window and 25 corresponding coefficients of a convolution mask, and produces 25 products R0, Rl, ... R24. The summer 12 combines the products and provides a result S …” and col. 3, lines 53-54 “Memory 20 has sufficient capacity to store an intermediate result for each pixel in the image …”  teaches each of the 25 products/intermediate results  [each of a P number of data elements of the set of data elements] being stored across 25 logical memories [Q logical memories], with P = 25 [wherein P is an integer equal to or greater than one] and Q is greater than one of the dimension of the 5 x 5 convolution mask [Q is an integer equal to or greater than N]),
(2) shifting the first set of data elements from the Q number of logical memories into a first column of an array structure and storing a second set of data elements, corresponding to the subset of input data, into the Q number of logical memories (col. 6, lines 43-63: “ The first pass of the 7x7 convolution is performed as follows. Buffers 80, 82 and 84 load pixel values into the pixel value registers of the multipliers in the first three rows of the convolution window, and multipliers M21, M22, M23 and M24 are loaded with pixel values from the data source. A subset of the coefficients of a 7x7 convolution mask are loaded into the corresponding coefficient registers in each of the multipliers. Multipliers M0, Ml, ... M24 multiply the values in the respective pixel value registers and coefficient registers to provide products. The products R0, Rl ... R24 are combined by summer 12 to provide an intermediate result for the first pixel in the first row of the image. The intermediate result is stored in memory 20 at an address corresponding to the pixel being processed. Then the pixel values in the pixel value registers of each row of the convolution window are shifted one position to the right, new pixel values are shifted into the pixel value registers of multipliers M0, M7 and M14 from buffers 80, 82 and 84, respectively, and a new pixel value is loaded into the pixel value register of multiplier M21 from the data source …” and col. 9, lines 63-65 “ … The pixels of an image are typically convolved row by row. 
However, the pixels can be convolved column by column, or in any other desired order … “ teaches shifting, on the first pass, pixel values from memory buffers [shifting the first set of data elements from the Q number of logical memories] into the pixel value registers of the array of multipliers M0, M1, … , M24 in multiplier unit, therefore populating at least the first column M0, M7, and M14 of multipliers in multiplier unit [first column of an array structure] and
teaches new pixel values being available in memory buffers after shift of first set of data elements into the array of multipliers M0, M1, … , M24 in multiplier unit [storing a second set of data elements, corresponding to the subset of input data, into the Q number of logical memories]),
	(3) shifting the first set of the data elements from the first column of the array structure into a second column of the array structure and shifting the second set of data elements from the Q number of logical memories into the first column of the array structure (col. 6, lines 43-63: “ The first pass of the 7x7 convolution is performed as follows. Buffers 80, 82 and 84 load pixel values into the pixel value registers of the multipliers in the first three rows of the convolution window, and multipliers M21, M22, M23 and M24 are loaded with pixel values from the data source. A subset of the coefficients of a 7x7 convolution mask are loaded into the corresponding coefficient registers in each of the multipliers. Multipliers M0, Ml, ... M24 multiply the values in the respective pixel value registers and coefficient registers to provide products. The products R0, Rl ... R24 are combined by summer 12 to provide an intermediate result for the first pixel in the first row of the image. The intermediate result is stored in memory 20 at an address corresponding to the pixel being processed. Then the pixel values in the pixel value registers of each row of the convolution window are shifted one position to the right, new pixel values are shifted into the pixel value registers of multipliers M0, M7 and M14 from buffers 80, 82 and 84, respectively, and a new pixel value is loaded into the pixel value register of multiplier M21 from the data source …” and col. 9, lines 63-65 “ … The pixels of an image are typically convolved row by row. However, the pixels can be convolved column by column, or in any other desired order … “ teaches the pixel values in pixel value registers from first pass being shifted one position to the right in the array of multipliers in the multiplier unit [shifting the first set of the data elements from the first column of the array structure into a second column of the array structure] and 
	teaches new pixel values being shifted from memory buffers [shifting the second set of data elements from the Q number of logical memories] into the pixel value registers of the array of multipliers M0, M7, and M14 in multiplier unit [first column of an array structure]), and
	(4) repeating storing and shifting steps using additional data elements corresponding to the subset of the input data until the subset of the input data is expanded based on a predetermined factor selected at least to increase utilization of the plurality of the multiplier circuits (col. 6, lines 43-67 & col 7, lines 1-3: “ The first pass of the 7x7 convolution is performed as follows. Buffers 80, 82 and 84 load pixel values into the pixel value registers of the multipliers in the first three rows of the convolution window, and multipliers M21, M22, M23 and M24 are loaded with pixel values from the data source. A subset of the coefficients of a 7x7 convolution mask are loaded into the corresponding coefficient registers in each of the multipliers. Multipliers M0, Ml, ... M24 multiply the values in the respective pixel value registers and coefficient registers to provide products. The products R0, Rl ... R24 are combined by summer 12 to provide an intermediate result for the first pixel in the first row of the image. The intermediate result is stored in memory 20 at an address corresponding to the pixel being processed. Then the pixel values in the pixel value registers of each row of the convolution window are shifted one position to the right, new pixel values are shifted into the pixel value registers of multipliers M0, M7 and M14 from buffers 80, 82 and 84, respectively, and a new pixel value is loaded into the pixel value register of multiplier M21 from the data source. The multiplications for the second pixel in the first row are performed, and the products R0, R1 … R24 are combined by summer 12 to provide an intermediate result for the second pixel in the first row of the image … This process is repeated for each pixel in the image until the intermediate result for each pixel has been loaded into memory 20, thus completing the first pass of the 7 x 7 convolution …” teaches repeating the storing and shifting steps using all pixels in the image; 
col 6, lines 5-19: “It will be apparent that pixel values will not be available for all of the locations in the convolution window near the edges of the image … The lack of pixel values can be addressed in several ways.  In one approach, pixels near the edge of the image are not convolved, and the output image is smaller than the source image. This approach is less desirable for large convolution windows.  In another approach, arbitrary pixel values, such as for example, constant values, are used to fill the empty locations in the convolution window.  In still another approach, the pixel values in a row or column at the edge of the image are duplicated and are used to fill the empty locations in the convolution window”  teaches adding arbitrary pixel values via addition or duplication [expand at least the subset of input data] until the number of empty locations in the convolution window are filled [based on a predetermined factor], which triggers an increase in utilization of multiplier unit [selected at least to increase utilization of the plurality of the multiplier circuits]).
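For illustration only, the shift operation described in the quoted passage of Jordan et al. (col. 6, lines 43-63), in which the pixel values in each row of the convolution window move one position to the right and new pixel values enter the first column, can be sketched as follows. The function name `shift_in_column` and the list-of-lists representation of the pixel value registers are assumptions made for this sketch; the reference describes hardware registers and buffers.

```python
def shift_in_column(window, new_column):
    """Shift every row one position to the right and load new values.

    `window` models the pixel value registers, one list per row.
    `new_column` holds one new pixel value per row, as supplied by the
    row buffers or the data source. The rightmost value of each row is
    discarded, and the new value occupies the first column.
    """
    if len(new_column) != len(window):
        raise ValueError("one new pixel value is required per row")
    return [[new] + row[:-1] for row, new in zip(window, new_column)]


# Usage: two successive shifts on a 2 x 3 register array.
registers = [[1, 2, 3], [4, 5, 6]]
registers = shift_in_column(registers, [9, 8])   # first set enters column 0
registers = shift_in_column(registers, [7, 6])   # first set moves to column 1
```

After the second call, the values loaded first sit in the second column and the values loaded second sit in the first column, which parallels steps (2) and (3) as mapped above.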
Regarding Claim 15,
	Jordan et al. teaches a neural network processor configured to perform convolution operations on input data and N by N matrices, where N is a positive integer greater than one (col. 1, lines 44-46: “ … a  reconfigurable convolver for performing a convolution of pixels of an image is provided …” teaches a reconfigurable convolver [neural network processor] performing  convolution of pixels of an image [convolution operations on input data];  
col 1, lines 14-17 “ … Convolution is a weighted sum of pixels in the neighborhood of a source pixel. The weights are determined by a matrix of coefficients called a convolution mask or convolution kernel, which is usually square …” teaches a convolution mask comprising a matrix of coefficients [N by N matrices] used in convolution process;
col 1, lines 29-32 “ … Performing a 7 x 7 convolution on a two dimensional image requires 49 multiplies and 48 adds for each output pixel generated.  Images that are filtered with a 7 x 7 convolution mask often have 256x256 or 512x512 pixels …” teaches 7 x 7 convolution mask [N by N matrices, wherein N is a positive integer greater than one]), 
the neural network processor comprising:
	a plurality of multiplier circuits (col 3, lines 15-22 “A block diagram of an example of a convolver circuit in accordance with the invention is shown in FIG. 1. Pixel values of a convolution window and coefficients of a convolution mask are supplied to inputs of a multiplier unit 10. The multiplier 10 includes array of multipliers for performing M x M multiplications in parallel.  In the example of FIG. 1, the multiplier unit includes 25 multipliers in a 5x5 multiplier array …” teaches a multiplier unit [a plurality of multiplier circuits]);
	a window expander circuit comprising:
	a first logic circuit configured to store a set of data elements, corresponding to at least a subset of the input data, into a Q number of logical memories, wherein each of a P number of data elements of the set of the data elements is stored in each of the Q number of logical memories, wherein P is an integer equal to or greater than one and Q is an integer equal to or greater than N (col. 5, lines 15-23 “A circuit configuration for performing a 5 x 5 convolution of an image, using the 5x5 convolver circuit shown in FIGS. 1-3 … is shown schematically in FIG. 4 … Pixel values for the pixels of the convolution window are supplied to the multipliers by pixel buffers 80, 82, 84 and 86.  Each of the pixel buffers may hold the pixel values of one row, or line, of the image …” teaches a row of pixel values in an image stored in pixel buffers [a first logic circuit configured to store a set of data elements, corresponding to at least a subset of the input data]; 
col.3, lines 25-29   “In a 5x5 convolution, the 5x5 multiplier unit 10 receives 25 pixel values of a convolution window and 25 corresponding coefficients of a convolution mask, and produces 25 products R0, Rl, ... R24. The summer 12 combines the products and provides a result S …” and col. 3, lines 53-54 “Memory 20 has sufficient capacity to store an intermediate result for each pixel in the image …”  teaches each of the 25 products/intermediate results  [each of a P number of data elements of the set of data elements] being stored across 25 logical memories [Q logical memories], with P = 25 [wherein P is an integer equal to or greater than one] and Q is greater than one of the dimension of the 5 x 5 convolution mask [Q is an integer equal to or greater than N]),  and
	a second logic circuit configured to receive the first set of data elements and additional data elements corresponding to the subset of the input data from the Q number of logical memories (col.3, lines 25-29   “In a 5x5 convolution, the 5x5 multiplier unit 10 receives 25 pixel values of a convolution window and 25 corresponding coefficients of a convolution mask, and produces 25 products R0, Rl, ... R24. The summer 12 combines the products and provides a result S …” and col. 3, lines 53-54 “Memory 20 has sufficient capacity to store an intermediate result for each pixel in the image …”  teaches the pixel values [first set of data elements] and 25 corresponding coefficients of a convolution mask [additional data elements] being received by the multiplier unit [second logic circuit] ) 
	and expand the at least the subset of input data until the at least the subset of the input data is expanded based on a predetermined factor (col 6, lines 5-19: “It will be apparent that pixel values will not be available for all of the locations in the convolution window near the edges of the image … The lack of pixel values can be addressed in several ways.  In one approach, pixels near the edge of the image are not convolved, and the output image is smaller than the source image. This approach is less desirable for large convolution windows.  In another approach, arbitrary pixel values, such as for example, constant values, are used to fill the empty locations in the convolution window.  In still another approach, the pixel values in a row or column at the edge of the image are duplicated and are used to fill the empty locations in the convolution window”  teaches adding arbitrary pixel values via addition or duplication [expand at least the subset of input data] until the number of empty locations in the convolution window are filled [based on a predetermined factor]),
	wherein the second logic circuit comprises a rotate circuit (col. 1, lines 46-61: “The convolver comprises a plurality of multipliers for multiplying pixel values of a convolution window by corresponding coefficients of a convolution mask to provide products, a summer coupled to the multipliers for summing the products to provide a result, a memory for storing intermediate results and a controller. The controller comprises means for supplying to the multipliers, during an MxM convolution, pixel values of an MxM convolution window and corresponding coefficients of an MxM convolution mask … The controller further comprises means for supplying to the multipliers, during first pass of an N x N convolution, where N is greater than M, a first subset of pixel values of an NxN convolution window and a first subset of corresponding coefficients of an NxN convolution mask …” teaches controller [rotate circuit]) 
and an array structure (col 3, lines 15-22 “A block diagram of an example of a convolver circuit in accordance with the invention is shown in FIG. 1. Pixel values of a convolution window and coefficients of a convolution mask are supplied to inputs of a multiplier unit 10. The multiplier 10 includes array of multipliers for performing M x M multiplications in parallel.  In the example of FIG. 1, the multiplier unit includes 25 multipliers in a 5x5 multiplier array …” teaches 5 x 5 multiplier array [an array structure]). 
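For illustration only, the multi-pass operation described in the quoted passages of Jordan et al. (an N x N convolution performed on a smaller array of multipliers, with intermediate results held in memory) can be sketched as follows. A 7 x 7 window yields 49 products, so an array of 25 multipliers needs two passes. The function name `multipass_convolve`, the `num_multipliers` parameter, and the row-major partition of the coefficients are assumptions made for this sketch and are not taken from the reference.

```python
def multipass_convolve(window, mask, num_multipliers=25):
    """Compute one output pixel using at most num_multipliers products per pass.

    Returns (result, passes). The window and mask are flattened in
    row-major order and split into subsets; each pass multiplies one
    subset and adds the summed products to the stored intermediate
    result. The partition order is a hypothetical choice.
    """
    pixels = [p for row in window for p in row]
    coeffs = [w for row in mask for w in row]
    if len(pixels) != len(coeffs):
        raise ValueError("window and mask must have the same dimensions")
    intermediate = 0   # models the per-pixel intermediate result in memory
    passes = 0
    for start in range(0, len(coeffs), num_multipliers):
        stop = start + num_multipliers
        products = [p * w for p, w in zip(pixels[start:stop], coeffs[start:stop])]
        intermediate += sum(products)   # summer output added to stored value
        passes += 1
    return intermediate, passes


# Usage: a 7 x 7 convolution on a 25-multiplier array takes two passes.
window7 = [[1] * 7 for _ in range(7)]
mask7 = [[2] * 7 for _ in range(7)]
result, passes = multipass_convolve(window7, mask7)
```

The accumulated value after the final pass equals the single-pass weighted sum over all 49 locations, so the partition affects only the number of passes and not the result.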
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
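A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.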

Claims 2-3 and 16-17 are rejected under 35 U.S.C. 103 as being unpatentable over Jordan et al.  (US 5,949,920 A) in view of Tan et al. (US 2002/0067417 A1).
Regarding Claim 2,
	Jordan et al. teaches the neural network processor of claim 1.
	Jordan et al. does not appear to explicitly teach wherein the first logic circuit comprises a finite state machine configured to store the data elements corresponding to the at least the subset of the input data into each of the Q logical memories.
	Tan et al. teaches wherein the first logic circuit comprises a finite state machine configured to store the data elements corresponding to the at least the subset of the input data into the each of the Q logical memories (paragraph 0031, “The imaging device 110 further includes a controller 118. The controller 118, which also may be implemented as a finite state machine, operates the imaging device 110 in image capture and readout modes of operation. An image frame is captured by storing code words in pixel memory 16 during the image capture mode and then reading out the pixel memory 16 during the readout mode” teaches a controller of an image device [finite state machine] configured to store code words from image frame [the at least the subset of the input data] in pixel memory [Q logical memories]).
	Jordan et al. and Tan et al. are considered analogous art because they are directed to efficient methods of managing pixel data in digital imaging applications. 
In view of the teachings of Jordan et al., it would have been obvious for a person of ordinary skill in the art to apply the teachings of Tan et al. at the time the application was filed in order to integrate random access memory with complementary metal oxide semiconductor (CMOS) sensors (cf. Tan et al., paragraphs 0008-0009, “[0008] Another problem is converting the CMOS sensor signals to digital prior to storage in the RAM. The amount of silicon area taken up by A/D converters can be quite substantial. Integrating conventional A/D converters with each pixel would greatly increase the size and cost of the sensor chip, especially if high resolution is desired. Reducing the number of A/D converters by, for example, using only one A/D converter per row of pixels would reduce chip size and cost; however, it could create fixed pattern noise in the image. Human eyes are very sensitive to detecting fixed pattern noise. [0009] These problems are overcome by the present invention.”). The Examiner notes that a person of ordinary skill in the art would find a suggestion to perform this type of analysis since Jordan et al. discloses this as a necessary activity for the taught invention (cf. Jordan et al., col. 1, lines 11-40,  “Convolutions are used in image processing to perform low-pass filtering (blurring), high-pass filtering (sharpening), edge detection, edge enhancement and other functions … It is desirable to provide a convolver circuit which performs convolutions at high speed, which can perform convolutions with different convolution window dimensions and which is relatively inexpensive.”).
Regarding Claim 3,
	Jordan et al. in view of Tan et al. teaches the neural network processor of claim 2. 
	Jordan et al. does not appear to explicitly teach wherein the each of the Q logical memories comprises a random-access memory. 
Tan et al. teaches wherein the each of the Q logical memories comprises a random-access memory (paragraph 0031, “The imaging device 110 further includes a controller 118. The controller 118, which also may be implemented as a finite state machine, operates the imaging device 110 in image capture and readout modes of operation. An image frame is captured by storing code words in pixel memory 16 during the image capture mode and then reading out the pixel memory 16 during the readout mode” teaches a controller of an image device configured to store code words from image frame in pixel memory [Q logical memories]; paragraph 0041, “The pixel memory 16 may be ferroelectric random access memory …” teaches the pixel memory [each of the  logical Q memories] comprising ferroelectric random access memory [comprising random access memory]).
Jordan et al. and Tan et al. are combinable for the same rationale as set forth above with respect to claim 2.
Regarding Claim 16,
	Jordan et al. teaches the neural network processor of claim 15.
	Jordan et al. does not appear to explicitly teach wherein the first logic circuit comprises a finite state machine configured to store the data elements corresponding to the subset of the input data into each of the Q logical memories.
	Tan et al. teaches wherein the first logic circuit comprises a finite state machine configured to store the data elements corresponding to the subset of the input data into the each of the Q logical memories (paragraph 0031, “The imaging device 110 further includes a controller 118. The controller 118, which also may be implemented as a finite state machine, operates the imaging device 110 in image capture and readout modes of operation. An image frame is captured by storing code words in pixel memory 16 during the image capture mode and then reading out the pixel memory 16 during the readout mode” teaches a controller of an image device [finite state machine] configured to store code words from image frame [the at least the subset of the input data] in pixel memory [Q logical memories]).
Jordan et al. and Tan et al. are combinable for the same rationale as set forth above with respect to claim 2.
Regarding Claim 17,
	Jordan et al. in view of Tan et al. teaches the neural network processor of claim 16. 
	Jordan et al. does not appear to explicitly teach wherein the each of the Q logical memories comprises a random-access memory. 
	Tan et al. teaches wherein the each of the Q logical memories comprises a random-access memory (paragraph 0031, “The imaging device 110 further includes a controller 118. The controller 118, which also may be implemented as a finite state machine, operates the imaging device 110 in image capture and readout modes of operation. An image frame is captured by storing code words in pixel memory 16 during the image capture mode and then reading out the pixel memory 16 during the readout mode” teaches a controller of an image device configured to store code words from image frame in pixel memory [Q logical memories]; paragraph 0041, “The pixel memory 16 may be ferroelectric random access memory …” teaches the pixel memory [each of the Q logical memories] comprising ferroelectric random access memory [comprising random access memory]).
Jordan et al. and Tan et al. are combinable for the same rationale as set forth above with respect to claim 2.
Claims 5 and 12 are rejected under 35 U.S.C. 103 as being unpatentable over Jordan et al.  (US 5,949,920 A) in view of Yu et al. (“Vector Processing as a Soft-core CPU Accelerator”).
Regarding Claim 5,
	Jordan et al. teaches the neural network processor of claim 1.
	Jordan et al. does not appear to explicitly teach … further comprising a vector register file configured to store expanded data.
	Yu et al. teaches … further comprising a vector register file configured to store expanded data (p. 2, section 2, paragraphs 1 and 2, “Vector processors have traditionally excelled in scientific and engineering applications … The vector processing model operates on vectors of data. Each vector instruction specifies one operation on the entire vector, generating tens of operations on independent data elements and producing tens of results at a time. Data to be operated on is stored in a large vector register file that can hold a moderate number of vector registers, each containing a large number of data elements. Entire vectors can be gathered from main memory to the vector register file through vector load instructions, and scattered to memory through vector store instructions …” teaches vector register file storing large numbers of data elements through vector store instructions [vector register file configured to store expanded data]).
	Jordan et al. and Yu et al. are considered analogous art because they are directed to increasing performance of CPUs for parallel data applications. 
	In view of the teachings of Jordan et al. it would have been obvious for a person of ordinary skill in the art to apply the teachings of Yu et al. at the time the application was filed in order to use vector processing to simplify programming and scale performance (cf. Yu et al., p. 10, section 7, paragraph 3, “A soft-core vector processor is most suitable when rapid development time is required, or when a hardware designer is not available, or when several different applications must share a single accelerator or a single FPGA bitstream. It offers a simple programming model that can be readily understood by software developers with little or no hardware design knowledge. It is also easy to scale performance with little or no change to the software by only modifying a few simple processor parameters. Scaling the number of vector lanes naturally offers both more memory bandwidth (at the register file) and more functional units.”). The Examiner notes that a person of ordinary skill in the art would find a suggestion to perform this type of analysis since Jordan et al. discloses this as a necessary activity for the taught invention (cf. Jordan et al., col. 1, lines 11-40,  “Convolutions are used in image processing to perform low-pass filtering (blurring), high-pass filtering (sharpening), edge detection, edge enhancement and other functions … It is desirable to provide a convolver circuit which performs convolutions at high speed, which can perform convolutions with different convolution window dimensions and which is relatively inexpensive.”).
Regarding Claim 12,
	Jordan et al. teaches the method of claim 8.
	Jordan et al. does not appear to explicitly teach … further comprising storing expanded data into a vector register file corresponding to the neural network processor.
	Yu et al. teaches … further comprising storing expanded data into a vector register file corresponding to the neural network processor (p. 2, section 2, paragraphs 1 and 2, “Vector processors have traditionally excelled in scientific and engineering applications … The vector processing model operates on vectors of data. Each vector instruction specifies one operation on the entire vector, generating tens of operations on independent data elements and producing tens of results at a time. Data to be operated on is stored in a large vector register file that can hold a moderate number of vector registers, each containing a large number of data elements. Entire vectors can be gathered from main memory to the vector register file through vector load instructions, and scattered to memory through vector store instructions …” teaches vector register file storing large numbers of data elements through vector store instructions [storing expanded data into a vector register file]).
Jordan et al. and Yu et al. are combinable for the same rationale as set forth above with respect to claim 5.
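For illustration only, the vector processing model described in the cited passage of Yu et al. (vector loads into a register file, one instruction operating on an entire vector, and vector stores back to memory) may be sketched as follows. The class name `VectorRegisterFile`, its method names, and the register and vector sizes are hypothetical and do not appear in Yu et al. or in the claims.

```python
class VectorRegisterFile:
    """Toy model of a vector register file (hypothetical names and sizes)."""

    def __init__(self, num_registers, vector_length):
        self.vector_length = vector_length
        # Each vector register holds vector_length data elements.
        self.regs = [[0] * vector_length for _ in range(num_registers)]

    def vload(self, reg, memory, base):
        # Vector load: copy a run of elements from memory into one register.
        self.regs[reg] = list(memory[base:base + self.vector_length])

    def vadd(self, dst, a, b):
        # One vector instruction applies the same operation to every element.
        self.regs[dst] = [x + y for x, y in zip(self.regs[a], self.regs[b])]

    def vstore(self, reg, memory, base):
        # Vector store: write the register contents back to memory.
        memory[base:base + self.vector_length] = self.regs[reg]


# Hypothetical usage: add two 4-element vectors held in a flat memory.
memory = list(range(8)) + [0] * 4
vrf = VectorRegisterFile(num_registers=4, vector_length=4)
vrf.vload(0, memory, 0)
vrf.vload(1, memory, 4)
vrf.vadd(2, 0, 1)
vrf.vstore(2, memory, 8)
```

The sketch models only the data movement named in the quotation; it does not model vector lanes, timing, or any particular instruction set.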
Claims 6-7, 13-14, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Jordan et al.  (US 5,949,920 A) in view of Sprangle et al. (US 2013/0232318 A1). 
Regarding Claim 6,
	Jordan et al. teaches the neural network processor of claim 1.
	Jordan et al. does not appear to explicitly teach … further configured to receive the input data via a PCI express bus. 
	Sprangle et al. teaches … further configured to receive the input data via a PCI express bus (Figure 1 and paragraph 0017-0018, “[0017] In the embodiment of FIG. 1, processor 22 resides on an expansion module 300 (e.g., an adapter card) that communicates with processing unit 24 via a peripheral component interconnect (PCI) express (PCIe) interface … [0018] In the embodiment of FIG. 1, processing core 31 includes one or more register files 150. Register files 150 include various vector registers (e.g., vector register V1, vector register V2, ... , vector register V)”  teaches vector register files [vector data memory] within processor configured to receive vector data via PCI express interface [PCI express bus]).
	Jordan et al. and Sprangle et al. are considered analogous art because they are directed to increasing performance of CPUs for parallel data applications.
	In view of the teachings of Jordan et al. it would have been obvious for a person of ordinary skill in the art to apply the teachings of Sprangle et al. at the time the application was filed in order to enable parallel processing in one processor core on each clock cycle, therefore increasing efficiency of the processor core circuitry  (cf. Sprangle et al., paragraph 0071, “In one embodiment, processing core 31 can execute 5 pipelines in parallel, with each pipeline having 5 stages. Processing may proceed from each stage to the next on each clock cycle or tick in processing core 31. Consequently, processing core 31 can efficiently use the circuitry for each stage, for instance by fetching the next instruction as soon as the current instruction moves from fetch stage 120 to read stage 122. In other embodiments, processing cores may use fewer pipelines or more pipelines, and the pipelines may use fewer stages or more stages.”). The Examiner notes that a person of ordinary skill in the art would find a suggestion to perform this type of analysis since Jordan et al. discloses this as a necessary activity for the taught invention (cf. Jordan et al., col. 1, lines 11-40,  “Convolutions are used in image processing to perform low-pass filtering (blurring), high-pass filtering (sharpening), edge detection, edge enhancement and other functions … It is desirable to provide a convolver circuit which performs convolutions at high speed, which can perform convolutions with different convolution window dimensions and which is relatively inexpensive.”).
Regarding Claim 7,
	Jordan et al. teaches the neural network processor of claim 1.
	Jordan et al. does not appear to explicitly teach … further configured to receive the input data from a vector data memory, wherein the vector data memory is configured to receive the input data via a PCI express bus.
	Sprangle et al. teaches … further configured to receive the input data from a vector data memory, wherein the vector data memory is configured to receive the input data via a PCI express bus (Figure 1 and paragraph 0017-0018, “[0017] In the embodiment of FIG. 1, processor 22 resides on an expansion module 300 (e.g., an adapter card) that communicates with processing unit 24 via a peripheral component interconnect (PCI) express (PCIe) interface … [0018] In the embodiment of FIG. 1, processing core 31 includes one or more register files 150. Register files 150 include various vector registers (e.g., vector register V1, vector register V2, … , vector register V)”  teaches vector register files [vector data memory] within processor configured to receive vector data via PCI express interface [PCI express bus]).
Jordan et al. and Sprangle et al. are combinable for the same rationale as set forth above with respect to claim 6.
Regarding Claim 13,
	Jordan et al. teaches the method of claim 8.
	Jordan et al. does not appear to explicitly teach … further comprising receiving the input data via a PCI express bus. 
	Sprangle et al. teaches … further comprising receiving the input data via a PCI express bus (Figure 1 and paragraph 0017-0018, “[0017] In the embodiment of FIG. 1, processor 22 resides on an expansion module 300 (e.g., an adapter card) that communicates with processing unit 24 via a peripheral component interconnect (PCI) express (PCIe) interface … [0018] In the embodiment of FIG. 1, processing core 31 includes one or more register files 150. Register files 150 include various vector registers (e.g., vector register V1, vector register V2, ... , vector register V)”  teaches vector register files [vector data memory] within processor configured to receive vector data via PCI express interface [PCI express bus]).
Jordan et al. and Sprangle et al. are combinable for the same rationale as set forth above with respect to claim 6.
Regarding Claim 14,
	Jordan et al. teaches the method of claim 8.
	Jordan et al. does not appear to explicitly teach … further comprising receiving the input data from a vector data memory, wherein the vector data memory is configured to receive the input data via a PCI express bus.
	Sprangle et al. teaches … further comprising receiving the input data from a vector data memory, wherein the vector data memory is configured to receive the input data via a PCI express bus (Figure 1 and paragraph 0017-0018, “[0017] In the embodiment of FIG. 1, processor 22 resides on an expansion module 300 (e.g., an adapter card) that communicates with processing unit 24 via a peripheral component interconnect (PCI) express (PCIe) interface … [0018] In the embodiment of FIG. 1, processing core 31 includes one or more register files 150. Register files 150 include various vector registers (e.g., vector register V1, vector register V2, ... , vector register V)”  teaches vector register files [vector data memory] within processor configured to receive vector data via PCI express interface [PCI express bus]).
Jordan et al. and Sprangle et al. are combinable for the same rationale as set forth above with respect to claim 6.
Regarding Claim 20,
	Jordan et al. teaches the neural network processor of claim 15.
	Jordan et al. does not appear to explicitly teach … further configured to receive the input data via a PCI express bus. 
	Sprangle et al. teaches … further configured to receive the input data via a PCI express bus (Figure 1 and paragraph 0017-0018, “[0017] In the embodiment of FIG. 1, processor 22 resides on an expansion module 300 (e.g., an adapter card) that communicates with processing unit 24 via a peripheral component interconnect (PCI) express (PCIe) interface … [0018] In the embodiment of FIG. 1, processing core 31 includes one or more register files 150. Register files 150 include various vector registers (e.g., vector register V1, vector register V2, ... , vector register V)”  teaches expansion module [windows expander circuit] configured to receive vector data via PCI express interface [PCI express bus]).
Jordan et al. and Sprangle et al. are combinable for the same rationale as set forth above with respect to claim 6.
Claims 4, 9, 11, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Jordan et al.  (US 5,949,920 A) in view of Abedallah Ali et al.  (“A Generic Pixel Distribution Architecture for Parallel Video Processing”). 
Regarding Claim 4,
Jordan et al. teaches the neural network processor of claim 1. 
Jordan et al. does not appear to explicitly teach wherein the second logic circuit comprises a rotate circuit and an array structure. 
Abedallah Ali et al. teaches wherein the second logic circuit comprises a rotate circuit and an array structure (p. 4, Figure 3 teaches the second logic circuit of a pixel distributor comprising the circular vertical shifter [rotate circuit] and the horizontal shift register [array structure]).
Jordan et al. and Abedallah Ali et al. are considered analogous art because they are directed to increasing performance of CPUs for parallel data applications.
	In view of the teachings of Jordan et al. it would have been obvious for a person of ordinary skill in the art to apply the teachings of Abedallah Ali et al. at the time the application was filed in order to provide a generic model for pixel distribution dedicated for streaming video applications that offers a parallel hardware architecture, thus leading to lower design complexity and higher development productivity (cf. Abedallah Ali et al., p.2, section I, paragraph 4, “ … we propose a generic hardware model to implement a flexible pixel distributor that can be …”). The Examiner notes that a person of ordinary skill in the art would find a suggestion to perform this type of analysis since Jordan et al. discloses this as a necessary activity for the taught invention (cf. Jordan et al., col. 1, lines 11-40,  “Convolutions are used in image processing to perform low-pass filtering (blurring), high-pass filtering (sharpening), edge detection, edge enhancement and other functions … It is desirable to provide a convolver circuit which performs convolutions at high speed, which can perform convolutions with different convolution window dimensions and which is relatively inexpensive.”).
Regarding Claim 9,
	Jordan et al. teaches the method of claim 8.
Jordan et al. does not appear to explicitly teach wherein the storing and the shifting steps are performed using a window expander circuit comprising a first logic circuit, wherein the first logic circuit comprises a finite state machine configured to store the data elements corresponding to the at least the subset of the input data into the each of the Q logical memories.
Abedallah Ali et al. teaches wherein the storing and the shifting steps are performed using a window expander circuit comprising a first logic circuit (p. 4, Figure 3 teaches a pixel distributor [window expander circuit] using a controller to perform storing steps into line buffers [Q data memories] and shifting steps using a horizontal shift register and circular vertical shifter; 
p. 5, section IV(B), paragraph 2 “ The pixel distributor consists of the following internal blocks: (i) the line buffers for storing the input pixels, (ii) the circular vertical shifter for shifting the pixels circularly in the vertical direction, while (iii) the horizontal shift register for shifting the pixels horizontally, (iv) the controller for asserting the required control signals according to the current state of the system; for example, the controller asserts wr_en_buff signal to enable writing in one of the line buffers at a specified address wr_addr, while it loads rd-_addr for read operations; the controller assigns sof, valid and ver_shifting signals for indicating the start of the frame, the presence of a valid macroblock or for shifting the pixels vertically.”  and  p. 5, section IV(C), paragraphs 1-2 teaches the pixel distributor [window expander circuit] comprising a controller [first logic circuit] and the controller comprising a finite state machine).
Jordan et al. and Abedallah Ali et al. are combinable for the same rationale as set forth above with respect to claim 4.
Regarding Claim 11,
	Jordan et al. in view of Abedallah Ali et al. teaches the method of claim 9.
Abedallah Ali et al. further teaches wherein the window expander circuit comprises a … (p. 4, Figure 3 teaches the circular vertical shifter [rotate circuit] coupled between the line buffers [Q logical memories] and the horizontal shift register [array structure]).
Jordan et al. and Abedallah Ali et al. are combinable for the same rationale as set forth above with respect to claim 4.
Regarding Claim 18,
	Jordan et al. teaches the neural network processor of claim 15.
Jordan et al. does not appear to explicitly teach wherein the rotate circuit is configured to selectively rotate the at least the subset of the input data before providing the at least the subset of the input data to the array structure.
Abedallah Ali et al. teaches wherein the rotate circuit is configured to selectively rotate the at least the subset of the input data before providing the at least the subset of the input data to the array structure (p. 4, Figure 3 teaches the circular vertical shifter [rotate circuit] coupled between the line buffers  and the horizontal shift register [array structure];  p. 5, section IV (B), paragraphs 3-4 “A column of pixels is passed to the circular vertical shifter as soon as, its last pixel was written to the line buffers. The horizontal shift register shifts each pixel horizontally so that after hor_slide shifts for the first pixel of the macro-block (i.e. pixel<1>), the valid signal is asserted to indicate the presence of a macro-block at the output ports of the pixel distributor.
… the line of index V+1 will be stored in the first line buffer. If ver_slide < V, then the line V+1 will have some order in the macro-block rather than being the first line. In this case, the output of the line buffers are needed to be shifted vertically in a circular way to put back the lines of the macro-block in their correct order. Every V lines, the signal ver_shifting is asserted ver_slide times” teaches the circular vertical shifter [rotate circuit] selectively shifting data from line buffers vertically in a circular fashion [selectively rotate the at least the subset of the input data] prior to being received by horizontal shift register [array structure]).
Jordan et al. and Abedallah Ali et al. are combinable for the same rationale as set forth above with respect to claim 4.
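For illustration only, the circular vertical shifting described in the cited passage of Abedallah Ali et al. may be sketched as follows, under the assumption that incoming lines are written to V line buffers in round-robin order so that line V+1 overwrites the first buffer. The function names `write_line` and `restore_line_order`, and the choice of V = 3, are hypothetical and do not appear in the reference.

```python
from collections import deque


def write_line(line_buffers, line_index, line):
    """Store an incoming line in slot (line_index mod V), overwriting the oldest line."""
    line_buffers[line_index % len(line_buffers)] = line


def restore_line_order(line_buffers, newest_index):
    """Circularly shift the buffer outputs so the oldest stored line comes first."""
    v = len(line_buffers)
    oldest_slot = (newest_index + 1) % v  # slot holding the oldest of the V stored lines
    ordered = deque(line_buffers)
    ordered.rotate(-oldest_slot)  # rotate left so the oldest line lands at position 0
    return list(ordered)


# Hypothetical example: V = 3 line buffers, five image lines written in sequence.
buffers = [None] * 3
for i in range(5):
    write_line(buffers, i, f"line{i}")
window = restore_line_order(buffers, newest_index=4)
```

After the five writes the physical buffers hold the lines out of order, and the rotation returns them in image order. This corresponds to the "put back the lines of the macro-block in their correct order" step of the quotation; the horizontal shift register is not modeled.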
Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Jordan et al.  (US 5,949,920 A) in view of Abedallah Ali et al.  (“A Generic Pixel Distribution Architecture for Parallel Video Processing”) and in further view of Tan et al. (US 2002/0067417 A1).
Regarding Claim 10,
	Jordan et al. in view of Abedallah Ali et al. teaches the method of claim 9.
	Jordan et al. in view of Abedallah Ali et al. does not appear to explicitly teach wherein each of the Q logical memories comprises a random-access memory.
	Tan et al. teaches wherein the each of the Q logical memories comprises a random-access memory (paragraph 0031, “The imaging device 110 further includes a controller 118. The controller 118, which also may be implemented as a finite state machine, operates the imaging device 110 in image capture and readout modes of operation. An image frame is captured by storing code words in pixel memory 16 during the image capture mode and then reading out the pixel memory 16 during the readout mode” teaches a controller of an image device configured to store code words from image frame in pixel memory [Q logical memories]; paragraph 0041, “The pixel memory 16 may be ferroelectric random access memory …” teaches the pixel memory [each of the Q logical memories] comprising ferroelectric random access memory [comprising random access memory]).
Jordan et al., Abedallah Ali et al., and Tan et al. are considered analogous art because they are directed to efficient methods of managing pixel data in digital imaging applications. 
In view of the teachings of Jordan et al. in view of Abedallah Ali et al.  it would have been obvious for a person of ordinary skill in the art to apply the teachings of Tan et al. at the time the application was filed in order to integrate random access memory with complementary metal oxide semiconductor (CMOS) imaging devices, thus boosting the signal strength of the CMOS pixel sensor and reducing fixed pattern noise in the images (cf. Tan et al., paragraphs 0007-0009, “[0007] It would be desirable to integrate random access memory ("RAM") onto the sensor chip. However, there are certain problems to overcome. One problem is voltage sag. A typical CMOS active pixel sensor includes multiple stages of charge amplifiers. Each stage reduces strength of the sensor signal. 
 [0008] Another problem is converting the CMOS sensor signals to digital prior to storage in the RAM. The amount of silicon area taken up by A/D converters can be quite substantial. Integrating conventional A/D converters with each pixel would greatly increase the size and cost of the sensor chip, especially if high resolution is desired. Reducing the number of A/D converters by, for example, using only one A/D converter per row of pixels would reduce chip size and cost; however, it could create fixed pattern noise in the image. Human eyes are very sensitive to detecting fixed pattern noise. …”). The Examiner notes that a person of ordinary skill in the art would find a suggestion to perform this type of analysis since Jordan et al. discloses this as a necessary activity for the taught invention (cf. Jordan et al., col. 1, lines 11-40,  “Convolutions are used in image processing to perform low-pass filtering (blurring), high-pass filtering (sharpening), edge detection, edge enhancement and other functions … It is desirable to provide a convolver circuit which performs convolutions at high speed, which can perform convolutions with different convolution window dimensions and which is relatively inexpensive.”).
Claim 19 is rejected under 35 U.S.C. 103 as being unpatentable over Jordan et al.  (US 5,949,920 A) in view of Abedallah Ali et al.  (“A Generic Pixel Distribution Architecture for Parallel Video Processing”) and in further view of Lu et al. (“Evaluating Fast Algorithms for Convolutional Neural Networks on FPGAs”).
Regarding Claim 19,
	Jordan et al. in view of Abedallah Ali et al. teaches the neural network processor of claim 18. 
Jordan et al. in view of Abedallah Ali et al. does not appear to teach wherein an extent of a rotation of the at least the subset of the input data is determined based on a stride associated with the convolution operations.
Lu et al. teaches wherein an extent of a rotation of the at least the subset of the input data is determined based on a stride associated with the convolution operations (p. 103, section III, paragraph 1, “ … we propose a FPGA accelerator design for CNNs based on two-dimensional Winograd algorithm …” and p. 104, section III(B), paragraph 2 “ To reuse the data, we store n + m input lines in on-chip memory in total and rotate the lines as a circular buffer. More clearly, initially, Winograd engines will read the first n lines from the line buffer directly, meanwhile the next m lines of the line buffer will load data from external memory. The computation of the n lines and the transfer of m lines are done in parallel by employing the double buffer design. Note that the stride between two neighboring tiles in Winograd algorithm is m. Therefore, Winograd PE engines will skip the next m lines and process the following n lines from the line buffer and the skipped m lines will be overwritten by the new load data from the external memory. During this process, if it reaches the bottom of the line buffer, it will rotate to the beginning of the line buffer” teaches timing of reading of data from on-chip memory in a circular fashion [extent of rotation] dependent on timing associated with skipping a number of lines of the memory buffer that read input data directly from external memory [stride associated with convolution operations]).
Jordan et al., Abedallah Ali et al., and Lu et al. are considered analogous art because they are directed to efficient methods of managing pixel data in digital imaging applications. 
In view of the teachings of Jordan et al. in view of Abedallah Ali et al.  it would have been obvious for a person of ordinary skill in the art to apply the teachings of Lu et al. at the time the application was filed in order to provide an architecture for efficient implementation of CNNs using Winograd algorithm on FPGAs  (cf. Lu et al., p. 102, section II (B) paragraph 1, “The trends of CNNs are moving towards deeper topologies with small filters. The conventional convolution algorithm is general, but less efficient. As an alternative, convolution can be implemented more efficiently using Winograd minimal filtering algorithm”). The Examiner notes that a person of ordinary skill in the art would find a suggestion to perform this type of analysis since Jordan et al. discloses this as a necessary activity for the taught invention (cf. Jordan et al., col. 1, lines 11-40,  “Convolutions are used in image processing to perform low-pass filtering (blurring), high-pass filtering (sharpening), edge detection, edge enhancement and other functions … It is desirable to provide a convolver circuit which performs convolutions at high speed, which can perform convolutions with different convolution window dimensions and which is relatively inexpensive.”).
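For illustration only, one reading of the circular line buffer described in the cited passage of Lu et al. may be sketched as follows: n + m lines are held on chip, each tile reads n consecutive lines, successive tiles start m lines apart (the stride), and indexing wraps to the beginning of the buffer when the bottom is reached. The function name `tile_slots` and the example values n = 4 and m = 2 are hypothetical and are not taken from the reference.

```python
def tile_slots(tile_index, n, m):
    """Return the line-buffer slots read for a given tile.

    Assumes a buffer of n + m lines, a read window of n lines per tile,
    and a stride of m lines between neighboring tiles.
    """
    depth = n + m
    start = (tile_index * m) % depth  # starting slot advances by the stride, with wraparound
    return [(start + i) % depth for i in range(n)]


# Hypothetical example values: n = 4 lines per tile, stride m = 2.
slots_per_tile = [tile_slots(t, n=4, m=2) for t in range(4)]
```

In this sketch the starting slot of each tile, and therefore how far the read position has rotated through the buffer, is a function of the stride m alone for a given buffer depth. The parallel loading of the next m lines from external memory is not modeled.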
Prior Art
The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure:  Stitt et al. (“A Parallel Sliding-Window Generator for High-Performance Digital-Signal Processing on FPGAs”) teaches a parallel sliding-window generator capable of producing a number of parallel windows equal to the number of inputs provided every cycle.
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to CHIAKA CHUKWUMA OKOROH whose telephone number is (571)272-3710.  The examiner can normally be reached on M - F 7:30 AM - 4:30 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kamran Afshar, can be reached on 571-272-7796.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.

/CHIAKA CHUKWUMA OKOROH/Examiner, Art Unit 2125                                            
/MICHAEL J HUNTLEY/Primary Examiner, Art Unit 2116