DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Objections
Claim 9 is objected to because of the following informality: a period is missing at the end of the claim. Appropriate correction is required.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1-2 and 8-9 are rejected under 35 U.S.C. 103 as being unpatentable over Blanc et al (US 9,164,807) in view of Meixner (US PG PUB 2016/0,313,984).
Regarding Claim 1, Blanc et al teach a general-purpose graphics processing unit (chip 30; Figs 4, 5, Claim 1 and col 12, ln 34 – col 13, ln 28, col 15, ln 51-66) comprising: a processing array (processing units) including multiple compute blocks (computational unit cluster CI0 – CI15), each compute block including multiple processing clusters (each cluster contains multiple programmable processor elements (PE) 40, 41, 42, 43); a memory module (Dynamic random-access memory (DRAM) 34); and a fabric interface (cluster manager 60) to facilitate data exchange with an external processor over a communication fabric, the fabric interface to transfer data stored in the memory module to an external (data exchange with the direct memory access (DMA) interface 61 of the cluster manager 60 to the local network on chip (NoC) 31 to transfer data from PE/DRAM controllers 32 and 33 to the DRAM 34).
Blanc et al does not teach to enable computation operations on a halo region, the compute operations on the halo region having a dependency on data stored in the memory module.  
	Meixner is analogous art pertinent to the technology addressed in this application including a general-purpose graphics processing unit (image processor can be characterized as a graphics processing unit; ¶ [0185]) to enable computation operations on a halo region (process pixel values outside two dimensional shift register array 1006 into the halo region 1009; Fig 10 and ¶ [0121]-[0124]), the compute operations on the halo region having a dependency on data stored in the memory module (data processed in halo region stored in random access memory 1007).
It would have been obvious to one of ordinary skill in the art to combine the teachings of Blanc et al with Meixner before the effective filing date including to enable computation operations on a halo region, the compute operations on the halo region having a dependency on data stored in the memory module.  Use of a halo region allows for “spill-over” space for data analyzed, thereby making computing more efficient and accurate, as recognized by Meixner (¶ [0124]). 
Regarding Claim 2, Blanc et al in combination with Meixner teach the general-purpose graphics processing unit as in claim 1 (as described above), and Blanc et al further teaches an instruction fetch unit to fetch an instruction for processing via the processing array (fetch mechanism used for preparing data locally for the PE; col 23, ln 36 - 64) and a scheduler unit to schedule the instruction for processing via the processing array (task manager within cluster manager can have programmed instructions for processing; col 27, ln 28 – 42).  
Regarding Claim 8, Blanc et al in combination with Meixner teach the general-purpose graphics processing unit as in claim 1 (as described above), and Blanc et al further teaches wherein the external (controllers 32 and 33 are processor elements (PEs); col. 15, ln 51-66).  
Regarding Claim 9, Blanc et al in combination with Meixner teach the general-purpose graphics processing unit as in claim 1 (as described above), wherein Blanc et al further teaches a cache memory and logic configurable to monitor a cache line within the cache memory (each computational unit cluster CI0 – CI15 comprises local memory with one or more memory blocks 44-59; Figs 4, 5 and col 13, ln 7-28, col 14, ln 4-15, col 16, ln 51-66), the cache memory to cache data stored in the memory module, the cache line associated with a memory address within the memory module (memory blocks 44-59 are managed by a memory manager 63 and data from tasks can be transferred to DRAM through the NoC; col 28, ln 47-52), and Meixner further teaches the memory address is to store data associated with the dependency of the compute operation on the halo region (data from halo region 1009 stored in RAM 1007 is associated with the vertical and horizontal position of the execution lane array 1005; ¶ [0121]-[0124]).
Claims 3-7 are rejected under 35 U.S.C. 103 as being unpatentable over Blanc et al (US 9,164,807) in view of Meixner (US PG PUB 2016/0,313,984) and in further view of Shacham et al (US PG PUB 2018/0005,074).
Regarding Claim 3, Blanc et al in combination with Meixner teach the general-purpose graphics processing unit as in claim 2 (as described above), and Blanc et al teaches the scheduler unit to schedule a set of instructions to the processing array (task manager 62 within cluster manager 60 can have programmed instructions for processing; col 27, ln 28 – 42) and Meixner teaches computation operations on a halo region (process pixel values outside two dimensional shift register array 1006 into the halo region 1009; Fig 10 and ¶ [0121]-[0124]).
Blanc et al in combination with Meixner does not teach the processing array to accelerate distributed training of a neural network, the instructions to cause the processing array to: perform a parallel convolution operation on a partition of a feature map associated with a neural network to generate intermediate data; storing the intermediate data within the memory module; and transferring the intermediate data via the fabric interface, the intermediate data including a dependency of the compute operation on the halo region.  
Shacham et al is analogous art pertinent to the technology addressed in this application including to accelerate distributed training of a neural network (CNN with multiple 3D convolutions with N planes and Mth depth will accelerate distribution of computation; ¶ [0074]-[0077] and Figs 7,8), the instructions to cause the processing array to: perform a parallel convolution operation on a partition of a feature map associated with a neural network to generate intermediate data (convolution is performed on multiple planes simultaneously to generate intermediate resultant planes; ¶ [0078]-[0092] and Figs 9a, 9b); storing the intermediate data within the memory module (intermediate data is stored within the stencil processor unit); and transferring the intermediate data via the fabric interface (data is fed into stencil processor’s RAM before processing next sheet), the intermediate data including a dependency of the compute operation on the halo region (each stencil position includes a “halo” region during computation and stencil position slides during operation towards generation of the intermediate plane resultant value).  
It would have been obvious to one of ordinary skill in the art to combine the teachings of Blanc et al and Meixner with Shacham et al before the effective filing date including instructions to cause the processing array to: perform a parallel convolution operation on a partition of a feature map associated with a neural network to generate intermediate data; storing the intermediate data within the memory module; and transferring the intermediate data via the fabric interface, the intermediate data including a dependency of the compute operation on the halo region.  Use of a CNN on an image processor allows for local respective register space and concurrently multiplying within execution lanes, thereby saving processing time, as recognized by Shacham et al (¶ [0006]-[0007]).
Regarding Claim 4, Blanc et al in combination with Meixner and Shacham et al teach the general-purpose graphics processing unit as in claim 3 (as described above), wherein Shacham et al further teaches the processing array further to multi-dimensionally partition the feature map associated with a neural network (CNN layer includes multiple blocks of coefficients; ¶ [0074]-[0077] and Fig 8, 9a).  
Regarding Claim 5, Blanc et al in combination with Meixner and Shacham et al teach the general-purpose graphics processing unit as in claim 4 (as described above), wherein Shacham et al further teaches to multi-dimensionally partition the feature map includes partitioning the feature map in a horizontal and vertical dimension (CNN layer includes multiple blocks of coefficients with N planes in the x and y direction; ¶ [0074]-[0077] and Fig 8, 9a).  
Regarding Claim 6, Blanc et al in combination with Meixner and Shacham et al teach the general-purpose graphics processing unit as in claim 5 (as described above), wherein Shacham et al further teaches to multi-dimensionally partition feature map includes to partition the feature map in a depth or Z dimension (CNN layer includes multiple blocks of coefficients with N planes in the x and y direction with Mth depth; ¶ [0074]-[0077] and Fig 8, 9a).
Regarding Claim 7, Blanc et al in combination with Meixner and Shacham et al teach the general-purpose graphics processing unit as in claim 6 (as described above), wherein Shacham et al further teaches the processing array further to transmit a partition of the feature map to the external processor via the fabric interface (image data can be loaded from the stencil processor RAM to the image processor’s 2D register structure; ¶ [0086]).  
Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Blanc et al (US 9,164,807) in view of Meixner (US PG PUB 2016/0,313,984) and in further view of Mueller (DE 10353532).
Regarding Claim 10, Blanc et al in combination with Meixner teach the general-purpose graphics processing unit as in claim 9 (as described above), wherein Blanc et al teaches a cache memory and logic configurable to monitor a cache line within the cache memory (each computational unit cluster CI0 – CI15 comprises local memory with one or more memory blocks 44-59; Figs 4, 5 and col 13, ln 7-28, col 14, ln 4-15, col 16, ln 51-66) and a fabric interface (cluster manager 60).
Blanc et al in combination with Meixner does not teach to automatically transfer data upon detection of an update to the data within the cache.
	Mueller is analogous art pertinent to the technology addressed in this application including to automatically transfer data upon detection of an update to the data within the cache (new data in cache 21 can be automatically transferred to controller; ¶ [0014]-[0018]). 
It would have been obvious to one of ordinary skill in the art to combine the teachings of Blanc et al and Meixner with Mueller before the effective filing date including to automatically transfer data upon detection of an update to the data within the cache. By automatically transferring data an increased reliability in the processing is created, thereby improving efficiency and processing accuracy, as recognized by Mueller (¶ [0018]). 

Claims 11-14 and 16-17 are rejected under 35 U.S.C. 103 as being unpatentable over Parashar et al (SCNN: Accelerator for Compressed Convolution Neural Networks) in view of Shacham et al (US PG PUB 2018/0005,074).
Regarding Claim 11, Parashar et al teaches a method of transmitting data between multiple compute nodes of a distributed training system for a neural network (tiling strategy to spread work across array of PEs; Processing Element (PE) Architecture and Inter-PE parallelism), the method comprising: multi-dimensionally partitioning data of a feature map into multiple partitions (PE consists of multiple CNN layers); distributing the multiple partitions across multiple nodes of the distributed training system (multiple vector I inputs); performing a distributed parallel convolution operation on the multiple partitions to train weight data of the neural network (parallel convolution during pruning and training with F filter-weight to create array of partial-sum output, performed in tiles across the PEs); and exchanging data between nodes to enable a compute operation for a halo region, the halo region having dependency on data processed by a different node (output halos are incorporated with incomplete partial sums, communicated to neighbor PEs).  
Parashar et al does not teach multi-dimensionally partitioning the data.
Shacham et al is analogous art pertinent to the technology addressed in this application including multi-dimensionally partitioning the data (CNN layer includes multiple blocks of coefficients with N planes in the x and y direction with Mth depth; ¶ [0074]-[0077] and Fig 8, 9a).  
It would have been obvious to one of ordinary skill in the art to combine the teachings of Parashar et al with Shacham et al before the effective filing date including multi-dimensionally partitioning the data. Convolution of a three-dimensional block of data allows for more accurate image analysis and allows for data to be analyzed in an efficient manner, as recognized by Shacham et al (¶ [0073]).
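For readers unfamiliar with the terminology in the Claim 11 limitations, the following is a minimal illustrative sketch, not taken from the application or from any cited reference, of partitioning a feature map in the horizontal and vertical dimensions and convolving each partition with a halo of neighboring data. All function names are hypothetical, the feature map is a plain list of lists, the kernel is assumed to be square with odd size, and copying the halo out of the full map merely stands in for an exchange of data between nodes.

```python
def conv2d_same(fmap, kernel):
    """Reference: zero-padded 'same' 2D correlation on the whole feature map."""
    h, w = len(fmap), len(fmap[0])
    k = len(kernel) // 2  # halo width for an odd, square kernel (assumption)
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0
            for dr in range(-k, k + 1):
                for dc in range(-k, k + 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < h and 0 <= cc < w:
                        acc += fmap[rr][cc] * kernel[dr + k][dc + k]
            out[r][c] = acc
    return out


def tile_with_halo(fmap, r0, r1, c0, c1, k):
    """Copy rows r0..r1-1 and cols c0..c1-1 plus a k-wide halo (zero off-map)."""
    h, w = len(fmap), len(fmap[0])
    return [[fmap[r][c] if 0 <= r < h and 0 <= c < w else 0
             for c in range(c0 - k, c1 + k)]
            for r in range(r0 - k, r1 + k)]


def conv_valid(tile, kernel):
    """'Valid' correlation: the output shrinks by the halo width on each side."""
    n = len(kernel)
    out_h, out_w = len(tile) - n + 1, len(tile[0]) - n + 1
    return [[sum(tile[r + i][c + j] * kernel[i][j]
                 for i in range(n) for j in range(n))
             for c in range(out_w)]
            for r in range(out_h)]


def partitioned_conv(fmap, kernel, row_parts, col_parts):
    """Split the map into row_parts x col_parts tiles and convolve each tile."""
    h, w = len(fmap), len(fmap[0])
    k = len(kernel) // 2
    out = [[0] * w for _ in range(h)]
    row_edges = [h * i // row_parts for i in range(row_parts + 1)]
    col_edges = [w * j // col_parts for j in range(col_parts + 1)]
    for i in range(row_parts):
        for j in range(col_parts):
            r0, r1 = row_edges[i], row_edges[i + 1]
            c0, c1 = col_edges[j], col_edges[j + 1]
            # In a distributed system the halo would arrive from neighbor nodes.
            tile = tile_with_halo(fmap, r0, r1, c0, c1, k)
            local = conv_valid(tile, kernel)
            for r in range(r0, r1):
                out[r][c0:c1] = local[r - r0]
    return out
```

In this sketch, the outputs near a tile boundary are the ones that depend on data held by a different partition, which is the dependency the claim language attributes to the halo region. The partitioned result matches the single-map reference for any tiling.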
Regarding Claim 12, Parashar et al in combination with Shacham et al teaches the method as in claim 11 (as described above), wherein Shacham et al further teaches multi-dimensionally partitioning data of a feature map across multiple nodes includes partitioning the data of the feature map in a horizontal and vertical dimension (CNN layer includes multiple blocks of coefficients with N planes in the x and y direction; ¶ [0074]-[0077] and Fig 8, 9a).  
Regarding Claim 13, Parashar et al in combination with Shacham et al teaches the method as in claim 12 (as described above), wherein Shacham et al further teaches multi-dimensionally partitioning data of the feature map across multiple nodes additionally includes partitioning data of the feature map in a depth or Z dimension (CNN layer includes multiple blocks of coefficients with N planes in the x and y direction with Mth depth; ¶ [0074]-[0077] and Fig 8, 9a).  
Regarding Claim 14, Parashar et al in combination with Shacham et al teaches the method as in claim 11 (as described above), wherein Shacham et al further teaches the distributed parallel convolution operation is a data parallel convolution operation or a hybrid parallel convolution operation (convolution is performed on multiple planes simultaneously and in parallel planes; ¶ [0078]-[0092] and Figs 9a, 9b).  
Regarding Claim 16, Parashar et al in combination with Shacham et al teaches the method as in claim 14 (as described above), wherein Shacham et al further teaches performing compute operations on the non-halo region before performing compute operations on the halo region (computation is performed in the stencil region before stencil is shifted to the “halo” region and computed;  ¶ [0078]-[0092] and Figs 9a, 9b) and exchanging data between nodes while performing compute operations on the non-halo region node (stencil position slides during operation towards generation of the intermediate plane resultant value thereby exchanging data between nodes).  
Regarding Claim 17, Parashar et al in combination with Shacham et al teaches the method as in claim 16 (as described above), wherein Shacham et al further teaches interleaving compute and communication operations while performing compute operations on the non-halo region (processing is performed simultaneously to generate convolution results; ¶ [0093]).  
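The ordering recited in Claims 16 and 17 can be pictured with a small sketch. It is illustrative only and is not drawn from the application or the cited art: the hypothetical function `split_regions` assumes a rectangular tile whose every edge borders a neighboring node and a halo width `k`, and it separates output positions that need only local data (the non-halo region) from those that depend on neighbor data (the halo region).

```python
def split_regions(rows, cols, k):
    """Return (non_halo, halo) output positions for a rows x cols tile.

    Assumption: each edge of the tile borders another node, so any output
    within k cells of an edge depends on data held by that neighbor.
    """
    non_halo, halo = [], []
    for r in range(rows):
        for c in range(cols):
            local_only = k <= r < rows - k and k <= c < cols - k
            (non_halo if local_only else halo).append((r, c))
    return non_halo, halo


# A node could start on the non-halo positions immediately, while the
# exchange for the halo positions is still in flight.
non_halo, halo = split_regions(4, 4, 1)
```

Under these assumptions, work on the non-halo positions can proceed while data for the halo positions is being exchanged, which is the overlap of compute and communication that the claims describe.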

Allowable Subject Matter
Claim 15 is objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Regarding Claim 15, the prior art fails to teach, disclose or suggest: exchanging data between nodes before performing compute operations on the halo region and performing compute operations on the halo region before performing compute operations on a non-halo region.  


Claims 18-20 are allowed.
Regarding Claim 18, the prior art fails to disclose, teach, or suggest the following elements taken as a whole: 
identify a memory address within the memory module associated with a halo region of a layer of a neural network, wherein the halo region is a region of the layer of the neural network in which compute operations have a dependency on data associated with multiple different nodes of the distributed training operation; determine a cache line within the general-purpose graphics processor associated with the memory address; monitor the cache line for updates to the data associated with the halo region; and automatically transfer the updates to the data associated with the halo region to a node of the distributed training operation having a dependency upon the data.  
It is the claim, taken as a whole, including the interrelationships and interconnections between the various elements claimed, that makes it allowable over the prior art of record.
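As a reading aid for the Claim 18 elements quoted above, the following is a software analogy only, and should not be taken as the applicant's hardware implementation or as anything disclosed in the prior art of record. The class `MonitoredCache`, the 64-byte line size, and the callback standing in for a transfer to a dependent node are all assumptions made for illustration.

```python
LINE_SIZE = 64  # assumed cache-line size in bytes


class MonitoredCache:
    """Toy model: map an address to a cache line, watch it, push updates."""

    def __init__(self):
        self.lines = {}     # line index -> {byte offset: value}
        self.watchers = {}  # line index -> callbacks for dependent nodes

    @staticmethod
    def line_of(address):
        """Determine the cache line associated with a memory address."""
        return address // LINE_SIZE

    def watch(self, address, send):
        """Monitor the line holding `address`; `send` models the transfer."""
        self.watchers.setdefault(self.line_of(address), []).append(send)

    def write(self, address, value):
        """Update the cache and automatically forward to any watcher."""
        line = self.line_of(address)
        self.lines.setdefault(line, {})[address % LINE_SIZE] = value
        for send in self.watchers.get(line, []):
            send(address, value)


received = []
cache = MonitoredCache()
cache.watch(0x1000, lambda addr, val: received.append((addr, val)))
cache.write(0x1008, 7)  # same line as 0x1000, so the update is forwarded
cache.write(0x2000, 9)  # different line, so nothing is forwarded
```

The sketch mirrors the four recited steps in order: identify the address, determine its cache line, monitor that line, and transfer updates without a separate request from the dependent node.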

Claims 19-20 are dependent on Claim 18 and are, therefore, allowable.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Rajchl et al (DeepCut Segmentation Convolution Neural Networks) teaches a method to obtain pixel segmentation and includes analysis within the halo region using convolution neural networks. 
Xu et al (CN 107463990) teaches an FPGA of a convolution neural network with a parallel acceleration method.
Chetlur et al (US PG PUB 2016/0062,947) teaches a system and method for parallel processing in multi-convolution operations.
Chakradhar et al (US PG PUB 2011/0029,471) teaches a processor apparatus and method for convolutional neural networks for parallel processing.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to KATHLEEN M BROUGHTON whose telephone number is (571)270-7380.  The examiner can normally be reached on Monday-Friday 8:00-5:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Matthew Bella, can be reached on 571-272-7778.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/KATHLEEN M BROUGHTON/Examiner, Art Unit 2667   

/MATTHEW C BELLA/Supervisory Patent Examiner, Art Unit 2667