DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claims 1-20 are pending in this application.

Response to Arguments
Applicant’s arguments regarding the rejections of claims 2-8 and 10 under 35 U.S.C. 112(b) have been fully considered and are persuasive. The rejections have been withdrawn.

Applicant's arguments regarding the 35 U.S.C. 103 rejections of claims 1-20 have been fully considered but they are not persuasive.

Regarding the 35 U.S.C. 103 rejection, the applicant argues the following in the remarks:
a. Lee in view of Kim does not teach an interface of a respective core of the plurality of cores and the interface configured to receive one or more instructions from the task manager to collect external output feature maps corresponding to the set of parallel computation tasks from other cores of the plurality of cores.
b. The dependent claims are allowable because they are dependent on allowable independent claims.
Examiner has thoroughly considered Applicant's arguments, but respectfully finds them unpersuasive for at least the following reasons:
As to point a, Lee teaches the above recited limitation because Lee recites in paragraph [0064] lines 1-8 that “The data arithmetic circuit 114 convolutes the input features and the weights determined by the fetch controller 112 to generate an output feature map. Furthermore, the data arithmetic circuit 114 generates the output feature map by accumulating convolution results of FIG. 3 on an output feature map that has already been generated through the convolution on the input feature map, different from the weight map”, in paragraph [0073] lines 5-7 that “each of the plurality of neural network processors included in the processor array 1020 may be the neural network processor 100 of FIGS. 2 and 3”, in paragraph [0080] lines 5-8 that “each of the neural network processors included in the processor groups of the processor array 1020 receives an identical input feature map and different weight maps”, in paragraph [0072] lines 3-5 that “controller 1010 may control operations of the processor array 1020” and in paragraph [0058] lines 1-3 that “The neural network processor 100 stores or outputs the output feature map generated by the data arithmetic circuit 114 to the internal memory”. Lee teaches that the data arithmetic circuit, which is within a neural network processor, accumulates convolution results (output feature maps) that have already been generated from convolution with different weight maps. The previously generated convolution results must come from other processors of the array because each processor receives a different weight map. The internal memory is the interface that collects external output feature maps because the internal memory is within a neural network processor and it stores the output feature map, which is composed of multiple output feature maps that are generated by other neural network processors.
As to point b, the examiner respectfully disagrees. Applicant's arguments regarding dependent claims fail to comply with 37 CFR 1.111(b) because they amount to a general allegation that the dependent claims define a patentable invention without specifically pointing out how the language of the dependent claims patentably distinguishes them from the references.

Claim Objections
Claim 5 is objected to because of the following informalities:  “the determination” lacks antecedent basis.  Appropriate correction is required.

Claim Rejections - 35 USC § 112
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.

The following is a quotation of the first paragraph of pre-AIA  35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.

Claim 5 is rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA ), first paragraph, as failing to comply with the written description requirement. The claim(s) contains subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA  35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention. 
As per claim 5:
	 Lines 1-4 recite “in response to the determination that the barrier instructions are received from the respective core of the plurality of cores, the task manager is configured to send a resume instruction to the respective core to resume the generation of the second output feature map” but this is different from what is recited in the specification. Paragraph [056] recites “after each of the plurality of cores, task manager 310 can send a resume instruction to the core to resume the generation of the second output feature map.” Therefore, a barrier instruction from each of the cores must be received before resuming the generation of the second output feature map.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claim 1 is rejected under 35 U.S.C. 103 as being unpatentable over Lee et al. (US 20180253636 A1 herein Lee) in view of Shin et al. (DNPU: An 8.1TOPS/W Reconfigurable CNN-RNN Processor for General-Purpose Deep Neural Networks herein Shin).
Lee was recited in the previous office action.

As per claim 1, Lee teaches the invention substantially as claimed including a system for performing parallel computation ([0073] lines 7-9 the plurality of neural network processors may be implemented to operate in parallel), comprising: 
a task manager (Fig. 5, 1010 controller); and 
a plurality of cores coupled with the task manager and configured to respectively perform a set of parallel computation tasks based on instructions from the task manager, wherein a respective core of the plurality of cores further comprises (Fig. 5, 7A, 7B, 1020 processor array; [0072] lines 3-5 controller 1010 may control operations of the processor array 1020):
a processing unit configured to generate a first output feature map corresponding to a first computation task among the set of parallel computation tasks ([0077] lines 1-3 Each of the neural network processors of the processor array 1020 convolutes the allocated input feature map and weight map to generate an output feature map; [0091] lines 3-6 a zeroth neural network processor of the zeroth processor group convolutes (as a first computation task) the input feature map A0 and the weight map K0 to generate an output feature map Psum0; [0049] lines 1-4 a convolution operation on the first feature map FM1 and a weight map WM is performed, and as a result the second feature map FM2 is generated); 
an interface configured to receive one or more instructions from the task manager to collect an output feature map corresponding to a parallel computation task; feature map based on the first output feature map (Fig. 3; [0058] lines 1-3 The neural network processor 100 stores or outputs the output feature map generated by the data arithmetic circuit 114 to the internal memory; [0057] lines 1-5 the neural network processor 100 may further include internal memory. The internal memory may be cache memory of the neural network processor 100. The internal memory may be static random access memory (SRAM); [0073] lines 1-9 The processor array 1020 includes a plurality of neural network processors…the plurality of neural network processors may be implemented to operate in parallel, simultaneously; [0072] lines 3-5 controller 1010 may control operations of the processor array 1020; [0073] lines 11-13 each of the neural network processors may be implemented as a core circuit cable of executing instructions; In other words, the internal memory which is inside of a neural network processor (a respective core) is an interface and it collects an output feature map.); 

Lee fails to teach a respective core of the plurality of cores further comprises: an interface to collect external output feature maps corresponding to the set of parallel computation tasks from other cores of the plurality of cores; a reduction unit configured to generate a reduced feature map based on received external output feature maps.

However, Shin teaches a respective core of the plurality of cores further comprises: an interface to collect external output feature maps corresponding to the set of parallel computation tasks from other cores of the plurality of cores; a reduction unit configured to generate a reduced feature map based on received external output feature maps (Fig. 14.2.2; left column paragraph 3 lines 2-9 The CP is composed of 4 convolution clusters and 1 aggregation core. Each convolution cluster performs convolution operations with 4 convolution cores, and transfers the accumulation results to the accumulation core… The CP and FRP are able to process 4 different CLs and 8 RLs, respectively, in parallel; 
[Image: media_image1.png, excerpt of Shin, Fig. 14.2.2]
 Fig. 14.2.2 shows an image memory (as interface) within the aggregation core which can store accumulation results received from the 4 convolution cores. External output feature maps are taught because convolution produces output feature maps and the aggregation core receives output feature maps from the 4 convolution cores, so the output feature maps are external. A reduction unit is taught because as shown in Fig. 14.2.2, the aggregation core performs pooling which is also known as reduction.).
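
For illustration only, the arrangement relied upon above (several convolution cores whose output feature maps are collected by one aggregation core that then pools them) can be sketched as follows. This is a minimal sketch and not Shin's implementation: the function names, the 5x5 input, the 2x2 kernels, and the assumption that the collected maps are summed element-wise before pooling are all hypothetical.

```python
# Illustrative sketch only: every name and size here is hypothetical,
# not taken from Shin or Lee.

def convolve2d(image, kernel):
    """Valid (no padding) 2D cross-correlation of a square image and kernel."""
    n, k = len(image), len(kernel)
    out = n - k + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(k) for j in range(k))
             for c in range(out)]
            for r in range(out)]

def accumulate(feature_maps):
    """Element-wise sum of equally sized feature maps (assumed collection step)."""
    rows, cols = len(feature_maps[0]), len(feature_maps[0][0])
    return [[sum(fm[r][c] for fm in feature_maps) for c in range(cols)]
            for r in range(rows)]

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; shrinks each dimension by a factor of size."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [[max(feature_map[r + i][c + j]
                 for i in range(size) for j in range(size))
             for c in range(0, cols - size + 1, size)]
            for r in range(0, rows - size + 1, size)]

# Four hypothetical convolution cores share one input and use different kernels.
image = [[(r * 5 + c) % 7 for c in range(5)] for r in range(5)]
kernels = [
    [[1, 0], [0, 0]],
    [[0, 1], [0, 0]],
    [[0, 0], [1, 0]],
    [[0, 0], [0, 1]],
]
core_outputs = [convolve2d(image, k) for k in kernels]  # one 4x4 map per core
collected = accumulate(core_outputs)   # maps received by the aggregation core
reduced = max_pool(collected)          # 2x2 reduced feature map
```

In this sketch the maps in `core_outputs` are "external" from the point of view of the code that pools them, since each is produced by a different convolution call standing in for a different core.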

It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to have combined Lee with the teachings of Shin because Shin’s teaching of a core to collect output feature maps and reduce them allows for 4 different convolutional layers to be processed in parallel without using too much memory (see Shin, left column paragraph 3 lines 8-9 The CP and FRP are able to process 4 different CLs and 8 RLs, respectively, in parallel).
	
Claim 11 is rejected under 35 U.S.C. 103 as being unpatentable over Lee in view of Kim et al. (US 2018/0197084 A1 herein Kim).

As per claim 11, Lee teaches the invention substantially as claimed including a method for performing a set of parallel computation tasks at a core of a plurality of cores coupled with a task manager (Fig. 5, 7A, 7B, 1010 controller, 1020 processor array; [0072] lines 3-5 controller 1010 may control operations of the processor array 1020; [0073] lines 7-9 the plurality of neural network processors may be implemented to operate in parallel), comprising: 
generating, by a processing unit of the core, a first output feature map corresponding to a first computation task among the set of parallel computation tasks ([0077] lines 1-3 Each of the neural network processors of the processor array 1020 convolutes the allocated input feature map and weight map to generate an output feature map; [0091] lines 3-6 a zeroth neural network processor of the zeroth processor group convolutes (as a first computation task) the input feature map A0 and the weight map K0 to generate an output feature map Psum0; [0049] lines 1-4 a convolution operation on the first feature map FM1 and a weight map WM is performed, and as a result the second feature map FM2 is generated); 
receiving, by an interface of the core, one or more instructions from the task manager to collect external output feature maps corresponding to the set of parallel computation tasks from other cores of the plurality of cores (Fig. 3; [0058] lines 1-3 The neural network processor 100 stores or outputs the output feature map generated by the data arithmetic circuit 114 to the internal memory; [0057] lines 1-5 the neural network processor 100 accumulating convolution results of FIG. 3 on an output feature map that has already been generated through the convolution on the input feature map, different from the weight map; [0073] lines 1-9 The processor array 1020 includes a plurality of neural network processors…the plurality of neural network processors may be implemented to operate in parallel, simultaneously; [0072] lines 3-5 controller 1010 may control operations of the processor array 1020; [0073] lines 11-13 each of the neural network processors may be implemented as a core circuit cable of executing instructions; [0073] lines 5-7 each of the plurality of neural network processors included in the processor array 1020 may be the neural network processor 100 of FIGS. 2 and 3; [0080] lines 5-8 each of the neural network processors included in the processor groups of the processor array 1020 receives an identical input feature map and different weight maps); 
feature map based on the first output feature map and received external output feature maps ([0064] lines 1-8 The data arithmetic circuit 114 convolutes the input features and the weights determined by the fetch controller 112 to generate an output feature map. Furthermore, the data arithmetic circuit 114 generates the output feature map by accumulating convolution results of FIG. 3 on an output feature map that has already been generated through the convolution on the input feature map, different from the weight map).

	Lee fails to teach generating, by a reduction unit of the core, a reduced feature map.

However, Kim teaches generating, by a reduction unit of the core, a reduced feature map ([0025] lines 2-3 pooling performed in the calculation unit 130; Fig. 2; [0032] lines 16-18 The first feature map 220 is generated into a size-reduced second feature map 230 by the pooling layer pool1).

It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to have combined Lee with the teachings of Kim because Kim’s teaching of a reduction unit configured to generate reduced feature maps reduces the size of feature maps, making them easier to process (see Kim, [0032] lines 4-8 The data of the first feature map 220 may be a size that is burdensome for processing depending on the number of kernels or the size of the input feature 210. Therefore, in the first pulling layer pool 1, down-sampling (or sub-sampling) is performed to reduce the size of the first feature map 220).

Claims 2, 6, 7, and 10 are rejected under 35 U.S.C. 103 as being unpatentable over Lee and Shin, as applied to claim 1 above, in view of Wu et al. (US 10,346,093 B1 herein Wu).
Wu was cited in the previous office action.

As per claim 2, Lee and Shin teach the system of claim 1. Lee specifically teaches wherein the task manager is configured to instruct the respective core to perform generation of a second output feature map corresponding to a second computation task (Lee 1010 controller; [0072] lines 3-5 controller 1010 may control operations of the processor array 1020; [0091] lines 9-12 a first neural network processor of the zeroth processor group convolutes the input feature map A0 and the weight map K1 to generate an output feature map Psum1 (as a second output feature map)).
Additionally, Shin teaches the generation of the reduced feature map (Fig. 14.2.2, pooling; left column paragraph 3 lines 2-5 The CP is composed of 4 convolution clusters and 1 aggregation core. Each convolution cluster performs convolution operations with 4 convolution cores, and transfers the accumulation results to the accumulation core; The convolution cores perform convolution which outputs feature maps and the aggregation core collects the output feature maps to perform pooling which is also known as reduction.).

	Lee and Shin fail to teach simultaneously perform generation of an output feature map corresponding to a computation task and the generation of the reduced feature map.

	However, Wu teaches simultaneously perform generation of an output feature map corresponding to a computation task and the generation of the reduced feature map (Fig. 5; Col. 8 lines 8-9 pre-pooling is always scheduled to operate in parallel with 3×3 convolution; Col. 8 line 17 The operations of 1×1, 3×3_reduce, and pre-pooling; Col. 8 lines 22-25 Pre-pooling consists of comparators and convolution requires a matrix multiplier. As long as the two paths to not use the same memory port, the two processing paths can operate in parallel; Fig. 6; Col. 12 lines 9-10 The results are stored in RAMs 404, . . . , 406 as output feature maps).



As per claim 6, Lee, Shin, and Wu teach the system of claim 1. Lee specifically teaches wherein the task manager is further configured to select a core among the plurality of cores to receive the feature maps and generate the second output feature map (Lee [0088] lines 6-8 the controller 1010 allocates the divided input feature maps to each of the processor groups in the processor array 1020; [0058] lines 1-3 The neural network processor 100 stores or outputs the output feature map generated by the data arithmetic circuit 114 to the internal memory; [0091] lines 9-12 a first neural network processor of the zeroth processor group convolutes the input feature map A0 and the weight map K1 to generate an output feature map Psum1 (as the second output feature map)).
Additionally, Shin teaches a reduction core to receive the external output feature maps and generate the reduced output feature map (Fig. 14.2.2, pooling; left column paragraph 3 lines 2-5 The CP is composed of 4 convolution clusters and 1 aggregation core. Each convolution cluster performs convolution operations with 4 convolution cores, and transfers the accumulation results to the accumulation core; A reduction core is taught because as shown in Fig. 14.2.2, the aggregation core performs pooling which is also known as reduction.).
Additionally, Wu teaches generate the reduced feature map in parallel with the generation of the output feature map (Wu Fig. 5; Col. 8 lines 8-9 pre-pooling is always scheduled to operate in parallel with 3×3 convolution; Col. 8 line 17 The operations of 1×1, 3×3_reduce, and pre-pooling; Col. 8 lines 22-25 Pre-pooling consists of comparators and convolution requires a matrix multiplier. As long as the two paths to not use the same memory port, the two processing paths can operate in parallel; Col. 12 lines 9-10 The results are stored in RAMs 404, . . . , 406 as output feature maps).

As per claim 7, Lee, Shin, and Wu teach the system of claim 6. Lee specifically teaches wherein the first output feature map is generated by performing convolution processing on a first matrix of a set of matrices (Lee Fig. 1; [0091] lines 3-6 a zeroth neural network processor of the zeroth processor group convolutes the input feature map A0 and the weight map K0 to generate an output feature map Psum0 (as the first output feature map); [0048] lines 5-6 feature maps FM1 and FM2 may have a 2D or a 3D matrix shape; [0049] lines 1-4 a convolution operation on the first feature map FM1 and a weight map WM is performed, and as a result the second feature map FM2 is generated; [0077] lines 1-3 Each of the neural network processors of the processor array 1020 convolutes the allocated input feature map and weight map to generate an output feature map).

As per claim 10, Lee and Shin teach the system of claim 1. Lee specifically teaches wherein the respective core further comprises a memory unit including the generation of the second output feature map (Lee Fig. 3; [0057] lines 1-2 the neural network processor 100 may further include internal memory; [0058] lines 1-3 The neural network processor 100 stores or outputs the output feature map generated by the data arithmetic circuit 114 to the internal memory; [0091] lines 9-12 a first neural network processor of the zeroth processor group convolutes the input feature map A0 and the weight map K1 to generate an output feature map Psum1 (as the second output feature map); [0058] lines 1-3 The neural network processor 100 stores or outputs the output feature map generated by the data arithmetic circuit 114 to the internal memory; [0057] lines 1-5 the neural network processor 100 may further include internal memory. The internal memory may be cache memory of the neural network processor 100. The internal memory may be static random access memory (SRAM)).
Additionally, Shin teaches the generation of the reduced output feature map (Fig. 14.2.2, pooling; left column paragraph 3 lines 2-5 The CP is composed of 4 convolution clusters and 1 aggregation core. Each convolution cluster performs convolution operations with 4 convolution cores, and transfers the accumulation results to the accumulation core; The aggregation core performs pooling which is also known as reduction.).

	Lee and Shin fail to teach a first port for the generation of the output feature map and a second port for the generation of the reduced output feature map. 

	However, Wu teaches a first port for the generation of the output feature map and a second port for the generation of the reduced output feature map (Col. 11 lines 2-3 write port of RAMs 404 and 406; Col. 12 lines 9-10 The results are stored in RAMs 404, . . . , 406 as output feature maps; Col. 8 line 17 The operations of 1×1, 3×3_reduce, and pre-pooling; Col. 8 lines 22-25 Pre-pooling consists of comparators and convolution requires a matrix multiplier. As long as the two paths to not use the same memory port, the two processing paths can operate in parallel).

It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to have combined Lee and Shin with the teachings of Wu because Wu’s teaching of two memory ports increases efficiency (see Wu, Col. 9 lines 21-24 In an alternative implementation, the RAMs/tensor banks can be single-ported instead of dual-ported. The drawback is that read and write operations in the same RAM cannot overlap, thereby reducing processing array efficiency).
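
For illustration only, the scheduling cited from Wu above (a pooling path and a convolution path that can operate in parallel as long as they do not share a memory port) has a loose software analogy: two tasks that read from separate buffers and therefore can be submitted concurrently. The sketch below is an assumption-laden analogy, not Wu's hardware; the buffer names and helper functions are hypothetical, and Python threads only model concurrency, not true hardware parallelism.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical helpers; neither function is taken from Wu.
def convolve_1d(signal, kernel):
    """Valid 1D cross-correlation of signal with kernel."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def max_pool_1d(signal, size):
    """Non-overlapping 1D max pooling with the given window size."""
    return [max(signal[i:i + size])
            for i in range(0, len(signal) - size + 1, size)]

# Each path reads its own buffer, standing in for separate memory ports.
conv_buffer = [1, 2, 3, 4, 5, 6]
pool_buffer = [7, 1, 8, 2, 9, 3]

with ThreadPoolExecutor(max_workers=2) as executor:
    conv_future = executor.submit(convolve_1d, conv_buffer, [1, 0, -1])
    pool_future = executor.submit(max_pool_1d, pool_buffer, 2)
    conv_result = conv_future.result()   # output feature map of the conv path
    pool_result = pool_future.result()   # reduced map of the pooling path
```

Because neither task writes to the other's buffer, the order in which the two tasks finish does not affect either result.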
	
	
Claims 12, 16, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Lee and Kim, as applied to claim 11 above, in view of Wu.

As per claim 12, Lee and Kim teach the method of claim 11. Lee specifically teaches further comprising: performing, by the processing unit of the core, generation of a second output feature map corresponding to a second computation task (Lee [0073] lines 11-13 each of the neural network processors may be implemented as a core circuit cable of executing instructions; [0091] lines 9-12 a first neural network processor of the zeroth processor group convolutes the input feature map A0 and the weight map K1 to generate an output feature map Psum1 (as a second output feature map)).
Kim teaches the generation of the reduced feature map (Kim Fig. 2; [0032] lines 16-18 The first feature map 220 is generated into a size-reduced second feature map 230 by the pooling layer pool1).

Lee and Kim fail to teach simultaneously performing generation of an output feature map corresponding to a computation task and the generation of the reduced feature map.

However, Wu teaches simultaneously performing generation of an output feature map corresponding to a computation task and the generation of the reduced feature map (Fig. 5; Col. 8 lines 8-9 pre-pooling is always scheduled to operate in parallel with 3×3 convolution; Col. 8 line 17 The operations of 1×1, 3×3_reduce, and pre-pooling; Col. 8 lines 22-25 Pre-pooling consists of comparators and convolution requires a matrix multiplier. As long as the two paths to not use the same memory port, the two processing paths can operate in parallel; Fig. 6; Col. 12 lines 9-10 The results are stored in RAMs 404, . . . , 406 as output feature maps).

It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to have combined Lee and Kim with the teachings of Wu because Wu’s teaching of pooling and convolving at the same time allows for these processes to be done in parallel, which is more efficient (see Wu, Col. 8 lines 8-16 Note that pre-pooling is always scheduled to operate in parallel with 3x3 convolution as the 3x3 convolution shares no tensor buffers with pre-pooling and happens to consume the most time in the primary pipeline, thereby presenting the least stringent timing constraints for designing the prepooler. Every 

As per claim 16, Lee, Kim, and Wu teach the method of claim 12. Lee specifically teaches wherein a core among the plurality of cores is selected by the task manager to receive the external output feature maps and generate the second output feature map (Lee [0088] lines 6-8 the controller 1010 allocates the divided input feature maps to each of the processor groups in the processor array 1020; [0058] lines 1-3 The neural network processor 100 stores or outputs the output feature map generated by the data arithmetic circuit 114 to the internal memory; [0064] lines 1-8 The data arithmetic circuit 114 convolutes the input features and the weights determined by the fetch controller 112 to generate an output feature map. Furthermore, the data arithmetic circuit 114 generates the output feature map by accumulating convolution results of FIG. 3 on an output feature map that has already been generated through the convolution on the input feature map, different from the weight map; [0080] lines 5-8 each of the neural network processors included in the processor groups of the processor array 1020 receives an identical input feature map and different weight maps; [0073] lines 5-7 each of the plurality of neural network processors included in the processor array 1020 may be the neural network processor 100 of FIGS. 2 and 3).
          Additionally, Kim teaches a reduction core to generate the reduced feature map (Kim [0025] lines 2-3 pooling performed in the calculation unit 130; [0024] lines 1-2 The calculation unit 130 may include a plurality of MAC cores; [0020] lines 5-6 components for implementing hardware such as a Graphic Processing Unit (GPU); [0032] lines 16-18 The first feature map 220 is generated into a size-reduced second feature map 230 by the pooling layer pool1).
Wu teaches generate the reduced feature map in parallel with the generation of the second output feature map (Wu Fig. 5; Col. 8 lines 8-9 pre-pooling is always scheduled to operate in parallel with 3×3 convolution; Col. 8 line 17 The operations of 1×1, 3×3_reduce, and pre-pooling; Col. 8 lines 22-25 Pre-pooling consists of comparators and convolution requires a matrix multiplier. As long as the two paths to not use the same memory port, the two processing paths can operate in parallel; Col. 12 lines 9-10 The results are stored in RAMs 404, . . . , 406 as output feature maps).

As per claim 17, Lee, Kim, and Wu teach the method of claim 16. Lee specifically teaches wherein the first output feature map is generated by performing convolution processing on a first matrix of a set of matrices (Lee Fig. 1; [0091] lines 3-6 a zeroth neural network processor of the zeroth processor group convolutes the input feature map A0 and the weight map K0 to generate an output feature map Psum0 (as the first output feature map); [0048] lines 5-6 feature maps FM1 and FM2 may have a 2D or a 3D matrix shape; [0049] lines 1-4 a convolution operation on the first feature map FM1 and a weight map WM is performed, and as a result the second feature map FM2 is generated; [0077] lines 1-3 Each of the neural network processors of the processor array 1020 convolutes the allocated input feature map and weight map to generate an output feature map).

Claims 3 and 4 are rejected under 35 U.S.C. 103 as being unpatentable over Lee, Shin, and Wu, as applied to claim 2 above, in view of Steinmacher-Burow (US 2019/0303295 A1).
Steinmacher-Burow was cited in the previous office action.

As per claim 3, Lee, Shin, and Wu teach the system of claim 2. Lee specifically teaches wherein when the first output feature map is generated, the processing unit is further configured to issue an instruction to the task manager, and the generation of the second output feature map (Lee [0091] lines 3-6 a zeroth neural network processor of the zeroth processor group convolutes the input feature map A0 and the weight map K0 to generate an output feature map Psum0 (as the first output feature map); [0091] lines 9-12 a first neural network processor of the zeroth processor group convolutes the input feature map A0 and the weight map K1 to generate an output feature map Psum1 (as the second output feature map); [0073] lines 11-13 each of the neural network processors may be implemented as a core circuit cable of executing instructions; [0072] lines 3-5 controller 1010 may control operations of the processor array 1020).

Lee, Shin, and Wu fail to teach issue a barrier instruction and stalls. 

However, Steinmacher-Burow teaches issue a barrier instruction and stalls ([0073] lines 9-11 each processor comprises a memory barrier instruction implemented by stalling the execution by the processor of its instruction).

It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to have modified Lee, Shin, and Wu with the teachings of Steinmacher-Burow because Steinmacher-Burow’s teaching of issuing a barrier instruction and stalling ensures synchronization and coherence (see Steinmacher-Burow, [0004] lines 20-23 A memory barrier instruction defines a point in an instruction sequence at which coherence has to be ensured and, if necessary, a synchronization performed).

As per claim 4, Lee, Shin, Wu, and Steinmacher-Burow teach the system of claim 3. Lee specifically teaches wherein the task manager is further configured to receive (Lee [0076] lines 11-12 controller 1010 receives input feature information and weight information). 
Additionally, Steinmacher-Burow teaches the barrier instructions from the respective core of the plurality of cores (Steinmacher-Burow [0073] lines 9-11 each processor comprises a memory barrier instruction implemented by stalling the execution by the processor of its instruction).

Claims 13 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Lee, Kim, and Wu, as applied to claim 12 above, in view of Steinmacher-Burow.

As per claim 13, Lee, Kim, and Wu teach the method of claim 12. Lee specifically teaches further comprising: in response to the first output feature map being generated, issuing an instruction to the task manager, and the generation of the second output feature map (Lee [0091] lines 3-6 a zeroth neural network processor of the zeroth processor group convolutes the input feature map A0 and the weight map K0 to generate an output feature map Psum0 (as the first output feature map); [0091] lines 9-12 a first neural network processor of the zeroth processor group convolutes the input feature map A0 and the weight map K1 to generate an output feature map Psum1 (as the second output feature map); [0073] lines 11-13 each of the neural network processors may be implemented as a core circuit cable of executing instructions).

Lee, Kim, and Wu fail to teach issuing a barrier instruction and stalling.

However, Steinmacher-Burow teaches issuing a barrier instruction and stalling ([0073] lines 9-11 each processor comprises a memory barrier instruction implemented by stalling the execution by the processor of its instruction).

It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to have modified Lee, Kim, and Wu with the teachings of Steinmacher-Burow because Steinmacher-Burow’s teaching of issuing a barrier instruction and stalling ensures synchronization and coherence (see Steinmacher-Burow, [0004] lines 20-23 A memory barrier instruction defines a point in an instruction sequence at which coherence has to be ensured and, if necessary, a synchronization performed).

As per claim 14, Lee, Kim, Wu, and Steinmacher-Burow teach the method of claim 13. Lee specifically teaches task manager (Lee [0076] lines 11-12 controller 1010 receives input feature information and weight information).
Additionally, Steinmacher-Burow teaches further configured to determine whether the barrier instructions from each of the plurality of cores are received (Steinmacher-Burow [0073] lines 9-11 each processor comprises a memory barrier instruction implemented by stalling the execution by the processor of its instruction; [0009] lines 1-4 Embodiments may have the .

Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over Lee, Shin, Wu, and Steinmacher-Burow, as applied to claim 4 above, in view of Xu et al. (CN 101908034 A herein Xu). 
The claim mappings for Xu will be made using the translation of CN 101908034 A.
Xu was cited in the previous office action.

As per claim 5, Lee, Shin, Wu, and Steinmacher-Burow teach the system of claim 4. Lee specifically teaches the respective core of the plurality of cores, the task manager is configured to send an instruction to the respective core to generate the second output feature map (Lee 1010 controller; [0072] lines 3-5 controller 1010 may control operations of the processor array 1020; [0073] lines 11-13 each of the neural network processors may be implemented as a core circuit capable of executing instructions; [0091] lines 9-12 a first neural network processor of the zeroth processor group convolutes the input feature map A0 and the weight map K1 to generate an output feature map Psum1).

Lee, Shin, Wu, and Steinmacher-Burow fail to teach wherein in response to the determination that the barrier instructions are received from the respective core of the plurality of cores, send a resume instruction to the respective core to resume.

However, Xu teaches wherein in response to the determination that the barrier instructions are received from the respective core of the plurality of cores, send a resume instruction to the respective core to resume (Xu [0146] 919 lines 2-6 checks whether the fence messages of the small cores participating in the fence synchronization corresponding to the fence message have arrived at the synchronization management device according to the record and the fence message. If yes, according to the sequence in which the fence messages of each corelet arrive at the synchronization management device, a confirmation message is sent to each corelet synchronized by the fence in turn; [0147] 928 lines 1-2 In step S2300, after receiving the confirmation message, the small core continues to execute the instructions following the fence instruction; [0019] 139 lines 1-2 When the small core executes the barrier instruction, it sends a barrier message to the synchronization management device).

It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to have combined Lee, Shin, Wu, and Steinmacher-Burow with the teachings of Xu because Xu’s teaching of barrier instructions and resume instructions provides the advantage of synchronization in a multi-core environment (see Xu, [0010] 79 The invention discloses an on-chip synchronization method for a many-core processor).
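
For illustration only, the synchronization behavior relied upon from Xu (barrier messages are recorded, and confirmation messages are sent in arrival order once every participating core has reported) can be sketched as follows. The sketch is not taken from Xu or from the claims; the class name SyncManager, the method name receive_barrier, and the use of integer core identifiers are hypothetical and chosen only to make the sequence of events concrete.

```python
class SyncManager:
    """Hypothetical model of a synchronization management device."""

    def __init__(self, participants):
        # Cores taking part in this barrier (assumed known in advance).
        self.participants = set(participants)
        # Barrier messages received so far, kept in arrival order.
        self.arrived = []

    def receive_barrier(self, core_id):
        """Record a barrier message from core_id.

        Returns the cores to resume, in arrival order, once every
        participant has reported; otherwise returns an empty list.
        """
        if core_id not in self.participants:
            raise ValueError(f"core {core_id} is not a participant")
        if core_id not in self.arrived:
            self.arrived.append(core_id)
        if set(self.arrived) == self.participants:
            to_resume = list(self.arrived)
            self.arrived = []  # reset for the next barrier
            return to_resume
        return []


manager = SyncManager([0, 1, 2])
print(manager.receive_barrier(2))  # not all cores have arrived yet
print(manager.receive_barrier(0))  # still waiting on core 1
print(manager.receive_barrier(1))  # all arrived: resume in arrival order
```

In this sketch a stalled core would resume only after its identifier appears in the returned list, which corresponds to the confirmation message described in Xu [0146] and [0147].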

Claims 15 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Lee, Kim, Wu, and Steinmacher-Burow, as applied to claim 14 above, in view of Xu.

As per claim 15, Lee, Kim, Wu, and Steinmacher-Burow teach the method of claim 14. Lee specifically teaches each of the plurality of cores, the task manager is configured to send an instruction to the core for the generation of the second output feature map (Lee 1010 controller; [0072] lines 3-5 controller 1010 may control operations of the processor array 1020; [0073] lines 11-13 each of the neural network processors may be implemented as a core circuit capable of executing instructions; [0091] lines 9-12 a first neural network processor of the zeroth processor group convolutes the input feature map A0 and the weight map K1 to generate an output feature map Psum1).

Lee, Kim, Wu, and Steinmacher-Burow fail to teach wherein in response to the determination that the barrier instructions are received from each of the plurality of cores, send a resume instruction to the core to resume the generation of the second output feature map.

However, Xu teaches wherein in response to the determination that the barrier instructions are received from each of the plurality of cores, send a resume instruction to the core to resume the generation of the second output feature map ([0146] 919 lines 2-6 checks whether the fence messages of the small cores participating in the fence synchronization corresponding to the fence message have arrived at the synchronization management device according to the record and the fence message. If yes, according to the sequence in which the fence messages of each corelet arrive at the synchronization management device, a confirmation message is sent to each corelet synchronized by the fence in turn; [0147] 928 lines 1-2 In step S2300, after receiving the confirmation message, the small core continues to execute the instructions following the fence instruction; [0019] 139 lines 1-2 When the small core executes the barrier instruction, it sends a barrier message to the synchronization management device).
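It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to have combined Lee, Kim, Wu, and Steinmacher-Burow with the teachings of Xu because Xu’s teaching of barrier instructions and resume instructions provides the advantage of synchronization in a multi-core environment (see Xu, [0010] 79 The invention discloses an on-chip synchronization method for a many-core processor).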

As per claim 20, it is a method claim of claim 15. Therefore, it is rejected for the same reasons as claim 15 above.

Claim 8 is rejected under 35 U.S.C. 103 as being unpatentable over Lee, Shin, and Wu, as applied to claim 7 above, in view of Rhine (US 2004/0117790 A1).
Rhine was cited in the previous office action.

As per claim 8, Lee, Shin, and Wu teach the system of claim 7. Lee specifically teaches wherein a number of the cores is N, a number of the set of matrices is K, and for a kth iteration, wherein k is a positive integer that is less than or equal to K (Lee Fig. 7A, 7B; [0079] lines 4-6 predetermined number of neural network processors into one processor group, and consequently determines a plurality of processor groups; [0048] lines 5-6 feature maps FM1 and FM2 may have a 2D or a 3D matrix shape; [0088] lines 1-4 input feature map with a width W, a height H, and a channel C (as K) according to a spatial dimension and sequentially allocates the divided input feature maps to each of the processor groups; [0089] lines 1-3 allocates each of the divided input feature maps A0 and A1 to each of a zeroth processor group and a first processor group; [0092] lines 1-5 allocates each of the divided input feature maps A2 and A3 to each of the zeroth processor group and the first processor group of the processor array 1020 after the convolution on the divided input feature maps A0 and A1; In other words, in the first iteration A0 and A1 are allocated (Fig. 7A), and in the second iteration A2 and A3 are allocated (Fig. 7B). There are 4 feature maps and 2 iterations, so k is less than or equal to K.).
Additionally, Shin teaches the reduction core (Shin Fig. 14.2.2, pooling; The aggregation core performs pooling, which is also known as reduction.).

Lee, Shin, and Wu fail to teach the core is a (k%N)th core among the plurality of cores.

However, Rhine teaches the core is a (k%N)th core among the plurality of cores (Rhine [0004] lines 4-8 if there are 10 CPUs, the first process will be assigned to the first CPU, the second process will be assigned to the second CPU, and so forth. After the last CPU is reached on the tenth process, the first CPU is again assigned to the eleventh process, and hence the name round-robin; [0011] lines 7-11 The processes in the first priority group are distributed among resources of the first group of resources in a round-robin fashion starting from a first starting resource of the first group of resources).

It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to have combined Lee, Shin, and Wu with the teachings of Rhine because Rhine’s teaching of utilizing round-robin allows for a fair balance of work among cores (see Rhine, [0054] lines 10-14 Because it incorporates the round-robin techniques, in any case where we have more processes in each priority group than CPUs (over-committed) we achieve the same ideal balance as the unmodified round-robin).
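
For illustration only, the correspondence between round-robin assignment as described in Rhine [0004] and a modulo selection of the form k%N can be sketched as follows. The function names are hypothetical, and the choice of 0-based indexing for cores versus 1-based numbering for Rhine's CPUs and processes is an assumption made for the example rather than something stated in either reference.

```python
def core_for_iteration(k: int, n_cores: int) -> int:
    """Return the 0-based index of the core selected for iteration k (k % N)."""
    if n_cores <= 0:
        raise ValueError("n_cores must be positive")
    return k % n_cores


def cpu_for_process(p: int, n_cpus: int) -> int:
    """Round-robin with 1-based numbering: process p goes to CPU ((p - 1) % N) + 1."""
    if n_cpus <= 0 or p <= 0:
        raise ValueError("p and n_cpus must be positive")
    return (p - 1) % n_cpus + 1


# With 10 CPUs, processes 1..10 map to CPUs 1..10 and process 11 wraps to CPU 1.
print([cpu_for_process(p, 10) for p in range(1, 12)])
```

Under these assumptions, the wrap-around from the tenth process to the eleventh is the same modulo behavior as selecting the (k%N)th core.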

Claim 18 is rejected under 35 U.S.C. 103 as being unpatentable over Lee, Kim, and Wu, as applied to claim 17 above, in view of Rhine.

As per claim 18, Lee, Kim, and Wu teach the method of claim 17. Lee specifically teaches wherein a number of the cores is N, a number of the set of matrices is K, and for a kth iteration, wherein k is a positive integer that is less than or equal to K (Lee Fig. 7A, 7B; [0079] lines 4-6 predetermined number of neural network processors into one processor group, and consequently determines a plurality of processor groups; [0048] lines 5-6 feature maps FM1 and FM2 may have a 2D or a 3D matrix shape; [0088] lines 1-4 input feature map with a width W, a height H, and a channel C (as K) according to a spatial dimension and sequentially allocates the divided input feature maps to each of the processor groups; [0089] lines 1-3 allocates each of the divided input feature maps A0 and A1 to each of a zeroth processor group and a first processor group; [0092] lines 1-5 allocates each of the divided input feature maps A2 and A3 to each of the zeroth processor group and the first processor group of the processor array 1020 after the convolution on the divided input feature maps A0 and A1; In other words, in the first iteration A0 and A1 are allocated (Fig. 7A), and in the second iteration A2 and A3 are allocated (Fig. 7B). There are 4 feature maps and 2 iterations, so k is less than or equal to K.).
Additionally, Kim teaches the reduction core (Kim [0025] lines 2-3 pooling performed in the calculation unit 130; [0024] lines 1-2 The calculation unit 130 may include a plurality of MAC cores; [0020] lines 5-6 components for implementing hardware such as a Graphic Processing Unit (GPU); [0032] lines 16-18 The first feature map 220 is generated into a size-reduced second feature map 230 by the pooling layer pool1).

Lee, Kim, and Wu fail to teach the core is a (k%N)th core among the plurality of cores.

However, Rhine teaches the core is a (k%N)th core among the plurality of cores (Rhine [0004] lines 4-8 if there are 10 CPUs, the first process will be assigned to the first CPU, the second process will be assigned to the second CPU, and so forth. After the last CPU is reached on the tenth process, the first CPU is again assigned to the eleventh process, and hence the name round-robin; [0011] lines 7-11 The processes in the first priority group are distributed among resources of the first group of resources in a round-robin fashion starting from a first starting resource of the first group of resources).

It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to have combined Lee, Kim, and Wu with the teachings of Rhine because Rhine’s teaching of utilizing round-robin allows for a fair balance of work among cores (see Rhine, [0054] lines 10-14 Because it incorporates the round-robin techniques, in any case where we have more processes in each priority group than CPUs (over-committed) we achieve the same ideal balance as the unmodified round-robin).

Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Lee and Shin, as applied to claim 1 above, in view of Sirotkovic et al. (US 2019/0188295 A1 herein Sirotkovic).
Sirotkovic was cited in the previous office action.

As per claim 9, Lee and Shin teach the system of claim 1. Lee specifically teaches wherein, among the plurality of cores, a first core generates a first output feature map, a second core generates a second output feature map, a third core generates a third output feature map, and a fourth core generates a fourth output feature map, and the first output feature map, the second output feature map, the third output feature map, and the fourth output feature map are combined into a set of output feature maps (Lee Fig. 7A, 7B; [0091] lines 3-18 a zeroth neural network processor of the zeroth processor group convolutes the input feature map A0 and the weight map K0 to generate an output feature map Psum0…In addition, a first neural network processor of the zeroth processor group convolutes the input feature map A0 and the weight map K1 to generate an output feature map Psum1. Similarly, the zeroth neural network processor of the first processor group convolutes the input feature map A1 and the weight map K0 to generate the output feature map Psum0. The first neural network processor of the first processor group convolutes the input feature map A1 and the weight map K1 to generate the output feature map Psum1; [0106] lines 3-7 the memory 140 stores intermediate results generated during the convolution performed by the neural network apparatus 130, for example, output feature maps, as an output feature list or an output feature matrix).
Additionally, Shin teaches a first reduced output feature map, a second reduced output feature map, a third reduced output feature map, and a fourth reduced output feature map, and the first reduced output feature map, the second reduced output feature map, the third reduced output feature map, and the fourth reduced output feature map are a set of reduced output feature maps (Shin Fig. 14.2.2; left column paragraph 3 lines 2-5 The CP is composed of 4 convolution clusters and 1 aggregation core. Each convolution cluster performs convolution operations with 4 convolution cores, and transfers the accumulation results…; Each convolution cluster produces an output feature map and these output feature maps are sent to the aggregation core, which performs reduction (pooling), so first, second, third, and fourth reduced output feature maps are taught.).

Lee and Shin fail to teach combined into a set of output feature maps in an interleaved manner.

However, Sirotkovic teaches combined into a set of output feature maps in an interleaved manner (Fig. 7; [0062] lines 1-12 The implementation of the pooling portion 708 of the CNN 702 in FIG. 7 may utilize a cross-channel pooling process…In some other implementation, the feature map stacks 720, 722, 724, and 726 may be stacked in any other predetermined order to form the single feature map stack 730. Alternatively, they may be interleaved in any predetermined manner to form the single feature map stack 730).

It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to have combined Lee and Shin with Sirotkovic’s teaching of interleaving output feature maps in order to arrange feature maps in a way that is the most suitable (see Sirotkovic, [0062] lines 11-12 they may be interleaved in any predetermined manner to form the single feature map stack).
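
For illustration only, one possible way of interleaving four feature map stacks into a single stack can be sketched as follows. Sirotkovic [0062] states only that the stacks may be interleaved "in any predetermined manner"; the element-by-element ordering shown here, the function name interleave, and the string placeholders standing in for feature maps are assumptions made for the example.

```python
from itertools import chain


def interleave(*stacks):
    """Interleave equally sized stacks element by element into one list."""
    if len({len(s) for s in stacks}) > 1:
        raise ValueError("all stacks must have the same length")
    # zip groups the i-th item of every stack; chain flattens the groups.
    return list(chain.from_iterable(zip(*stacks)))


# Placeholders for the feature maps produced by four cores.
core0 = ["a0", "a1"]
core1 = ["b0", "b1"]
core2 = ["c0", "c1"]
core3 = ["d0", "d1"]

print(interleave(core0, core1, core2, core3))
```

Simple concatenation (core0 followed by core1, and so on) would correspond to the alternative in Sirotkovic of stacking in a predetermined order rather than interleaving.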
	
Claim 19 is rejected under 35 U.S.C. 103 as being unpatentable over Lee and Kim, as applied to claim 11 above, in view of Sirotkovic.

As per claim 19, Lee and Kim teach the method of claim 11. Lee specifically teaches wherein among the plurality of cores, a first core generates a first output feature map, a second core generates a second output feature map, a third core generates a third output feature map, and a fourth core generates a fourth output feature map, and the first output feature map, the second output feature map, the third output feature map, and the fourth output feature map are combined into a set of output feature maps (Lee Fig. 7A, 7B; [0091] lines 3-18 a zeroth neural network processor of the zeroth processor group convolutes the input feature map A0 and the weight map K0 to generate an output feature map Psum0…In addition, a first neural network processor of the zeroth processor group convolutes the input feature map A0 and the weight map K1 to generate an output feature map Psum1. Similarly, the zeroth neural network processor of the first processor group convolutes the input feature map A1 and the weight map K0 to generate the output feature map Psum0. The first neural network processor of the first processor group convolutes the input feature map A1 and the weight map K1 to generate the output feature map Psum1; [0106] lines 3-7 the memory 140 stores intermediate results generated during the convolution performed by the neural network apparatus 130, for example, output feature maps, as an output feature list or an output feature matrix).
Additionally, Kim teaches a first reduced output feature map, a second reduced output feature map, a third reduced output feature map, and a fourth reduced output feature map, and the first reduced output feature map, the second reduced output feature map, the third reduced output feature map, and the fourth reduced output feature map are a set of reduced output feature maps (Kim Fig. 3; [0065] lines 1-5 The instruction generator 331 generates a command that allows each processor unit to perform convolution, batch normalization, and pooling using the feature map delivered from the feature map memory on the…; Since each processor unit can generate an output feature map and can perform pooling, there are a plurality of reduced output feature maps. Fig. 3 shows at least 4 processor units, so first, second, third, and fourth reduced output feature maps are taught.).

Lee and Kim fail to teach combined into a set of output feature maps in an interleaved manner.

However, Sirotkovic teaches combined into a set of output feature maps in an interleaved manner (Fig. 7; [0062] lines 1-12 The implementation of the pooling portion 708 of the CNN 702 in FIG. 7 may utilize a cross-channel pooling process…In some other implementation, the feature map stacks 720, 722, 724, and 726 may be stacked in any other predetermined order to form the single feature map stack 730. Alternatively, they may be interleaved in any predetermined manner to form the single feature map stack 730).

It would have been obvious to one having ordinary skill in the art before the effective filing date of the claimed invention to have combined Lee and Kim with Sirotkovic’s teaching of interleaving output feature maps in order to arrange feature maps in a way that is the most suitable (see Sirotkovic, [0062] lines 11-12 they may be interleaved in any predetermined manner to form the single feature map stack).
	

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to HSING CHUN LIN whose telephone number is (571)272-8522.  The examiner can normally be reached on Mon - Fri 9AM-5PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
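If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, MENG AI T AN, can be reached on (571)272-3756.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.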
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.




/H.L./Examiner, Art Unit 2195
/MENG AI T AN/Supervisory Patent Examiner, Art Unit 2195