Detailed Action
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claims 1-13, 16-24, and 26-28 are pending for examination. Claims 1, 13, and 24 are independent.

Response to Amendment
The office action is responsive to the amendment filed on 05/04/2021. As directed by the amendment, claims 1, 13, and 24 are amended. 

Response to Arguments
Applicant's arguments filed 05/04/2022 have been fully considered but they are not persuasive. 
Applicant argues:
Nurvitadhi and Mills do not disclose or suggest feature (iii) 
“Mills does not disclose or suggest at least the feature (iii). The Action mapped Mills's kernel DMA 324 to the claimed "first DMA data path" and "second DMA data path," and Mills's neural engines 314 to the claimed "first processor core" and the "second processor core." (Action at 6.) The Action further argued that, relying on portions of Mills, "Examiner interprets the connections shown in figure 3 as disclosing the wired connections. Examiner also interprets the data buffers 318 and 320 as first/second load-store data paths." (Action at 7.) However, such an interpretation is erroneous”
“More specifically, the cited portions fail to disclose or suggest "at least a portion of the wired connections are shared between the first DMA data path and the second DMA data path; and at least another portion of the wired connections are shared between the first load-store data path and the second load-store data path."”
Examiner's response:
Examiner respectfully disagrees. The lines shown in the circuit diagram of Fig. 3 of Mills disclose wired connections. Under the broadest reasonable interpretation, the connections between data buffer 318 and neural engines 314 disclose a shared connection between the first DMA data path to a first core (i.e., connection 322A) and the second DMA data path to a second core (i.e., 328N). The shared connection between buffers 318 and 320 discloses the connection between the load-store paths.
The remaining arguments are moot because the new ground of rejection does not rely on any reference applied in the prior rejection of record for any teaching or matter specifically challenged in the argument.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-9, 11-13, 16-24, and 26-28 are rejected under 35 U.S.C. 103 as being unpatentable over Nurvitadhi et al. (US 20180315158, hereinafter "Nurvitadhi") in view of Phelps et al. (US 20180336165 A1, hereinafter "Phelps") and Mills et al. (US 2019/0340486, hereinafter "Mills").

Regarding Claim 1
Nurvitadhi discloses: A circuit configured to implement a neural network comprising a plurality of neural network layers ([Para 0145 and Fig 2A]), the circuit comprising: 
a first memory configured to provide data for performing computations to generate an output for a layer of the neural network ([Para 0152, Fig. 1 “system memory”, Fig 7. “Memory” Fig 14 “Unified Memory”, “System Memory”, “GPGPU memory”]);  
a shared memory disposed intermediate the first memory and at least one of the first processor core or the second processor core, wherein the shared memory comprises: 
a first load-store data path configured to route data communications between the shared memory and the first vector register included in the first processor core, and a second load-store data path configured to route data communications between the shared memory and the second vector register included in the second processor core ([Fig. 2D (266)] “one or more general purpose graphics processing unit (GPGPU) cores 262, and one or more load/store units 266. The GPGPU cores 262 and load/store units 266 are coupled with cache memory 272 and shared memory 270 via a memory and cache interconnect 268.” [Para 0077] “The register file 258 provides a set of registers for the functional units of the graphics multiprocessor 234. The register file 258 provides temporary storage for operands connected to the data paths of the functional units (e.g., GPGPU cores 262, load/store units 266) of the graphics multiprocessor 234.” Examiner reads the operands stored by the register file 258 as a respective vector register in the processor core (i.e., GPGPU). Fig. 2D shows that the load/store unit routes data to shared memory 270 and register file 258 (i.e., vector register) in the processor core. The additional cores and load/store units are interpreted as disclosing the second units.); [Application No. 15/931,970; filed May 14, 2020]
Nurvitadhi does not explicitly disclose:  a first vector memory located within the first processor core and configured to store first vector values derived from the data provided by the first memory, and a first vector register located within the first processor core and configured to at least load data from or store data to the first vector memory; a second processor core comprising: a second vector memory located within the second processor core and configured to store second vector values derived from the data provided by the first memory, and a second vector register located within the second processor core and configured to at least load data from or store data to the second vector memory;
However, Phelps discloses in the same field of endeavor: 
a first processor core ([Fig 1A (103a)]) comprising: 
a first vector memory located within the first processor core and configured to store first vector values derived from the data provided by the first memory ([0036 and Fig 1A-B] “FIG. 1B shows a high-level example of compute core (101). The compute core can be a machine, i.e., a VLIW machine, that controls several compute units in parallel. Each compute core (101) contains: a scalar memory (104), a vector memory (108), a scalar processor (107), vector registers (106), and extended vector units (i.e., a matrix multiply unit (MXU) (113), a transpose unit (XU) (114), and a reduction and permutation unit (RPU) (116)).” Examiner reads core 103a shown in Fig. 1A as a first processor core. Vector memory 108 and vector register 106 are within core 103a as shown in Fig. 1B.), and 
a first vector register located within the first processor core and configured to at least load data from or store data to the first vector memory ([Para 0050] “The matrix multiply unit (113) can process weight inputs and activation inputs and provide a vector of outputs to the vector registers 106. The vector processing unit can process the vector of outputs and store a vector of processed outputs to the vector memory.” Examiner reads the vector register 106 as storing data (i.e. vector outputs) to the vector memory 108.); 
a second processor core ([Fig 1A (103b)]) comprising: 
a second vector memory located within the second processor core and configured to store second vector values derived from the data provided by the first memory ([0036 and Fig 1A-B] “FIG. 1B shows a high-level example of compute core (101). The compute core can be a machine, i.e., a VLIW machine, that controls several compute units in parallel. Each compute core (101) contains: a scalar memory (104), a vector memory (108), a scalar processor (107), vector registers (106), and extended vector units (i.e., a matrix multiply unit (MXU) (113), a transpose unit (XU) (114), and a reduction and permutation unit (RPU) (116)).” Examiner reads core 103b shown in Fig. 1A as a second processor core.), and 
a second vector register located within the second processor core and configured to at least load data from or store data to the second vector memory ([Para 0050] “The matrix multiply unit (113) can process weight inputs and activation inputs and provide a vector of outputs to the vector registers 106. The vector processing unit can process the vector of outputs and store a vector of processed outputs to the vector memory.” Examiner reads the vector register 106 as storing data (i.e. vector outputs) to the vector memory 108.); and 
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the apparatus to perform machine learning operations disclosed by Nurvitadhi with the hardware for matrix multiplication taught by Phelps. One of ordinary skill in the art would have been motivated to make this modification in order to perform matrix multiplication using a hardware circuit (Abstract, Phelps).
Nurvitadhi in view of Phelps does not explicitly disclose: a first direct memory access (DMA) data path configured to route data communications between the shared memory and the first vector memory included in the first processor core, a second direct memory access (DMA) data path configured to route data communications between the shared memory and the second vector memory included in the second processor core; wherein the first and second DMA data paths and the first and second load-store data paths are established using wired connections, wherein at least a portion of the wired connections are shared between the first DMA data path and the second DMA data path; and at least another portion of the wired connections are shared between the first load-store data path and the second load-store data path.
However, Mills discloses in the same field of endeavor: a first direct memory access (DMA) data path configured to route data communications between the shared memory and the first vector memory included in the first processor core, a second direct memory access (DMA) data path configured to route data communications between the shared memory and the second vector memory included in the second processor core ([Para 0051 and Fig 3] “Kernel DMA 324 is a read circuit that fetches kernel data from a source (e.g., system memory 230) and sends kernel data 326A through 326N to each of the neural engines 314.” Examiner reads the kernel DMA as routing data between shared memory (i.e., system memory 230) and first vector memory in the first processor core (i.e., neural engine 314A). [Para 0087] “The neural engine 314 may further receive, from the kernel DMA 324 (kernel fetcher circuit), vector elements of the vector 904 as kernel data 326.” Fig. 4 discloses that the neural engine 314 includes a processor core. The DMA paths to neural engines 314A-314N disclose a second path.); 
wherein the first and second DMA data paths and the first and second load-store data paths are established using wired connections, wherein at least a portion of the wired connections are shared between the first DMA data path and the second DMA data path; and at least another portion of the wired connections are shared between the first load-store data path and the second load-store data path ([Para 0048-0054 and Fig 3] Examiner interprets the connections shown in figure 3 as disclosing the wired connections. Examiner also interprets the data buffers 318 and 320 as first/second load-store data paths.).
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the Apparatus to perform machine learning operations disclosed by Nurvitadhi with the Hardware for Matrix Multiplication taught by Phelps with the neural processor circuit disclosed by Mills. One of ordinary skill in the art would have been motivated to make this modification in order to perform vector computations in a neural engine circuit (Abstract, Mills).
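
For illustration only, the data-path arrangement recited in claim 1 can be sketched as a toy software model. The sketch below is an assumption-laden reading of the claim language, not an implementation taken from Nurvitadhi, Phelps, Mills, or the application; the names `Core`, `SharedMemory`, `dma_to_core`, and `load_to_core` are hypothetical. A single `SharedMemory` object stands in for the wired connections that both cores share.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    """Hypothetical processor core holding a vector memory and a vector register."""
    vector_memory: dict = field(default_factory=dict)
    vector_register: list = field(default_factory=list)


@dataclass
class SharedMemory:
    """Hypothetical shared memory placed between a first memory and the cores."""
    banks: dict = field(default_factory=dict)

    def dma_to_core(self, core: Core, key: str) -> None:
        # DMA data path: shared memory -> vector memory of the given core.
        core.vector_memory[key] = list(self.banks[key])

    def load_to_core(self, core: Core, key: str) -> None:
        # Load-store data path: shared memory -> vector register of the given core.
        core.vector_register = list(self.banks[key])


# Data supplied by the "first memory" (values are made up for the example).
first_memory = {"weights": [1.0, 2.0], "inputs": [3.0, 4.0]}
shared = SharedMemory(banks=dict(first_memory))
core_a, core_b = Core(), Core()

# First and second DMA data paths reuse the same shared-memory object.
shared.dma_to_core(core_a, "weights")
shared.dma_to_core(core_b, "weights")

# First and second load-store data paths reuse it as well.
shared.load_to_core(core_a, "inputs")
shared.load_to_core(core_b, "inputs")
```

In this reading, the DMA path targets the vector memory and the load-store path targets the vector register, which is the distinction the claim draws between the two kinds of path.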

Regarding Claim 13 
Nurvitadhi discloses: A method for performing computations to generate an output for a layer of a neural network comprising a plurality of neural network layers using a circuit configured to implement the neural network ([Para 0145 and Fig 2A]), the method comprising: providing, from a first memory, data used to generate an output for a neural network layer ([Para 0152, Fig. 1 “system memory”, Fig 7. “Memory”, Fig 14 “Unified Memory”, “System Memory”, “GPGPU memory”]); routing, using a first load-store data path of the shared memory, data communications comprising third vector values between the shared memory and a first vector register included in the first processor core; routing, using a second load-store data path of the shared memory, data communications comprising fourth vector values between the shared memory and a second vector register included in the second processor core ([Fig. 2D (266)] “one or more general purpose graphics processing unit (GPGPU) cores 262, and one or more load/store units 266. The GPGPU cores 262 and load/store units 266 are coupled with cache memory 272 and shared memory 270 via a memory and cache interconnect 268.” [Para 0077] “The register file 258 provides a set of registers for the functional units of the graphics multiprocessor 234. The register file 258 provides temporary storage for operands connected to the data paths of the functional units (e.g., GPGPU cores 262, load/store units 266) of the graphics multiprocessor 234.” Examiner reads the operands stored by the register file 258 as a respective vector register in the processor core (i.e., GPGPU). Fig. 2D shows that the load/store unit routes data to shared memory 270 and register file 258 (i.e., vector register) in the processor core. The additional cores and load/store units are interpreted as disclosing the second units.); and 
generating, by a matrix computation unit, accumulated values corresponding to the output for the neural network layer using the first and third vector values that are routed to the matrix computation unit in parallel along the first load-store data path and the first DMA data path of the shared memory, respectively ([Para 0195 and Fig 22] “In one embodiment the sparse compute accelerator unit 1423 is configured to perform matrix multiplications for neural networks having sparse weight values”); 
Nurvitadhi does not explicitly disclose: storing vectors of values at a first processor core of the circuit using a first vector memory of the first processor core, wherein the first vector memory is located within the first processor core and configured to store first vector values derived from the data provided by the first memory, and wherein the first processor core further comprises a first vector register located within the first processor core and configured to at least load data from or store data to the first vector memory; storing vectors of values at a second processor core of the circuit using a second vector memory of the second processor core, wherein the second vector memory is located within the second processor core and configured to store second vector values derived from the data provided by the first memory, and wherein the second processor core further comprises a second vector register located within the second processor core and configured to at least load data from or store data to the second vector memory;
However, Phelps discloses in the same field of endeavor: storing vectors of values at a first processor core of the circuit using a first vector memory of the first processor core, wherein the first vector memory is located within the first processor core and configured to store first vector values derived from the data provided by the first memory ([0036 and Fig 1A-B] “FIG. 1B shows a high-level example of compute core (101). The compute core can be a machine, i.e., a VLIW machine, that controls several compute units in parallel. Each compute core (101) contains: a scalar memory (104), a vector memory (108), a scalar processor (107), vector registers (106), and extended vector units (i.e., a matrix multiply unit (MXU) (113), a transpose unit (XU) (114), and a reduction and permutation unit (RPU) (116)).” Examiner reads core 103a shown in Fig. 1A as a first processor core. Vector memory 108 and vector register 106 are within core 103a as shown in Fig. 1B.), and wherein the first processor core further comprises a first vector register located within the first processor core and configured to at least load data from or store data to the first vector memory ([Para 0050] “The matrix multiply unit (113) can process weight inputs and activation inputs and provide a vector of outputs to the vector registers 106. The vector processing unit can process the vector of outputs and store a vector of processed outputs to the vector memory.” Examiner reads the vector register 106 as storing data (i.e. vector outputs) to the vector memory 108.); storing vectors of values at a second processor core of the circuit using a second vector memory of the second processor core, wherein the second vector memory is located within the second processor core and configured to store second vector values derived from the data provided by the first memory ([0036 and Fig 1A-B] “FIG. 1B shows a high-level example of compute core (101). The compute core can be a machine, i.e., a VLIW machine, that controls several compute units in parallel. Each compute core (101) contains: a scalar memory (104), a vector memory (108), a scalar processor (107), vector registers (106), and extended vector units (i.e., a matrix multiply unit (MXU) (113), a transpose unit (XU) (114), and a reduction and permutation unit (RPU) (116)).” Examiner reads core 103b shown in Fig. 1A as a second processor core.), and wherein the second processor core further comprises a second vector register located within the second processor core and configured to at least load data from or store data to the second vector memory ([Para 0050] “The matrix multiply unit (113) can process weight inputs and activation inputs and provide a vector of outputs to the vector registers 106. The vector processing unit can process the vector of outputs and store a vector of processed outputs to the vector memory.” Examiner reads the vector register 106 as storing data (i.e. vector outputs) to the vector memory 108.);
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the apparatus to perform machine learning operations disclosed by Nurvitadhi with the hardware for matrix multiplication taught by Phelps. One of ordinary skill in the art would have been motivated to make this modification in order to perform matrix multiplication using a hardware circuit (Abstract, Phelps).
Nurvitadhi in view of Phelps does not explicitly disclose: routing, using a first direct memory access (DMA) data path of a shared memory in the circuit, data communications comprising at least the first vector values between the shared memory and the first vector memory included in the first processor core; routing, using a second direct memory access (DMA) data path of the shared memory in the circuit, data communications comprising at least the second vector values between the shared memory and the second vector memory included in the second processor core; wherein the first and second DMA data paths and the first and second load-store data paths are established using wired connections, wherein at least a portion of the wired connections are shared between the first DMA data path and the second DMA data path; and at least another portion of the wired connections are shared between the first load-store data path and the second load-store data path.
However, Mills discloses in the same field of endeavor: routing, using a first direct memory access (DMA) data path of a shared memory in the circuit, data communications comprising at least the first vector values between the shared memory and the first vector memory included in the first processor core; routing, using a second direct memory access (DMA) data path of the shared memory in the circuit, data communications comprising at least the second vector values between the shared memory and the second vector memory included in the second processor core ([Para 0051 and Fig 3] “Kernel DMA 324 is a read circuit that fetches kernel data from a source (e.g., system memory 230) and sends kernel data 326A through 326N to each of the neural engines 314.” Examiner reads the kernel DMA as routing data between shared memory (i.e., system memory 230) and first vector memory in the first processor core (i.e., neural engine 314A). [Para 0087] “The neural engine 314 may further receive, from the kernel DMA 324 (kernel fetcher circuit), vector elements of the vector 904 as kernel data 326.” Fig. 4 discloses that the neural engine 314 includes a processor core. The DMA paths to neural engines 314A-314N disclose a second path.); wherein the first and second DMA data paths and the first and second load-store data paths are established using wired connections, wherein at least a portion of the wired connections are shared between the first DMA data path and the second DMA data path; and at least another portion of the wired connections are shared between the first load-store data path and the second load-store data path ([Para 0048-0054 and Fig 3] Examiner interprets the connections shown in figure 3 as disclosing the wired connections. Examiner also interprets the data buffers 318 and 320 as first/second load-store data paths.).
It would have been obvious for one of ordinary skill in the art before the effective filing date of the claimed invention to have combined the Apparatus to perform machine learning operations disclosed by Nurvitadhi with the Hardware for Matrix Multiplication taught by Phelps with the neural processor circuit disclosed by Mills. One of ordinary skill in the art would have been motivated to make this modification in order to perform vector computations in a neural engine circuit (Abstract, Mills).

Regarding Claim 24
Nurvitadhi in view of Phelps and Mills discloses: A non-transitory machine-readable storage device for implementing a neural network having multiple neural network layers on a circuit used to perform neural network computations ([Para 0145 and Fig 2A], Nurvitadhi) and for storing instructions that are executable by a processing device to cause performance of operations comprising: (The rest of the claim limitations correspond to method claim 13 and are rejected on the same grounds.)

Regarding Claim 2
Nurvitadhi in view of Phelps and Mills discloses: The circuit of claim 1, wherein: the circuit comprises a plurality of processor cores, the first processor core and the second processor core being among the plurality of processor cores ([Fig 2D “GPGPU Cores 262”]); and the shared memory comprises a plurality of memory resources that are physically distributed about the circuit to exchange data communications with each of the plurality of processor cores at the circuit ([Fig 2D “Shared Memory 270”], Nurvitadhi discloses in Fig. 2D a shared memory resource that is connected to cache memory 272 (i.e., memory resources) and GPGPU cores 262 (i.e., processor cores). [Fig 14]).

Regarding Claim 3
Nurvitadhi in view of Phelps and Mills discloses: The circuit of claim 2, wherein the shared memory comprises a shared memory control unit configured to: execute software instructions that cause a first portion of the plurality of memory resources to function as a DMA memory unit operable to move data between the first memory and each of the first processor core and the second processor core ([Para 0208 and Fig 18], Nurvitadhi “the hybrid memory module 1430 additionally includes a control processor 1802 and a primary memory controller 1805. The control processor 1802 and primary memory controller 1805 can work in concert with a DMA controller 1803 to enable a DMA memory transfer of data to, from, and between modules of the GPGPU local memory 1434A-1434B.”).

Regarding Claim 4
Nurvitadhi in view of Phelps and Mills discloses: The circuit of claim 3, wherein the plurality of memory resources comprises a second portion of resources that are configured to: receive data values that are routed along the first or second load-store data path ([Para 0209] Nurvitadhi “In one embodiment the control processor 1802 receives requests for incoming compute operations 1801 to be satisfied by the computational logic within one or more of the compute and memory controller units 1432A-1432B.”); and temporarily store the data values for a threshold number of processor cycles ([Para 0070 and 0209] Nurvitadhi “DMA controller 1803 can be used to transfer the data associated with the range of addresses from the different modules of the GPGPU local memory 1434A-1434B to a single module, with at least a portion of the data being stored in one or more cache memories 1806A-1806B within the primary memory controller 1805. The compute and memory controller units 1432A-1432B can then perform the required arithmetic operations to data stored in the cache memories 1806A-1806B, which may then be evicted back to the GPGPU local memory 1434A-1434B.”).

Regarding Claim 5
Nurvitadhi in view of Phelps and Mills discloses: The circuit of claim 4, wherein the second portion of resources are configured to: provide the data values to the first vector register of the first processor core or the second vector register of the second processor core in response to temporarily storing the data values for the threshold number of processor cycles ([Para 0232] Nurvitadhi “In one embodiment the sparse compute accelerator architecture 2100 is configured to operate on an arbitrarily large set input data (e.g., matrix, vector) that resides in external (e.g., off chip) memory, such as the GPGPU local memory 1434A-1434B as in FIG. 14.”).
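
For illustration only, the limitation of claims 4 and 5 (values are held for a threshold number of processor cycles and then provided to a vector register) can be sketched as a small staging buffer. This is a hypothetical model of the claim language under the assumption that "threshold" means a minimum hold time counted in cycles; `StagingBuffer`, `receive`, and `release` are invented names and do not appear in any cited reference.

```python
from collections import deque


class StagingBuffer:
    """Hypothetical buffer that holds each value for a minimum number of cycles."""

    def __init__(self, threshold_cycles: int) -> None:
        self.threshold_cycles = threshold_cycles
        self._entries = deque()  # each entry is (value, cycle_received)

    def receive(self, value, cycle: int) -> None:
        # Value arrives over a load-store data path at the given cycle.
        self._entries.append((value, cycle))

    def release(self, cycle: int) -> list:
        # Hand over, in arrival order, every value held for at least the threshold.
        ready = []
        while self._entries and cycle - self._entries[0][1] >= self.threshold_cycles:
            ready.append(self._entries.popleft()[0])
        return ready


buf = StagingBuffer(threshold_cycles=3)
buf.receive("v0", cycle=0)
buf.receive("v1", cycle=2)

early = buf.release(cycle=2)   # nothing has waited 3 cycles yet
later = buf.release(cycle=3)   # "v0" has now waited 3 cycles
final = buf.release(cycle=5)   # "v1" has now waited 3 cycles
```

The values returned by `release` would, in the claimed arrangement, be the ones forwarded to the first or second vector register.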

Regarding Claim 6
Nurvitadhi in view of Phelps and Mills discloses: The circuit of claim 1, wherein the shared memory comprises: a software-controlled staging resource that is formed from a subset of memory resources of the shared memory, the software-controlled staging resource is used to manage a flow of data values from the first memory to the first vector register of the first processor core or the second vector register of the second processor core ([Para 0075] Nurvitadhi “The graphics multiprocessor 234 has an execution pipeline including but not limited to an instruction cache 252, an instruction unit 254, an address mapping unit 256, a register file 258, one or more general purpose graphics processing unit (GPGPU) cores 262, and one or more load/store units 266”).

Regarding Claim 7
Nurvitadhi in view of Phelps and Mills discloses: The circuit of claim 6, wherein the circuit comprises a matrix computation unit configured to perform a subset of the computations to generate accumulated values that are used to generate the output for the layer of the neural network ([Para 0195 and Fig 22], Nurvitadhi “In one embodiment the sparse compute accelerator unit 1423 is configured to perform matrix multiplications for neural networks having sparse weight values”).

Regarding Claim 8
Nurvitadhi in view of Phelps and Mills discloses: The circuit of claim 7, wherein the software-controlled staging resource is used to manage the flow of the data values corresponding to vector arrays from the first memory to the matrix computation unit, wherein the vector arrays are derived from the data values provided by the first memory ([Para 0298] Nurvitadhi “In some embodiments, execution units 3152A-3152B are an array of vector processors having an instruction set for performing graphics and media operations. In some embodiments, execution units 3152A-3152B have an attached L1 cache 3151 that is specific for each array or shared between the arrays.”).

Regarding Claim 9
Nurvitadhi in view of Phelps and Mills discloses: The circuit of claim 1, wherein: the circuit comprises a vector processing unit that communicates with the first memory ([Para 0046, Para 0077, and Fig 2D “GPGPU core”] Nurvitadhi “the one or more parallel processor(s) 112 form a computationally focused parallel or vector processing system that can include a large number of processing cores and/or processing clusters”); the vector processing unit is configured to generate a vector of activation values from accumulated values generated at the circuit; and the vector of activation values corresponds to the output for the layer of the neural network ([Para 0202 and Fig 16] Nurvitadhi “The combination of the data points within the output buffer 1606 represents an activation map generated by the convolution. Each point within the activation map is generated by sliding the receptive field tile across the input volume buffer 1604. The activation map data can be input to an activation function to determine an output activation value”).

Regarding Claim 11
Nurvitadhi in view of Phelps and Mills discloses: The circuit of claim 1, wherein the shared memory is configured to function as a shared-global memory space comprising memory resources corresponding to memory banks that are shared between one or more processor cores of a plurality of processor cores ([Fig 2D, Fig 3A, Fig 16, and Fig 20], Nurvitadhi states in para 0084 “The graphics processor includes multiple sets of execution resources 356A-356D, where each set of execution resource includes multiple instruction units, register files, GPGPU cores, and load store units, as illustrated in FIG. 2D and FIG. 3A.”).

Regarding Claim 12
Nurvitadhi in view of Phelps and Mills discloses: The circuit of claim 1, wherein the data for performing computations to generate the output for a first layer of the neural network comprises: inputs to be processed through the first layer of the neural network ([Para 0165] “the first convolutional layer 904 of FIG. 9A can output to the second convolutional layer 906, while the second convolutional layer can output to a first layer of the fully connected layers 908.”); a respective set of weights for the first layer of the neural network ([Para 0139] “data received at the nodes of an input layer of a feedforward network are propagated (i.e., “fed forward”) to the nodes of the output layer via an activation function that calculates the states of the nodes of each successive layer in the network based on coefficients (“weights”) respectively associated with each of the edges connecting the layers.”); and instructions for processing one or more of the inputs through the first layer using the respective set of weights for the first layer to generate the output for the first layer ([Para 0139] “Depending on the specific model being represented by the algorithm being executed, the output from the neural network algorithm can take various forms.”).
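
For illustration only, the three kinds of data recited in claim 12 (layer inputs, a set of weights, and instructions for processing the inputs with those weights) can be sketched as a single fully connected layer. This is a generic textbook feedforward computation, assumed here for the example; the function names `dense_layer` and `relu` and the numeric values are hypothetical and are not drawn from the cited references.

```python
def relu(value: float) -> float:
    """Rectified linear activation, chosen here only as an example."""
    return value if value > 0.0 else 0.0


def dense_layer(inputs, weights, activation):
    """Compute one layer: each output is activation(sum of weight * input)."""
    outputs = []
    for row in weights:
        # Accumulated value for one output node.
        accumulated = sum(w * x for w, x in zip(row, inputs))
        outputs.append(activation(accumulated))
    return outputs


layer_inputs = [1.0, 2.0]                   # inputs to the first layer
layer_weights = [[0.5, -1.0], [1.0, 1.0]]   # one row of weights per output node
layer_output = dense_layer(layer_inputs, layer_weights, relu)
```

Here `dense_layer` plays the role of the "instructions", and the accumulated sums correspond to the accumulated values from which activation values are generated.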

Regarding Claim 16
Nurvitadhi in view of Phelps and Mills discloses: The method of claim 13, wherein the circuit comprises a plurality of processor cores and the shared memory comprises a plurality of memory resources that are physically distributed about the circuit and the method comprises: using the plurality of memory resources of the shared memory to exchange data communications between the first memory and each of the plurality of processor cores ([Fig 2D “Shared Memory 270”], Nurvitadhi discloses in Fig. 2D a shared memory resource that is connected to cache memory 272 (i.e., memory resources) and GPGPU cores 262 (i.e., processor cores). [Fig 14]).

Regarding Claim 17
Nurvitadhi in view of Phelps and Mills discloses: The method of claim 16, wherein the shared memory comprises a shared memory control unit and the method comprises: causing a first portion of resources of the plurality of memory resources to function as a DMA memory unit based on instructions executed by the shared memory control unit ([Para 0208 and Fig 18], Nurvitadhi “the hybrid memory module 1430 additionally includes a control processor 1802 and a primary memory controller 1805. The control processor 1802 and primary memory controller 1805 can work in concert with a DMA controller 1803 to enable a DMA memory transfer of data to, from, and between modules of the GPGPU local memory 1434A-1434B.”); and using a representative DMA function of the first portion of resources to move data between the first memory and each of the first processor core and the second processor core ([Para 0051 and Fig 3], Mills “Kernel DMA 324 is a read circuit that fetches kernel data from a source (e.g., system memory 230) and sends kernel data 326A through 326N to each of the neural engines 314.”).

Regarding Claim 18
Nurvitadhi in view of Phelps and Mills discloses: The method of claim 17, comprising: receiving, by a second portion of resources of the plurality of memory resources, the third vector values and the fourth vector values that are routed along the first and second load-store data path respectively; temporarily storing, using the second portion of resources, the third vector values for a threshold number of processor cycles; and temporarily storing, using the second portion of resources, the fourth vector values for a threshold number of processor cycles ([Para 0070 and 0209] Nurvitadhi “DMA controller 1803 can be used to transfer the data associated with the range of addresses from the different modules of the GPGPU local memory 1434A-1434B to a single module, with at least a portion of the data being stored in one or more cache memories 1806A-1806B within the primary memory controller 1805. The compute and memory controller units 1432A-1432B can then perform the required arithmetic operations to data stored in the cache memories 1806A-1806B, which may then be evicted back to the GPGPU local memory 1434A-1434B.”).

Regarding Claim 19
Nurvitadhi in view of Phelps and Mills discloses: The method of claim 18, comprising: providing, using the second portion of resources, the third vector values to the respective vector register of the first processor core in response to temporarily storing the third vector values for the threshold number of processor cycles; and providing, using the second portion of resources, the fourth vector values to the respective vector register of the second processor core in response to temporarily storing the fourth vector values for the threshold number of processor cycles ([Para 0070 and 0209] Nurvitadhi “DMA controller 1803 can be used to transfer the data associated with the range of addresses from the different modules of the GPGPU local memory 1434A-1434B to a single module, with at least a portion of the data being stored in one or more cache memories 1806A-1806B within the primary memory controller 1805. The compute and memory controller units 1432A-1432B can then perform the required arithmetic operations to data stored in the cache memories 1806A-1806B, which may then be evicted back to the GPGPU local memory 1434A-1434B.”).

Regarding Claim 20
Nurvitadhi in view of Phelps and Mills discloses: The method of claim 13, wherein the shared memory comprises a software-controlled staging resource formed from a subset of memory resources of the shared memory, and the method comprises: managing, using the software-controlled staging resource, data flows from the first memory to the respective first vector register of the first processor core and data flows from the first memory to the second respective vector register of the second processor core ([Fig. 2D (266)] Nurvitadhi “one or more general purpose graphics processing unit (GPGPU) cores 262, and one or more load/store units 266. The GPGPU cores 262 and load/store units 266 are coupled with cache memory 272 and shared memory 270 via a memory and cache interconnect 268.” [Para 0077] “The register file 258 provides a set of registers for the functional units of the graphics multiprocessor 234. The register file 258 provides temporary storage for operands connected to the data paths of the functional units (e.g., GPGPU cores 262, load/store units 266) of the graphics multiprocessor 234.” Examiner reads the operands stored by the register file 258 as a respective vector register in the processor core (i.e., GPGPU).).

Regarding Claim 21
Nurvitadhi in view of Phelps and Mills discloses: The method of claim 20, wherein the circuit comprises a matrix computation unit and the method comprises: generating, using the matrix computation unit ([Para 0195 and Fig 22] Nurvitadhi “In one embodiment the sparse compute accelerator unit 1423 is configured to perform matrix multiplications for neural networks having sparse weight values”), accumulated values in response to performing a subset of the computations to generate the output for the neural network layer ([Para 0195 and Fig 22] Nurvitadhi “In one embodiment the sparse compute accelerator unit 1423 is configured to perform matrix multiplications for neural networks having sparse weight values”).

Regarding Claim 22
Nurvitadhi in view of Phelps and Mills discloses: The method of claim 21, comprising: managing, using the software-controlled staging resource, data flows from the first memory to the matrix computation unit, wherein the data flows comprise vector arrays that are derived from the data provided by the first memory ([Para 0298] Nurvitadhi “In some embodiments, execution units 3152A-3152B are an array of vector processors having an instruction set for performing graphics and media operations. In some embodiments, execution units 3152A-3152B have an attached L1 cache 3151 that is specific for each array or shared between the arrays.”).

Regarding Claim 23
Nurvitadhi in view of Phelps and Mills discloses: The method of claim 21, wherein: the circuit comprises a vector processing unit intermediate the first memory and the matrix computation unit; the method comprises generating, by the vector processing unit, a vector of activation values from the accumulated values generated by the matrix computation unit; and the vector of activation values corresponds to the output for the neural network layer ([Para 0202 and Fig 16] Nurvitadhi “The combination of the data points within the output buffer 1606 represents an activation map generated by the convolution. Each point within the activation map is generated by sliding the receptive field tile across the input volume buffer 1604. The activation map data can be input to an activation function to determine an output activation value”).

Regarding Claim 26
Nurvitadhi in view of Phelps and Mills discloses: The circuit of claim 6, wherein the software-controlled staging resource is configured to load data from the shared memory in a first phase, and provide the loaded data to the first or second vector register in a second phase ([Para 0075-0078 and Fig 2D] Nurvitadhi “The graphics multiprocessor 234 has an execution pipeline including but not limited to an instruction cache 252, an instruction unit 254, an address mapping unit 256, a register file 258, one or more general purpose graphics processing unit (GPGPU) cores 262, and one or more load/store units 266. The GPGPU cores 262 and load/store units 266 are coupled with cache memory 272 and shared memory 270 via a memory and cache interconnect 268.” [Para 0076] “An instruction can access any of a local, shared, or global address space by specifying an address within a unified address space. The address mapping unit 256 can be used to translate addresses in the unified address space into a distinct memory address that can be accessed by the load/store units 266.” Examiner reads load/store unit 266 as loading data from shared memory 270 and providing the data to register file 258.).

Regarding Claim 27
Nurvitadhi in view of Phelps and Mills discloses: The method of claim 20, wherein the software-controlled staging resource is configured to load data from the shared memory in a first phase, and provide the loaded data to the first or second vector register in a second phase ([Para 0075-0078 and Fig 2D] Nurvitadhi “The graphics multiprocessor 234 has an execution pipeline including but not limited to an instruction cache 252, an instruction unit 254, an address mapping unit 256, a register file 258, one or more general purpose graphics processing unit (GPGPU) cores 262, and one or more load/store units 266. The GPGPU cores 262 and load/store units 266 are coupled with cache memory 272 and shared memory 270 via a memory and cache interconnect 268.” [Para 0076] “An instruction can access any of a local, shared, or global address space by specifying an address within a unified address space. The address mapping unit 256 can be used to translate addresses in the unified address space into a distinct memory address that can be accessed by the load/store units 266.” Examiner reads load/store unit 266 as loading data from shared memory 270 and providing the data to register file 258.).

Regarding Claim 28
Nurvitadhi in view of Phelps and Mills discloses: The non-transitory machine-readable storage device of claim 24, wherein the shared memory comprises a software-controlled staging resource formed from a subset of memory resources of the shared memory, wherein the software-controlled staging resource is configured to load data from the shared memory in a first phase, and provide the loaded data to the first or second vector register in a second phase ([Para 0075-0078 and Fig 2D] Nurvitadhi “The graphics multiprocessor 234 has an execution pipeline including but not limited to an instruction cache 252, an instruction unit 254, an address mapping unit 256, a register file 258, one or more general purpose graphics processing unit (GPGPU) cores 262, and one or more load/store units 266. The GPGPU cores 262 and load/store units 266 are coupled with cache memory 272 and shared memory 270 via a memory and cache interconnect 268.” [Para 0076] “An instruction can access any of a local, shared, or global address space by specifying an address within a unified address space. The address mapping unit 256 can be used to translate addresses in the unified address space into a distinct memory address that can be accessed by the load/store units 266.” Examiner reads load/store unit 266 as loading data from shared memory 270 and providing the data to register file 258.).

Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Nurvitadhi et al. (US 20180315158, hereinafter "Nurvitadhi") in view of Phelps et al. (US 20180336165 A1, hereinafter "Phelps"), Mills et al. (US 2019/0340486, hereinafter "Mills"), and Gu (US 20200151088, hereinafter "Gu").

Regarding Claim 10
Nurvitadhi in view of Phelps and Mills discloses: The circuit of claim 6,
Nurvitadhi in view of Phelps and Mills does not explicitly disclose: wherein: the software-controlled staging resource is a first-in-first-out (FIFO) memory structure along a load section of the load-store data path; and the FIFO memory structure is configured to temporarily store a vector of values for a threshold number of processor cycles before routing the vector of values to the first vector register of the first processor core or the second vector register of the second processor core.
However, Gu discloses in the same field of endeavor: wherein: the software-controlled staging resource is a first-in-first-out (FIFO) memory structure along a load section of the load-store data path ([Para 0035] “For example, for a series type of DNN, the architecture code may define at least one convolutional (Conv) processor and at least one fully connected (FC) processor that are interconnected by a memory module, such as a First In First Out (FIFO) memory module.”); and the FIFO memory structure is configured to temporarily store a vector of values for a threshold number of processor cycles before routing the vector of values to the first vector register of the first processor core or the second vector register of the second processor core ([Para 0149] “one instruction may execute a conv layer in the Conv processor 124. The execution of the conv layer, however, takes many cycles. The instruction may direct the Conv processor 124 to emit for example a Done signal, e.g., to the inter-processor FIFO 306 when the execution finishes. The Conv processor 124 may be blocked from performing another operation/layer until the data has been moved, e.g., into the FIFO 306 or the Conv buffer module 314.”).
It would have been obvious to one of ordinary skill in the art at the time of filing to combine Nurvitadhi, Phelps, Mills, and Gu. Doing so would allow the processor cores to be interconnected by a First In First Out (FIFO) memory (Gu, Abstract).

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Lacy et al. (US 10261796 B2) also describes a processor core with vector memory and vector registers (Col. 8, lines 1-30).
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to TEWODROS E MENGISTU whose telephone number is (571)270-7714. The examiner can normally be reached Mon-Fri 9:30-5:30.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, ABDULLAH AL KAWSAR, can be reached on (571)270-3169. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/TEWODROS E MENGISTU/Examiner, Art Unit 2127
/ABDULLAH AL KAWSAR/Supervisory Patent Examiner, Art Unit 2127