DETAILED ACTION
Status of Claims 
Claims 1-20 have been considered. It is hereby acknowledged that the following papers have been received and placed of record in the file:
Abstract 							-Receipt Date 07/15/2020
Application Data Sheet 						-Receipt Date 07/15/2020
Claims 								-Receipt Date 07/15/2020
Drawings-only black and white line drawings			-Receipt Date 07/15/2020
Information Disclosure Statement (IDS) 				-Receipt Date 07/15/2020
Specification							-Receipt Date 07/15/2020

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement
The information disclosure statement (IDS) submitted on 07/15/2020 is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.

Claim Objections
Claims 2-6, 9, 10, and 15-20 are objected to because of the following informalities:
Claims 2 and 3- “simultaneously operates” should be “simultaneously operate”
Claims 2 and 15- “the group load register file” should be “a group load register file”
Claims 3, 5, and 16- “wherein” should be preceded by a comma
Claims 3 and 16- “the plurality of source operands” should be “a plurality of source operands”
Claims 3 and 16- “the weight register file” should be “a weight register file”
Claims 4 and 17- “the SOMAC operation” should be “the SOMAC instruction”
Claims 5 and 18- “the plurality of operands” should be “a plurality of operands”
Claims 5 and 18- “the multiply-accumulate result” should be “a multiply-accumulate result”
Claims 5 and 18- “the output register file” should be “an output register file”
Claims 5 and 18- “the instruction” should be “the SOMAC instruction”
Claims 6 and 19- “a number of iterations” should be “the number of iterations”
Claim 6- “a second subset of inputs of the output tensor” should be “a second subset of outputs of the output tensor”
Claims 9 and 10- “the output registers” should be “output registers”
Claim 18- remove space preceding the comma in “adding ,” and “writing ,”
Claim 20- “the group register” should be “the group load register”
Appropriate correction is required.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-6 and 14-19 are rejected under 35 U.S.C. 103 as being unpatentable over Shao et al. US 2020/0293867 (hereinafter, Shao) in view of Wilder et al. US 2010/0274990 (hereinafter, Wilder).
Regarding claim 1, Shao teaches:
1. A graph streaming processor (Fig. 2), comprising: 
a data cache, the data cache comprising an input tensor, a weight tensor and an output tensor ([0063] and [0090]: input activations, weights, and partial sums, i.e. an input tensor, weight tensor, and output tensor respectively, are stored in global L2 cache); 
a plurality of processors ([0037]: vector mac units 402); 
a group load register file operative to load a subset of inputs of the input tensor ([0029] and [0052]: the input collector 528 is a group load register file which loads a subset of the input activations received from cache), wherein the group load register file provides the subset of the inputs of the input tensor to all of the plurality of processors ([0040] and [0049]: the same input is provided to each of the vector macs, see also Fig. 5 showing the collector 528 providing inputs to all the vector mac units 402); 
a plurality of weight data registers operative to load a subset of weights of the weight tensor ([0029] and [0048]: a register in each of the weight collectors of each of the vector macs is a plurality of weight data registers which load a subset of the weights received from cache), wherein each of the plurality of weight data registers provide a weight to a single of the plurality of processors ([0029] and [0048]: each weight collector, including the registers in each weight collector, provides a weight to its respective vector mac unit, see also [0040]); 
the plurality of processors operative to perform a SOMAC (Sum-Of-Multiply-Accumulate) operation ([0040]-[0041]: the vector macs each perform a mac operation and operate collectively to perform a sum of multiply accumulate operation in which the outputs of each vector mac are accumulated)
	While Shao teaches input collectors and weight collectors which are register files ([0029]), Shao does not explicitly teach a register of the input collector loading a subset of inputs. Further, while Shao teaches performing a SOMAC operation, Shao does not explicitly teach a SOMAC instruction. That is, Shao does not teach:
		a group load data register to load a subset of the input tensor;
a SOMAC instruction, including each of the plurality of processors simultaneously operating to: 
determine an instruction size of the SOMAC instruction, wherein the instruction size indicates a number of iterations that the SOMAC instruction is to be executed and is equal to a number of outputs within a subset of a plurality of outputs of the output tensor.
	However, Wilder teaches:
an input register providing inputs to its SIMD MAC circuit ([0073]: a vector of coefficient data elements is provided by a SIMD register); 
a SOMAC instruction ([0062]: a repeating MAC instruction is provided, i.e., a SOMAC instruction): 
indicating an instruction size of the SOMAC instruction, wherein the instruction size indicates a number of iterations that the SOMAC instruction is to be executed ([0062]: a scalar value M in the instruction indicates a number of iterations the repeating MAC is to be executed) and is equal to a number of outputs ([0073]: the scalar value is equal to the number of MAC results N).
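For illustration only, the claim 1 elements discussed above (a group load register shared by all processors, a private weight register per processor, and an instruction size that sets both the iteration count and the number of outputs) can be modeled as in the following sketch. It assumes a 1-D convolution in which each processor holds one filter and each iteration produces one output per processor; the names `somac`, `group_load_reg`, `weight_regs`, `out_regs`, and `size` are hypothetical and are not taken from Shao, Wilder, or the application.

```python
def somac(group_load_reg, weight_regs, out_regs, size):
    """Hypothetical model of one SOMAC instruction (illustration only).

    group_load_reg: input subset shared by (broadcast to) every processor.
    weight_regs:    weight_regs[p] is the filter held only by processor p.
    out_regs:       out_regs[p][i] is processor p's accumulator for output i.
    size:           instruction size = number of iterations = outputs per processor.
    """
    taps = len(weight_regs[0])
    # Assumed layout: the group load register holds size + taps - 1 inputs.
    if len(group_load_reg) < size + taps - 1:
        raise ValueError("group load register too small for instruction size")
    for i in range(size):  # one iteration per output in the subset
        window = group_load_reg[i:i + taps]  # same operands seen by every processor
        for p, w in enumerate(weight_regs):  # in hardware these run simultaneously
            partial = sum(x * wk for x, wk in zip(window, w))  # sum of multiplies
            out_regs[p][i] += partial  # read destination, add, write back
    return out_regs


# Two processors, three outputs each, two-tap filters.
print(somac([1, 2, 3, 4], [[1, 1], [1, -1]], [[0, 0, 0], [0, 0, 0]], size=3))
# prints [[3, 5, 7], [-1, -1, -1]]
```

In this model every processor reads the same input window, so only one copy of the inputs is needed, while each processor's weights remain private; the single `size` value fixes both the loop count and how many accumulators each processor updates.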


	Regarding claim 2, Shao in view of Wilder teaches:
2. The graph streaming processor of claim 1, wherein each of the plurality of processors further simultaneously operates to: 
read a first source operand of a plurality of source operands of the SOMAC instruction from the group load register file (Wilder [0062]: the repeating mac instruction includes a source operand vc which is read from the register file/input collector of Shao simultaneously in the combination, see Shao [0029] and [0052]), wherein the first source operand is one of the subset of inputs of the input tensor (Shao [0048]-[0049]: the vector macs operate on the same inputs, which are a subset of the inputs stored in the activation buffer/cache, see also [0090]).

	Regarding claim 3, Shao in view of Wilder teaches:
3. The graph streaming processor of claim 1, wherein each of the plurality of processors further simultaneously operates to: 
read a second source operand of the plurality of source operands of the SOMAC instruction from the weight register file (Wilder [0062]: the repeating mac instruction includes a source operand vd which is read from the register file/weight collector of Shao simultaneously in the combination, see Shao [0048]) wherein the second source operand is one of the subset of weights of the weight tensor (Shao [0048]-[0049]: the weights the vector macs operate on are a subset of the weights stored in the weight buffer/cache, see also [0090]).

	Regarding claim 4, Shao in view of Wilder teaches:
4. The graph streaming processor of claim 1, wherein each of the plurality of processors further simultaneously operate to: 
execute multiply and accumulate operations of the SOMAC operation for the number of iterations (Shao [0040]-[0041]: the vector macs perform multiple multiply and accumulate operations each cycle, i.e. simultaneously, and, in the combination with Wilder, the mac operations are performed for a number of iterations of a repeating mac instruction, see Wilder [0062] and [0073]).

	Regarding claim 5, Shao in view of Wilder teaches:
5. The graph streaming processor of claim 4, wherein each of the plurality of processors further simultaneously operate to: 
read a destination operand of the plurality of operands of the SOMAC instruction from one of a plurality of output registers (Wilder [0062]: the repeating mac instruction reads destination operand vacc from one of a plurality of registers in the accumulation collector, in the combination, see Shao [0029] and [0045], and stores a result into vacc) wherein the destination operand is one of the subset of outputs of the output tensor (Shao [0054]: the partial sum, stored in vacc in the combination, is one of the subset of outputs, produced by the vector mac, of the partial sum/output tensor stored in cache, see also Shao [0090]); 
add a sum-of-multiply result to the destination operand (Wilder [0064]: a sum of multiply result is added to vacc); 
write the multiply-accumulate result back to the destination operand (Wilder [0064]: the result is written back to vacc), wherein the destination operand is a register from the output register file that is an output of the instruction (Wilder [0062] and [0064]: vacc is output from the repeating mac instruction and is a register from the register file of the accumulation collector, see Shao [0029] and [0045] and further Wilder [0082] describing the SIMD accumulate registers).

	Regarding claim 6, Shao in view of Wilder teaches: 
6. The graph streaming processor of claim 1, wherein the graph streaming processor further includes a second plurality of processors (Shao [0033] and [0037]: a dice includes a plurality of processing elements where the plurality of vector mac units on a second processing element is a second plurality of processors), wherein the graph streaming processor further comprises: 
a second group load register operative to load a second subset of the inputs of the input tensor (Shao [0029] and [0052]: a register from an input collector of the second processing element is a group load register which loads a second subset of the input activations received from cache, see also Fig. 8 showing that different processing elements receive different input activation subsets, see also Wilder [0062] which teaches an input register vc), wherein the second group load register provides the second subset of the inputs of the input tensor to all of the second plurality of processors (Shao [0040] and [0049]: the same input, which is a register in the combination, is provided to each of the vector macs, see also Fig. 5 showing the collector 528 providing inputs to all the vector mac units 402); 
a second plurality of weight registers operative to load a second subset of weights of the weight tensor (Shao [0029] and [0048]: a register in each of the weight collectors of each of the vector macs in the second processing element is a second plurality of weight data registers which load a second subset of the weights received from cache, see Shao [0065] describing that each processing element stores a portion of the weights), wherein each of the second plurality of weight data registers provide a weight to a single of the second plurality of processors (Shao [0029] and [0048]: each weight collector, including the registers in each weight collector, provides a weight to its respective vector mac unit, see also [0040]); 
wherein the second plurality of processors operate to perform the SOMAC (Sum-Of-Multiply-Accumulate) (Shao [0040]-[0041]: the vector macs each perform a mac operation and operate collectively to perform a sum of multiply accumulate operation in which the outputs of each vector mac are accumulated) instruction (Wilder [0062]: the SOMAC operation in Shao is performed using a repeating mac instruction taught by Wilder in the combination), including each of the second plurality of processors simultaneously operating to: 
determine the instruction size of the SOMAC instruction, wherein the instruction size indicates a number of iterations that the SOMAC instruction is to be executed (Wilder [0062]: a scalar value M in the instruction indicates a number of iterations the repeating MAC is to be executed; in the combination, the vector mac units of Shao are modified to simultaneously determine this scalar included in the instruction) and is equal to a number of outputs within a second subset of inputs of the output tensor (Wilder [0073]: the scalar value is equal to the number of MAC results N, which, in the combination, are within a second subset of the output tensor, namely the partial results produced by all the vector macs of the second processing element, out of the partial results of the output tensor produced by all the processing elements).

	Regarding claim 14, Shao teaches:
14. A method of graph streaming processing, comprising: 
loading, by a group load register file, a subset of inputs of an input tensor ([0029] and [0052]: the input collector 528 is a group load register file which loads a subset of the input activations received from cache) from a data cache ([0063] and [0090]: input activations, weights, and partial sums, i.e. an input tensor, weight tensor, and output tensor respectively, are stored in global L2 cache), wherein the group load register file provides the subset of inputs of the input tensor to all of a plurality of processors ([0040] and [0049]: the same input is provided to each of the vector macs, see also Fig. 5 showing the collector 528 providing inputs to all the vector mac units 402); 
loading, by a plurality of weight data registers, a subset of weights of a weight tensor ([0029] and [0048]: a register in each of the weight collectors of each of the vector macs is a plurality of weight data registers which load a subset of the weights received from cache), wherein each of the weight data registers provide a weight to a single of the plurality of processors ([0029] and [0048]: each weight collector, including the registers in each weight collector, provides a weight to its respective vector mac unit, see also [0040]); 
performing, by the plurality of processors, a SOMAC (Sum-Of-Multiply-Accumulate) operation ([0040]-[0041]: the vector macs each perform a mac operation and operate collectively to perform a sum of multiply accumulate operation in which the outputs of each vector mac are accumulated)
	While Shao teaches input collectors and weight collectors which are register files ([0029]), Shao does not explicitly teach a register of the input collector loading a subset of inputs. Further, while Shao teaches performing a SOMAC operation, Shao does not explicitly teach a SOMAC instruction. That is, Shao does not teach:
loading, by a group load register, a subset of inputs of an input tensor;
performing a SOMAC (Sum-Of-Multiply-Accumulate) instruction, including: 
simultaneously determining, by each of the plurality of processors, an instruction size of the SOMAC instruction, wherein the instruction size indicates a number of iterations that the SOMAC instruction is to be executed and is equal to a number of outputs within a subset of an output tensor.
	However, Wilder teaches:
an input register providing inputs to its SIMD MAC circuit ([0073]: a vector of coefficient data elements is provided by a SIMD register); 
a SOMAC instruction ([0062]: a repeating MAC instruction is provided, i.e., a SOMAC instruction): 
indicating an instruction size of the SOMAC instruction, wherein the instruction size indicates a number of iterations that the SOMAC instruction is to be executed ([0062]: a scalar value M in the instruction indicates a number of iterations the repeating MAC is to be executed) and is equal to a number of outputs ([0073]: the scalar value is equal to the number of MAC results N).
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the processing elements of Shao to use a register from the input collector to provide inputs to the vector macs as taught by Wilder and to further modify the vector mac 

	Regarding claim 15, Shao in view of Wilder teaches:
15. The method of claim 14, further comprising: 
reading, by each of the plurality of processors, a first source operand of a plurality of source operands of the SOMAC instruction from the group load register file (Wilder [0062]: the repeating mac instruction includes a source operand vc which is read from the register file/input collector of Shao simultaneously in the combination, see Shao [0029] and [0052]), wherein the first source operand is one of the subset of inputs of the input tensor (Shao [0048]-[0049]: the vector macs operate on the same inputs, which are a subset of the inputs stored in the activation buffer/cache, see also [0090]).

	Regarding claim 16, Shao in view of Wilder teaches:
16. The method of claim 14, further comprising: 
reading, by each of the plurality of processors, a second source operand of the plurality of source operands of the SOMAC instruction from the weight register file (Wilder [0062]: the repeating mac instruction includes a source operand vd which is read from the register file/weight collector of Shao simultaneously in the combination, see Shao [0048]) wherein the second source operand is one of the subset of the weights of the weight tensor (Shao [0048]-[0049]: the weights the vector macs operate on are a subset of the weights stored in the weight buffer/cache, see also [0090]).

	Regarding claim 17, Shao in view of Wilder teaches:
17. The method of claim 14, further comprising: 
executing, by each of the plurality of processors, multiply and accumulate operations of the SOMAC operation for the number of iterations (Shao [0040]-[0041]: the vector macs perform multiple multiply and accumulate operations each cycle, i.e. simultaneously, and, in the combination with Wilder, the mac operations are performed for a number of iterations of a repeating mac instruction, see Wilder [0062] and [0073]).

	Regarding claim 18, Shao in view of Wilder teaches:
18. The method of claim 17, further comprising: 
reading, by each of the plurality of processors, a destination operand of the plurality of operands of the SOMAC instruction from the output register file (Wilder [0062]: the repeating mac instruction reads destination operand vacc from one of a plurality of registers in the accumulation collector/output register file, in the combination, see Shao [0029] and [0045], and stores a result into vacc), wherein the destination operand is one of the subset of outputs of the output tensor (Shao [0054]: the partial sum, stored in vacc in the combination, is one of the subset of outputs, produced by the vector mac, of the partial sum/output tensor stored in cache, see also Shao [0090]); 
adding, by each of the plurality of processors, a sum-of-multiply result to the destination operand (Wilder [0064]: a sum of multiply result is added to vacc); 
writing, by each of the plurality of processors, the multiply-accumulate result back to the destination operand (Wilder [0064]: the result is written back to vacc), wherein the destination operand is a register from the output register file that is an output of the instruction (Wilder [0062] and [0064]: vacc is output from the repeating mac instruction and is a register from the register file of the accumulation collector, see Shao [0029] and [0045] and further Wilder [0082] describing the SIMD accumulate registers).

	Regarding claim 19, Shao in view of Wilder teaches:
19. The method of claim 14, further comprising: 
loading, by a second group load register, a second subset of the inputs of the input tensor (Shao [0029] and [0052]: a register from an input collector of the second processing element is a group load register which loads a second subset of the input activations received from cache, see also Fig. 8 showing that different processing elements receive different input activation subsets, see also Wilder [0062] which teaches an input register vc), wherein the second group load register provides the second subset of the inputs of the input tensor to all of a second plurality of processors (Shao [0040] and [0049]: the same input, which is a register in the combination, is provided to each of the vector macs, see also Fig. 5 showing the collector 528 providing inputs to all the vector mac units 402; Shao [0033] and [0037]: a dice includes a plurality of processing elements where the plurality of vector mac units on a second processing element is a second plurality of processors); 
loading, by a second plurality of weight registers, a second subset of the weights of the weight tensor (Shao [0029] and [0048]: a register in each of the weight collectors of each of the vector macs in the second processing element is a second plurality of weight data registers which load a second subset of the weights received from cache, see Shao [0065] describing that each processing element stores a portion of the weights), wherein each of the second plurality of weight data registers provide a weight to a single of the second plurality of processors (Shao [0029] and [0048]: each weight collector, including the registers in each weight collector, provides a weight to its respective vector mac unit, see also [0040]); and 
performing, by the second plurality of processors, the SOMAC (Sum-Of-Multiply-Accumulate) (Shao [0040]-[0041]: the vector macs each perform a mac operation and operate collectively to perform a sum of multiply accumulate operation in which the outputs of each vector mac are accumulated) instruction (Wilder [0062]: the SOMAC operation in Shao is performed using a repeating mac instruction taught by Wilder in the combination), including each of the second plurality of processors simultaneously determining the instruction size of the SOMAC instruction, wherein the instruction size indicates a number of iterations that the SOMAC instruction is to be executed (Wilder [0062]: a scalar value M in the instruction indicates a number of iterations the repeating MAC is to be executed; in the combination, the vector mac units of Shao are modified to simultaneously determine this scalar included in the instruction) and is equal to a number of outputs within a second subset of the output tensor (Wilder [0073]: the scalar value is equal to the number of MAC results N, which, in the combination, are within a second subset of the output tensor, namely the partial results produced by all the vector macs of the second processing element, out of the partial results of the output tensor produced by all the processing elements).

Claims 7-13 and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Shao et al. US 2020/0293867 (hereinafter, Shao) in view of Wilder et al. US 2010/0274990 (hereinafter, Wilder) and Faanes et al. US 2012/0221830 (hereinafter, Faanes).
	Regarding claim 7, Shao in view of Wilder teaches:
7. The graph streaming processor of claim 1, 
	Shao in view of Wilder does not explicitly teach:
wherein a size of the group load register is dependent on a number of inputs within the subset of inputs of the input tensor.
	However, Faanes teaches configuring a size of a vector register ([0025]: the register space may be repartitioned including changing the maximum vector register length)
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the group load register of Shao in view of Wilder to be reconfigurable as taught by Faanes such that the size of the group load register is changed to match the number of inputs V of the subset of input activations of Shao. One of ordinary skill in the art would have been motivated to make this modification to allow for greater execution efficiency (Faanes [0026]), for example, by eliminating empty space within the register. 

	Regarding claim 8, Shao in view of Wilder teaches:
8. The graph streaming processor of claim 1, including a number of threads concurrently running on the plurality of processors (Shao [0127]: a number of threads run concurrently on the streaming multiprocessor implementation)
	Shao in view of Wilder does not explicitly teach:
wherein a size of the group load register is dependent on a number of threads concurrently running on the plurality of processors.
	However, Faanes teaches configuring a size of a vector register dependent on a threaded mode ([0025]: the register space may be repartitioned including changing the maximum vector register length; [0019]: the registers may have a maximum length of 16 in vector threaded mode or 1 in scalar threaded mode)
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the group load register of Shao in view of Wilder to be reconfigurable as taught by Faanes such that the register space of the group load register is divided evenly among the threads and the size of the group load register is based on the number of threads. One of ordinary skill in the art would have been motivated to make this modification to allow for greater execution efficiency (Faanes [0026]), for example, by ensuring each thread receives the same amount of input in the group load register. 
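The even division of register space among concurrently running threads relied on above (and, by the same reasoning, for claims 10 and 13) can be sketched as follows. This is a hypothetical illustration assuming a register file of `total_entries` entries split into equal contiguous ranges; it is not Faanes's repartitioning mechanism, and the function name and the 64-entry size are assumptions.

```python
def partition_register_space(total_entries, num_threads):
    """Hypothetical even split of a register file among concurrent threads.

    Returns (per_thread_size, ranges), where ranges[t] is the half-open
    (start, end) slice assigned to thread t. Entries that do not divide
    evenly are left unused in this sketch.
    """
    if num_threads <= 0:
        raise ValueError("num_threads must be positive")
    per_thread = total_entries // num_threads
    ranges = [(t * per_thread, (t + 1) * per_thread) for t in range(num_threads)]
    return per_thread, ranges


# Per-thread register size shrinks as more threads run concurrently.
for n in (1, 2, 4):
    size, ranges = partition_register_space(64, n)
    print(n, size, ranges)
```

Here the per-thread register size is a direct function of the number of threads, which is the dependency recited in claims 8, 10, and 13.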

	Regarding claim 9, Shao in view of Wilder teaches:
9. The graph streaming processor of claim 1, 
	Shao in view of Wilder does not teach: 
wherein a size of the output registers is dependent on a number of outputs within the subset of outputs of the output tensor.
	However, Faanes teaches configuring a size of a vector register ([0025]: the register space may be repartitioned including changing the maximum vector register length)
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify an output register in the accumulation collector of Shao to be reconfigurable as taught by Faanes such that the size of the output register is changed to match the number of outputs from the vector macs. One of ordinary skill in the art would have been motivated to 

	Regarding claim 10, Shao in view of Wilder teaches:
10. The graph streaming processor of claim 1, including a number of threads concurrently running on the plurality of processors (Shao [0127]: a number of threads run concurrently on the streaming multiprocessor implementation)
	Shao in view of Wilder does not teach:
wherein a size of the output registers is dependent on a number of threads concurrently running on the plurality of processors.
	However, Faanes teaches configuring a size of a vector register dependent on a threaded mode ([0025]: the register space may be repartitioned including changing the maximum vector register length; [0019]: the registers may have a maximum length of 16 in vector threaded mode or 1 in scalar threaded mode)
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the output register in the accumulation collector of Shao in view of Wilder to be reconfigurable as taught by Faanes such that the register space of the accumulation collector is divided evenly among the threads and the size of the output register is based on the number of threads. One of ordinary skill in the art would have been motivated to make this modification to allow for greater execution efficiency (Faanes [0026]), for example, by ensuring each thread receives the same amount of space in the output register. 

	Regarding claim 11, Shao in view of Wilder teaches:
11. The graph streaming processor of claim 1, 
	Shao in view of Wilder does not explicitly teach:
wherein a size of the weight registers is dependent on a number of inputs within the subset of inputs of the input tensor.
	However, Faanes teaches configuring a size of a vector register ([0025]: the register space may be repartitioned including changing the maximum vector register length)
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the weight registers in the weight collector of Shao to be reconfigurable as taught by Faanes such that the size of the weight registers is changed to match the number of input activations V sent to the vector macs. One of ordinary skill in the art would have been motivated to make this modification to allow for greater execution efficiency (Faanes [0026]), for example, by eliminating empty space within the register.

	Regarding claim 12, Shao in view of Wilder teaches:
12. The graph streaming processor of claim 1, 
	While Shao teaches the weight matrix having dimensions NxV and an accumulation collector having width NxAP ([0046] and [0048]), Shao in view of Wilder does not teach:
wherein a size of the weight registers is dependent on a number of outputs within the subset of outputs of the output tensor.
	However, Faanes teaches configuring a size of a vector register ([0025]: the register space may be repartitioned including changing the maximum vector register length)
It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the weight registers in the weight collector of Shao to be reconfigurable as taught by Faanes such that the number/size of the weight registers is changed to 

Regarding claim 13, Shao in view of Wilder teaches:
13. The graph streaming processor of claim 1, including a number of threads concurrently running on the plurality of processors (Shao [0127]: a number of threads run concurrently on the streaming multiprocessor implementation)
	Shao in view of Wilder does not explicitly teach:
wherein a size of the weight registers is dependent on a number of threads concurrently running on the plurality of processors.
	However, Faanes teaches configuring a size of a vector register dependent on a threaded mode ([0025]: the register space may be repartitioned including changing the maximum vector register length; [0019]: the registers may have a maximum length of 16 in vector threaded mode or 1 in scalar threaded mode)
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the weight registers of Shao in view of Wilder to be reconfigurable as taught by Faanes such that the register space of the weight registers is divided evenly among the threads and the size of the weight registers is based on the number of threads. One of ordinary skill in the art would have been motivated to make this modification to allow for greater execution efficiency (Faanes [0026]), for example, by ensuring each thread receives the same amount of weights in the weight registers. 

	Regarding claim 20, Shao in view of Wilder teaches:
20. The method of claim 14, 
	Shao in view of Wilder, as currently mapped, does not explicitly teach:
wherein a size of the group load register is dependent on a number of inputs within the subset of inputs of the input tensor.
	However, Faanes teaches configuring a size of a vector register ([0025]: the register space may be repartitioned including changing the maximum vector register length)
	It would have been obvious to one of ordinary skill in the art, before the effective filing date of the claimed invention, to modify the group load register of Shao in view of Wilder to be reconfigurable as taught by Faanes such that the size of the group load register is changed to match the number of inputs V of the subset of input activations of Shao. One of ordinary skill in the art would have been motivated to make this modification to allow for greater execution efficiency (Faanes [0026]), for example, by eliminating empty space within the register.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
US 20050033944- teaches performing fused multiply-adds over a number of iterations in a loop, see [0050]
US 20180341495- teaches a plurality of vMACs with respective weight caches, see Abstract and Fig. 1


Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jyoti Mehta can be reached on (571) 270-3995. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/KASIM ALLI/Examiner, Art Unit 2183