DETAILED ACTION
Claims 1-16 and 18-28 have been examined.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on February 12, 2021, has been entered.

Information Disclosure Statement
In the IDS submitted on November 17, 2020, the foreign patent documents have not been considered (denoted by strike-through) because applicant has not provided copies of these references as required by 37 CFR 1.98(a)(2).  Applicant has submitted abstracts for corresponding U.S. applications, and select drawings, but not the foreign patent documents themselves.

Drawings
Replacement FIG.4 submitted on February 15, 2021, is objected to because of the following minor informalities:
In FIG.4, within box 53 there is a smaller box around the text “Instruction buffers”.  This smaller box needs to be deleted so that this smaller box is not considered an instruction buffer in addition to the originally shown instruction buffers.
In FIG.4, the right arrowhead is missing to the left of the “CSRL” text.  Please re-insert.  See original FIG.4.
In FIG.4, bottom left, the bottom input of the top multiplexer, and the top input of the bottom multiplexer appear to intersect.  Following the conventions used elsewhere in this drawing, applicant should use a gap here to indicate a lack of intersection.  See original FIG.4.
In FIG.4, bottom left, the output of the top multiplexer appears to intersect with another wire because the gap has been shifted down.  Please ensure that all gaps are appropriately drawn so that no lines intersect where improper.
In FIG.4, each instance of “64” needs to be re-inserted in the associated boxes (but flipped 180 degrees from the way they were originally drawn).  The 64s cannot be shown in the current manner because this is how reference numbers are illustrated, and applicant has no reference number 64 in the specification.
Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the application. Please ensure any replacement drawing is in only black and white to avoid pixelation and further objection. Any amended 

Claim Objections
Claim 27 is objected to because of the following informalities:
In line 13, replace “a first” with --the first--.
Appropriate correction is required.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):

(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.

The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 27-28 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor, or for pre-AIA the applicant regards as the invention.
The claims recite the following limitations for which there is a lack of antecedent basis:
In claim 27, line 15, “the operation”.  Applicant could delete “of the operation” or replace “the operation” with --the first arithmetic instruction--.
In claim 27, 5th and 3rd to last lines, both instances of “the weight from a second one of the shared weights registers”.  Is applicant claiming the same weight accessed by the first instruction?  If so, two different registers have the same weight?  If not, is applicant merely trying to claim accessing the single weight that is in the second one of the shared weights registers?  If so, this should read --a weight from a second one of the shared weights registers--.  For prior art purposes, it is assumed that applicant is simply accessing a weight in a second weight register and not the same exact weight accessed by the first instruction that happens to be stored in another weight register.
In claim 27, last line, “the operation”.  Applicant could delete “of the operation”, or replace “the operation” with --the second arithmetic instruction--.
Claim 28 is rejected due to its dependence on an indefinite claim.

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:


(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale or otherwise available to the public before the effective filing date of the claimed invention.

Claims 1, 3-6, 18, 20-23, and 27-28 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Elango, “Convolutional Neural Network Acceleration on GPU by Exploiting Data Reuse”, Spring 2017, 67 pages.
Referring to claim 1, Elango has taught a processor comprising:
a) one or more register files (see p.5, 2nd line under Table 1, and p.6, last full sentence.  Each thread has its own private set of registers.  Section 2.1.2 on p.6 shows an example where a register file may be partitioned into 32 banks, one bank for each thread); and
b) an execution unit configured to execute instructions of an instruction set (see p.4, which states “Each core has an arithmetic and logical unit (ALU) that executes arithmetic and logical operations…”);
c) wherein the execution unit is a barrel-threaded execution unit configured to run a plurality of concurrent threads each in a different respective one of a repeating sequence of interleaved time slots (from section 6.5.2, “Each warp has 32 threads.  Warps are interleaved and executed by the scheduler.”  A warp, as is known, is a group of concurrently executing threads.  So, a first warp of threads will execute concurrently in a first time slot, followed by a second warp of threads that execute in a second time slot, and so on), and for each of the concurrent threads, the one or more register files comprise a respective set of context registers arranged to hold a program state of the respective thread (again, see p.5, 2nd line under Table 1, and p.6, last full sentence.  Each thread has its own private set of registers to store its context), each set of context registers comprising a respective set of arithmetic operand registers for use by the respective thread (from section 7.1, “The destination registers used by other instructions of the kernel, like arithmetic and logical instructions, are still the normal R-type registers.”);
d) wherein one of the one or more register files further comprises a set of shared weights registers configured to hold weights common to some or all of the concurrent threads (from the bottom of p.1, “When a neuron calculates its output, it reuses a part of the adjacent neuron’s input data. In addition, the weights assigned to the inputs of all neurons are the same throughout the layer. Hence input data and weights are often reused and shared across neurons”.  From p.2, “To the best of my knowledge, this is the first study that exploits data sharing across neurons in the micro-architecture level of GPUs. Instead of redundant memory accesses to the slow system memories, I exploited underutilized register file space to maintain data that could be shared. By using a simple register mapping algorithm, neurons can fetch data from the register file if the data have been already fetched by another neuron.”.  Finally, from section 7.1, a group of L registers are shared registers that hold weights shared across multiple threads.  Thus, these L weights registers form a set of shared weights registers);
e) wherein a first one of the concurrent threads and a second one of the concurrent threads both access the shared weights registers and are executed in different time slots in different execution cycles (recall that threads in different warps are interleaved.  So if each warp has 32 threads, then up to 32 threads in warp 1 will execute in a time slot in a first cycle, then up to 32 threads in warp 2 will execute in a time slot in a second cycle, and so on.  Eventually, warp 1 threads will execute again in a time slot in an Nth cycle, etc.  Threads in the same warp are concurrent because they execute at the same time, as is known, but because they are interleaved with other warps, the threads in warp 1 also execute in different time slots in different cycles.  For instance, assuming just two warps for simplicity, concurrent warp 1 threads will execute ;
f) wherein the instruction set includes an arithmetic instruction having operands specifying a source and a destination from amongst the respective set of arithmetic operand registers of the thread in which the arithmetic instruction is executed (again, from section 7.1.  Arithmetic instructions specify at least one source and destination from registers other than the L weights registers.  For instance, to perform the arithmetic described in section 2.2.2.3, source and destination registers must be specified in addition to a weights register.  The source and destination registers are the arithmetic operand registers); and
g) wherein the execution unit is configured so as, in response to an opcode of the arithmetic instruction, to perform a multiplication operation comprising multiplying an input from said source by at least one of the weights from at least one of the shared weights registers, and to place a result in said destination (see section 2.2.2.3).
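For illustration only, the arrangement addressed in limitations (c) through (g) can be sketched as follows. This is a minimal sketch under stated assumptions and is not code from Elango or from applicant's specification: the names NUM_SLOTS, ThreadContext, and run_barrel, the register counts, and the (src, dst, weight index) instruction encoding are all hypothetical.

```python
from dataclasses import dataclass, field

NUM_SLOTS = 4  # hypothetical number of interleaved time slots


@dataclass
class ThreadContext:
    """Private context registers of one thread (its arithmetic operand registers)."""
    regs: list = field(default_factory=lambda: [0.0] * 4)


def run_barrel(contexts, shared_weights, program, cycles):
    """Issue at most one instruction per cycle, rotating through the time slots.

    program is a list of (src, dst, weight_idx) tuples; every thread runs the
    same program here purely to keep the sketch short.
    Returns a trace of (cycle, slot) pairs recording which slot issued when.
    """
    pcs = [0] * len(contexts)  # one program counter per thread
    trace = []
    for cycle in range(cycles):
        slot = cycle % len(contexts)  # repeating sequence of time slots
        ctx = contexts[slot]
        if pcs[slot] < len(program):
            src, dst, w = program[pcs[slot]]
            # Multiply a private source register by a shared weight and
            # place the result in a private destination register.
            ctx.regs[dst] = ctx.regs[src] * shared_weights[w]
            pcs[slot] += 1
            trace.append((cycle, slot))
    return trace


shared_weights = [0.5, 2.0]  # weights common to all threads
contexts = [ThreadContext() for _ in range(NUM_SLOTS)]
for i, ctx in enumerate(contexts):
    ctx.regs[0] = float(i + 1)  # each thread has its own input
program = [(0, 1, 1)]  # regs[1] = regs[0] * shared_weights[1]
trace = run_barrel(contexts, shared_weights, program, cycles=NUM_SLOTS)
```

In this sketch each thread issues in its own slot in a different cycle, reads the same weight, and writes only its own destination register.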
Referring to claim 3, Elango has taught the processor of claim 1, wherein the arithmetic instruction takes a further operand specifying said at least one of the shared weights registers from amongst the set of shared weights registers (weights may be stored in the odd registers as shown in FIG.19.  As such, at least one of these odd registers must be indicated by an operand in an instruction).
Referring to claim 4, Elango has taught the processor of claim 1, wherein the input comprises a vector, and the multiplication operation comprises a dot product of the input with a vector of weights from the shared weights registers (see section 2.2.2.3.  The operation shown is a dot product of a pixel input vector and a weight filter vector).
Referring to claim 5, Elango has taught the processor of claim 1, wherein the arithmetic instruction takes a further operand specifying said at least one of the shared weights registers from amongst the set of shared weights registers (see sections 2.2.2.3 and 2.2.2.4 and note that weights are accessed for arithmetic instructions (adds and multiplication, or dot product).  Per section 7.1, weights are stored in L registers and thus they are identified by an operand of the arithmetic instructions performing the aforementioned math), and wherein said at least one of the shared weights registers comprises a subset of the shared weights registers from amongst a plurality of subsets, each subset holding a respective weights vector; and wherein said further operand selects from which subset to take the weights vector to use in said multiplication operation (see FIG.19.  There are multiple weight vector subsets.  And, as little as one register may be a subset of the total weight registers).
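As a hedged illustration of the limitations of claims 4 and 5, the following sketch shows a dot product in which a further operand selects which subset of shared weights registers supplies the weights vector. The names weight_subsets, dot_product, and execute_dot, and the example values, are hypothetical and do not appear in Elango or in the claims.

```python
def dot_product(inputs, weights):
    """Dot product of an input vector with a weights vector of equal length."""
    if len(inputs) != len(weights):
        raise ValueError("input and weights vectors must have the same length")
    return sum(x * w for x, w in zip(inputs, weights))


# Hypothetical shared weights registers, grouped into subsets that each
# hold one weights vector.
weight_subsets = [
    [1.0, 0.0, -1.0],  # subset 0
    [0.5, 0.5, 0.5],   # subset 1
]


def execute_dot(inputs, subset_operand):
    """Model an arithmetic instruction whose further operand picks the subset."""
    return dot_product(inputs, weight_subsets[subset_operand])
```

For example, execute_dot([2.0, 4.0, 6.0], 0) uses the first weights vector, while the same input with operand 1 uses the second.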
Referring to claim 6, Elango has taught the processor of claim 1, wherein said arithmetic instruction is an item selected from a list consisting of: a vector dot product instruction (see section 2.2.2.3), an accumulating vector dot product instruction, a matrix product instruction, an accumulating matrix product instruction, and a convolution instruction (see section 2.2.2.3).
Claim 18 is rejected for similar reasons as claim 1, mutatis mutandis.  Note that each set of registers for a respective thread is a register file (physical and/or virtual).  Thus, there are multiple register files.
Claims 20-23 are respectively rejected for similar reasons as claims 3-6, mutatis mutandis.
Claim 27 is mostly rejected for reasons set forth in the rejection of claim 1.  Further, note that each thread performs the dot product shown in section 2.2.2.3 on its respective portion of the image (in response to an inherent arithmetic instruction).  To carry this out, shared weights making up the weight filter would be used by each thread.  As such, Elango also anticipates the first and second arithmetic instructions based on first and second weights in shared weights registers to generate results to be stored in respective first and second destination registers.
Referring to claim 28, Elango has taught the method of claim 27, wherein executing the first arithmetic instruction comprises: running a program comprising the first arithmetic instruction on the processor through the execution unit (see section 6.2, first sentence, and section 7.1, and note the arithmetic instructions.  Instructions are inherently part of a program run through an execution unit of a processor).

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
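A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains.  Patentability shall not be negated by the manner in which the invention was made.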

Claims 2 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Elango in view of the examiner’s taking of Official Notice.
Referring to claim 2, Elango has taught the processor of claim 1, but has not taught wherein said at least one of the shared weights registers is implicit from the opcode of the arithmetic instruction, not specified by any operand of the arithmetic instruction. However, implicit operands associated with an opcode, and not specified by a programmer, are known in the art.  If an instruction does not include an explicit operand field, then the bits that would be used for such a field would be freed for some other purpose, if desired by a designer.  Further, a low-level programmer would not have to worry about specifying an operand, which could save time.  This is a known technique that could be used to improve Elango while also yielding predictable results.  That is, an operand field could be eliminated from an instruction in Elango, and the instruction configured to automatically access a predetermined operand with the needed weight.  As a result, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Elango such that said at least one of the shared weights registers is implicit from the opcode of the arithmetic instruction, not specified by any operand of the arithmetic instruction.
Claim 19 is rejected for similar reasons as claim 2.
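The distinction between an explicit weights operand and a weights register implied by the opcode can be sketched as follows. This is an illustrative sketch only: the opcode names MUL, MUL_W0, and MUL_W1, the tuple encoding, and the table IMPLICIT_WEIGHT_REG are hypothetical and are not taken from Elango or from any real instruction set.

```python
SHARED_WEIGHTS = [3.0, 5.0]  # hypothetical shared weights registers

# Hypothetical opcode table: each opcode hard-wires the weights register it
# uses, so the instruction carries no weights operand field.
IMPLICIT_WEIGHT_REG = {"MUL_W0": 0, "MUL_W1": 1}


def execute(instr, regs):
    """Execute one multiply instruction against a thread's private registers."""
    opcode = instr[0]
    if opcode == "MUL":
        # Explicit form: (opcode, src, dst, weight_reg)
        _, src, dst, w = instr
    elif opcode in IMPLICIT_WEIGHT_REG:
        # Implicit form: (opcode, src, dst); the opcode alone picks the weight.
        _, src, dst = instr
        w = IMPLICIT_WEIGHT_REG[opcode]
    else:
        raise ValueError(f"unknown opcode: {opcode}")
    regs[dst] = regs[src] * SHARED_WEIGHTS[w]
    return regs
```

Both forms compute the same result; the implicit form simply has one fewer operand field to encode.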

Claims 7, 14, and 24 are rejected under 35 U.S.C. 103 as being unpatentable over Elango in view of Wikipedia, “Thread pool”, June 2017, 4 pages.
Referring to claim 7, Elango has taught the processor of claim 1, wherein the concurrent threads comprise a plurality of worker threads (from p.1 “hundreds or even thousands of neurons are processed concurrently by employing that many hardware threads, where individual threads deal with a neuron’s work”).  Elango has not taught that the execution unit is further arranged to run, at least at some times, a supervisor subprogram comprising at least one supervisor thread configured to manage the worker threads.  However, Wikipedia has taught a supervisor program that executes to allocate tasks to concurrent threads (see the first paragraph).  This is one known way of scheduling and managing threads that also happens to increase performance and reduce latency by fine tuning the number of threads to create.  As a result, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Elango such that the execution unit is further arranged to run, at least at some times, a supervisor subprogram comprising at least one supervisor thread configured to manage the worker threads.
Referring to claim 14, Elango, as modified, has taught the processor of claim 7, wherein the sets of context registers include a separate arithmetic register file for each concurrent worker thread, the separate arithmetic register file of a given worker thread comprising the respective set of arithmetic operand registers of the given worker thread (again, see p.5, 2nd line under Table 1, and p.6, last full sentence).
Claim 24 is rejected for similar reasons as claim 7.
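The thread pool pattern relied upon above, in which a supervising program allocates tasks to a set of concurrent worker threads, can be sketched in software as follows. This is a minimal sketch using Python's standard concurrent.futures module; the functions neuron_output and supervise and the choice of four workers are hypothetical, and the sketch models software threads rather than the hardware threads of Elango.

```python
from concurrent.futures import ThreadPoolExecutor


def neuron_output(inputs, weights):
    """Work done by one worker thread: a weighted sum for one neuron."""
    return sum(x * w for x, w in zip(inputs, weights))


def supervise(all_inputs, weights, num_workers=4):
    """Supervisor: allocate one task per neuron to a pool of worker threads.

    The same weights list is handed to every task, so it is shared rather
    than duplicated per worker.
    """
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(neuron_output, inputs, weights)
                   for inputs in all_inputs]
        # Collect results in submission order.
        return [f.result() for f in futures]
```

Here the supervisor decides how many workers exist and which task each receives, while the workers only perform the arithmetic.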

Claims 7-9, 14, and 24-26 are rejected under 35 U.S.C. 103 as being unpatentable over Elango in view of Nickolls et al., U.S. Patent Application Publication No. 2008/0184211 A1 (herein referred to as Nickolls).
Referring to claim 7, Elango has taught the processor of claim 1, wherein the concurrent threads comprise a plurality of worker threads (from p.1 “hundreds or even thousands of neurons are processed concurrently by employing that many hardware threads, where individual threads deal with a neuron’s work”).  Elango has not taught that the execution unit is further arranged to run, at least at some times, a supervisor subprogram comprising at least one supervisor thread configured to manage the worker threads.  However, Nickolls has taught a program that creates and initializes state for concurrent threads using commands in FIG.10.  See paragraphs [0177]-[0179], among others.  This allows a system to create the desired number of threads, as well as to set other parameters, for increased flexibility.  As a result, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Elango such that the execution unit is further arranged to run, at least at some times, a supervisor subprogram comprising at least one supervisor thread configured to manage the worker threads.
Referring to claim 8, Elango, as modified, has taught the processor of claim 7, but has not taught wherein the supervisor subprogram is configured to write the weights in the shared weights registers file.  However, Nickolls has taught that the supervisor program can write various parameters.  See paragraph [0104], for instance.  The examiner asserts that it does not matter which threads load the state, so long as the threads can use the state to perform the appropriate operation.  Having the supervisor do the loading is one of a small number of predictable solutions that would have allowed for a reasonable expectation of success.  As a result, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to further modify Elango such that the supervisor subprogram is configured to write the weights in the shared weights registers file.
Referring to claim 9, Elango, as modified, has taught the processor of claim 8, configured such that the weights in the shared weights registers can be written only by the supervisor subprogram, and the worker threads can only read the shared weights registers (see Nickolls, paragraph [0104]).
Referring to claim 14, Elango, as modified, has taught the processor of claim 7, wherein the sets of context registers include a separate arithmetic register file for each concurrent worker thread, the separate arithmetic register file comprising the respective arithmetic operand registers (again, see p.5, 2nd line under Table 1, and p.6, last full sentence).
Claim 24 is rejected for similar reasons as claim 7.
Claim 25 is rejected for similar reasons as claims 7-8.
Claim 26 is rejected for similar reasons as claims 7 and 9.
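The access restriction addressed in claims 9 and 26, where only the supervisor may write the shared weights registers and worker threads may only read them, can be illustrated with the following sketch. The class SharedWeightsRegisters and the is_supervisor flag are hypothetical constructs for illustration and are not drawn from Elango or Nickolls.

```python
class SharedWeightsRegisters:
    """Hypothetical model of weights registers with supervisor-only writes."""

    def __init__(self, size):
        self._regs = [0.0] * size

    def read(self, index):
        """Any thread, supervisor or worker, may read a weight."""
        return self._regs[index]

    def write(self, index, value, is_supervisor):
        """Only the supervisor may write; a worker's attempt is rejected."""
        if not is_supervisor:
            raise PermissionError(
                "worker threads may only read the shared weights registers")
        self._regs[index] = value
```

In hardware such a restriction would more plausibly be enforced by which instructions or modes can address the registers; the flag here only stands in for that distinction.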

Claims 10-11 are rejected under 35 U.S.C. 103 as being unpatentable over Elango in view of Wikipedia (or Nickolls) and the examiner’s taking of Official Notice.
Referring to claim 10, Elango, as modified, has taught the processor of claim 7, wherein the sets of context registers comprise a respective one of the sets of context registers for each of the worker threads that can be executed concurrently (again, see Elango, p.5, 2nd line under Table 1, and p.6, last full sentence.  Each set of registers is a context for that thread).  Elango has not taught an additional set of context registers arranged to hold a program state of the supervisor subprogram.  However, it is known practice to assign a context to each separate thread so that there is no corruption of one thread’s context by another thread.  This would increase security and independence and allow the main supervisor thread to not have its data overwritten or accessed by another thread.  As such, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Elango to include an additional set of context registers arranged to hold a program state of the supervisor subprogram.
Referring to claim 11, Elango, as modified, has taught the processor of claim 10, but has not taught wherein the supervisor subprogram is arranged to begin by initially running in all the slots, and to write the weights before launching the worker threads; and wherein the supervisor subprogram launches each of the worker threads by relinquishing each of some or all of the slots in which the supervisor subprogram is initially running to respective ones of the worker threads.  However, before the worker threads are started, the supervisor program would be the only thread to be executed.  In order to maximize utilization of parallel resources and increase execution speed, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Elango such that the supervisor subprogram is arranged to begin by initially running in all the slots, and to write the weights before launching the worker threads; and wherein the supervisor subprogram launches each of the worker threads by relinquishing each of some or all of the slots in which the supervisor subprogram is initially running to respective ones of the worker threads.  This would allow multiple supervisory tasks to be performed sooner.

Claims 12-13 are rejected under 35 U.S.C. 103 as being unpatentable over Elango in view of Nickolls and the examiner’s taking of Official Notice.
Referring to claim 12, Elango, as modified, has taught the processor of claim 11, wherein the instruction set includes a run instruction which, when executed as part of the supervisor subprogram, causes the slot in which the run instruction is executed to be relinquished to a first worker thread such that the first worker thread is launched in that slot in place of the supervisor subprogram (see FIG.10 of Nickolls, and note “launchCTA”, which is executed in an inherent slot, and results in threads being executed in that, and other slots).
Referring to claim 13, Elango, as modified, has taught the processor of claim 12, wherein the instruction set includes an exit instruction which, when executed as part of the first worker thread, causes the slot in which the exit instruction is executed to be handed back to the supervisor subprogram such that the supervisor subprogram continues running in that slot again in place of the first worker thread (see paragraph [00150] of Nickolls.  A thread in a slot executes a trap instruction.  The trap causes the thread to stop and the slot is used to execute a trap handler, which is considered part of the supervisor).
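The slot handoff recited in claims 11 through 13 can be sketched as follows: the supervisor initially occupies every slot, a run operation relinquishes a slot to a worker, and an exit operation hands the slot back. This sketch models only the claim language; the class Slots and its methods are hypothetical and do not represent the launchCTA command or the trap mechanism of Nickolls.

```python
SUPERVISOR = "supervisor"


class Slots:
    """Hypothetical model of time slots and which thread currently owns each."""

    def __init__(self, n):
        # The supervisor begins by running in all of the slots.
        self.owner = [SUPERVISOR] * n

    def run(self, slot, worker):
        """Supervisor relinquishes its slot so the worker is launched there."""
        if self.owner[slot] != SUPERVISOR:
            raise RuntimeError("run must be executed by the supervisor in that slot")
        self.owner[slot] = worker

    def exit(self, slot):
        """Worker hands the slot back; the supervisor resumes in its place."""
        if self.owner[slot] == SUPERVISOR:
            raise RuntimeError("exit must be executed by a worker in that slot")
        self.owner[slot] = SUPERVISOR
```

For example, after launching two workers in a three-slot machine, the supervisor still owns the third slot and regains the others one at a time as workers exit.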

Claims 15-16 are rejected under 35 U.S.C. 103 as being unpatentable over Elango in view of Wikipedia (or Nickolls) and Wang et al., U.S. Patent Application Publication No. 2015/0160981 A1 (as cited by applicant and herein referred to as Wang).
Referring to claim 15, Elango, as modified, has taught the processor of claim 14, but has not taught wherein the sets of context registers include a separate weights register file comprising the weights registers.  However, Wang has taught that shared data among threads may be stored in a global register file 110.  This is one predictable solution that would have been obvious to try and that would yield a predictable result with an expectation of success.  This shared register file would allow all weights to be stored in a central location so that they don’t need to be duplicated for the threads.  As a result, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Elango such that the sets of context registers include a separate weights register file comprising the weights registers.
Referring to claim 16, Elango, as modified, has taught the processor of claim 15, but has not taught wherein the weights register file is arranged such that it can be written only by the supervisor subprogram and the worker threads can only read the weights register file.  However, Nickolls has taught that constant values may be stored in a global memory and may be configured to be read-only (see paragraph [0105]).  From FIGs.4-5 and section 2.2.2.4, a single feature map shares the same weight values (they are constant).  As such, they do not need to be modified and can be deemed constants for the given feature map.  This ensures they cannot be overwritten and that they are only stored once for the threads (in shared form), reducing redundant storage.  Consequently, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Elango such that the weights register file is arranged such that it can be written only by the supervisor subprogram and the worker threads can only read the weights register file.

Response to Arguments
On page 13 of applicant’s response, applicant argues that Elango’s reference to inter-warp data sharing is ambiguous and does not teach threads executing in different clock cycles would both access a common set of shared weights registers.
After further consideration, and as explained in the rejection of claim 1 above, Elango anticipates this in multiple ways: (1) threads in the same warp will execute in different time slots and cycles (due to interleaving).  Threads in the same warp share weights in shared weight registers; and (2) threads in two different warps are still concurrent (in progress at the same time may not be sharing a weight, each of these threads still accesses shared registers to share weights with other threads in their respective warp.  The collection of all weight registers in the system that hold a shared weight makes up the set of shared weights registers.  Furthermore, while applicant points out the lack of detail in Elango with respect to inter-warp sharing, the examiner believes that this may be obvious based on at least one of the KSR rationales set forth in MPEP 2143 (though this is subject to further consideration).  That is, the intra-warp data sharing is already detailed by Elango.  Elango also ponders inter-warp sharing and even states that it can be done with some modifications.  One would be motivated to share across warps to even further reduce redundant load operations and associated data retrieval times.  That is, inter-warp sharing would be useful for the same reason intra-warp sharing would be useful.  The examiner wants to point this out in case applicant amends the claims such that Elango no longer anticipates the claims based on intra-warp sharing.

On page 14 of applicant’s response, applicant argues that Nickolls sets forth read-only parameters and does not disclose that these parameters are in registers.  Applicant further states that the parameters are for setting up the threads, not for consumption by the threads.
The examiner notes that Elango has already taught weight parameters in registers.  Nickolls is brought in only to teach that a supervisor can load these parameters instead of a worker thread itself.  The examiner does not see a patentable difference regarding which software is performing the loading, as long as the loading is performed.  The examiner is not aware of Nickolls’ teaching that the parameters are not consumed by threads.  FIG.7 and 

Conclusion
The following prior art made of record and not relied upon is considered pertinent to applicant's disclosure:
Hegde et al., UCNN: Exploiting Computational Reuse in Deep Neural Networks via Weight Repetition, April 18, 2018, pp.1-14, has taught an accelerator to take advantage of repeating weights in neural net applications.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to David J. Huisman whose telephone number is 571-272-4168.  The examiner can normally be reached on Monday-Friday, 9:00 am-5:30 pm.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jyoti Mehta, can be reached on 571-270-3995.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov.  Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).  If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/David J. Huisman/Primary Examiner, Art Unit 2183