DETAILED ACTION
Claims 1-2, 4-7, 9-12, and 14-15 have been examined.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Objections
All claims that recite “graphics processing unit” (or units) are objected to because use of this term appears to conflict with the disclosure.  That is, in the portion bolded by applicant in paragraph [0131] on page 9 of the response, applicant discloses “processing elements belonging to different execution units”.  The independent claims, however, recite “different processing elements are located in different graphics processing units”.  In the claims, where applicant recites “graphics processing units”, is applicant referring to execution units within a graphics processing unit?  For instance, see FIG.16, which seems to show a single graphics processing unit 1600 having multiple execution units 1610, each having multiple processing elements 1642.  While one might be able to broadly call an execution unit a graphics processing unit, this is confusing because applicant actually shows a component explicitly called a “graphics processing unit” as containing execution units, not being an execution unit.  If applicant means “execution units”, please use this terminology instead.  Please clarify/correct.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-2, 4-7, and 9-10 are rejected under 35 U.S.C. 103 as being unpatentable over Fung et al., “Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow”, 40th IEEE/ACM International Symposium on Microarchitecture, 2007, pp.407-418 (herein referred to as Fung), in view of Paul et al., U.S. Patent Application Publication No. 2014/0143565 A1 (herein referred to as Paul).
Referring to claim 1, Fung has taught a graphics processor (see FIG.2) comprising:
a) a plurality of graphics processing cores (see FIG.2, and note the graphics shader cores);
b) While Fung has taught that each core has an instruction cache to receive a stream of instructions (see FIG.2, I-Cache), Fung has not taught that the plurality of graphics processing cores are communicably coupled to the instruction cache.  However, Paul, in FIG.1, has taught graphics cores 108 to 110 that not only include their own cache 112 to 114, but also include a shared L2 cache 116 for storing instructions (see paragraph [0024]).  A memory hierarchy has known advantages in the art, including to balance access speed of instructions and cost (memories closer to the cores are usually smaller and more expensive).  Therefore, by including a shared L2 instruction cache in Fung, the system would be able to more quickly access instructions that cannot fit into the individual core caches.  This is beneficial because without the L2 cache, those instructions would be retrieved much more slowly from slower main memory.  As a result, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Fung such that the plurality of graphics processing cores are communicably coupled to the instruction cache.
c) Fung has further taught the graphics processing cores comprising:
	c1) a plurality of graphics processing units (see the pipelines of the cores in FIG.2) comprising a plurality of processing elements (there are inherent processing elements in each pipeline to carry out various functions);
c2) a thread control circuit to:
c2A) receive an instruction set of the stream of instructions for execution on at least two graphics processing units of the plurality of graphics processing units (see FIG.9.  Each warp (Wn) (which includes a group of threads - see the abstract) includes a set of instructions from up to four threads that correspond to the four graphics processing units (pipelines) in FIG.2.  For example, warp W0 includes a set of instructions to execute on at least two of the units); and
c2B) determine whether the instruction set requires data dependent addressing (FIG.9 sets forth operation in response to a conditional branch.  A conditional branch performs addressing of a target instruction as a result of its dependence on a data input (see the paragraph before section 3.1 - “input-dependent branch”).  As an example, in FIG.9, the branch at the diverge point is data dependent on some input.  If the input satisfies some condition, then data-dependent addressing of the target (e.g. code B) is required.  Otherwise, data-dependent addressing of the target is not required and code A is executed); and
d) Fung has further taught a scheduler communicably coupled to the plurality of graphics processing units and the thread control circuit (see FIG.4.  The scheduler is coupled to the processing units (on right), and some control circuit which sends threads to the scheduler (on left)), the scheduler to select between a synchronized execution environment for the at least two graphics processing units and an unsynchronized execution environment for the at least two graphics processing units based at least in part on the determination whether the instruction set requires data dependent addressing, wherein the synchronized execution environment is selected in response to the determination that the instruction set does not require data dependent addressing, and wherein the unsynchronized execution environment is selected in response to the determination that the instruction set requires data dependent addressing; and in response to selection of the synchronized execution environment, synchronize instruction transmission by sending a first instruction in the instruction set to the at least two graphics processing units for synchronous execution in parallel by different processing elements of the plurality of processing elements, wherein the different processing elements are located in different graphics processing units of the at least two graphics processing units (see FIG.9.  For threads in a warp that don’t require data-dependent addressing of the target and execute the fall-through code (e.g. code A), a synchronized environment is selected and the same instruction is sent to the number of units required by those threads.  For instance, in W0, the top two threads (solid arrowheads) fall through to the same instruction(s) (code A), and code A is synchronously sent to different processing elements in two different pipelines (graphics processing units) for parallel execution.  
For the remaining threads that require data-dependent addressing and jump to the branch target (e.g. code B), an unsynchronized environment is selected because instead of synchronously transmitting the bottom two threads (in W0) with the upper two threads, they execute serially with respect to the threads executing code A (i.e., W0-B follows W0-A)).
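The examiner's reading of FIG.9 of Fung, as applied above, may be sketched as follows.  This is a minimal illustration under stated assumptions and is not taken from Fung: the function name split_warp, the lane numbering, and the per-lane branch outcomes are hypothetical.  It shows only that lanes sharing a branch outcome are issued together, while the two outcome groups are issued one after the other (W0-A, then W0-B).

```python
def split_warp(takes_branch):
    """Partition a warp's lanes (pipelines) by branch outcome.

    takes_branch: one boolean per lane; True means the lane jumps to
    the branch target (code B), False means it falls through (code A).
    Returns the issue groups in serial order, dropping empty groups.
    """
    fall_through = [lane for lane, taken in enumerate(takes_branch) if not taken]
    taken_lanes = [lane for lane, taken in enumerate(takes_branch) if taken]
    groups = [("A", fall_through), ("B", taken_lanes)]
    return [(code, lanes) for code, lanes in groups if lanes]


# Hypothetical warp W0: lanes 0-1 fall through, lanes 2-3 take the branch.
w0_schedule = split_warp([False, False, True, True])

# Hypothetical warp with no divergence: all four lanes issue together.
uniform_schedule = split_warp([False, False, False, False])
```

Under these assumptions, each group corresponds to one instruction sent to several pipelines at once, and the order of the groups corresponds to the serial execution of W0-A and W0-B relied upon above.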
Referring to claim 2, Fung, as modified, has taught the graphics processor of claim 1, the thread control circuit to:
a) determine that the instruction set does not require data dependent addressing (the component that determines that the threads diverge at the branch point is the thread dispatcher); and
b) wherein the scheduler, in response to the determination that the instruction set does not require data dependent addressing, is to: synchronize transmission of the instruction set on a data bus coupled to the at least two graphics processing execution units (again, see the rejection of claim 1.  Any threads that don’t require data-dependent addressing execute the fall-through code (e.g. code A).  As such, code A is synchronously transmitted on a data bus to the graphics units to execute code A at the same time (see W0-A)).
Claim 4 is rejected for reasoning set forth in the rejection of claim 1.  That is, if data-dependent addressing is required for at least one thread, then there will be some lack of synchronization because W0-A and W0-B will have to execute serially in the unsynchronized environment.
Claim 5 is rejected for reasoning set forth in the rejection of claim 1.  Again, the solids in W0 are executed serially with respect to the empties in W0.
Claims 6-7 and 9-10 are respectively rejected for similar reasons as claims 1-2 and 4-5.

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale or otherwise available to the public before the effective filing date of the claimed invention.

Claims 11-12 and 14-15 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Fung.
Claims 11-12 and 14-15 are respectively rejected for similar reasoning set forth in the rejections of claims 1-2 and 4-5.  However, as claims 11-12 and 14-15 do not require the “instruction cache” limitations of claim 1, claims 11-12 and 14-15 are anticipated by Fung (not obvious in view of Fung and Paul).

---------------------------------------------------------------------------------------------------------------------

Claim Rejections - 35 USC § 103

Claims 1-2, 4-7, and 9-10 are rejected under 35 U.S.C. 103 as being unpatentable over Lanka et al., U.S. Patent Application Publication No. 2017/0300361 A1 (herein referred to as Lanka), in view of Paul.
Referring to claim 1, Lanka has taught a graphics processor (e.g. FIG.8) comprising:
a) a plurality of graphics processing cores (FIG.8, cores 880);
b) Lanka has further taught the graphics processing cores comprising:
	b1) a plurality of graphics processing units comprising a plurality of processing elements (see FIG.8-9, graphics processing units 850 that include processing elements (EUs).  Alternatively, each collection of EUs 852, 862 may be a graphics processing unit.  Each collection includes individual processing elements (EUs).  Alternatively, a processing element may be any portion of logic that assists in carrying out execution);
b2) a thread control circuit (at least FIG.9, 904) to:
b2A) receive an instruction set of a stream of instructions for execution on at least two graphics processing units of the plurality of graphics processing units (see FIG.2B and paragraphs [0045]-[0051].  An instruction set (at least two of commands 1-6 of FIG.2B) is received.  As instructions may be executed in parallel (paragraphs [0045] and [0049]), they would inherently require two graphics processing units.  Note that “at least two graphics processing units” are not required to be at the same level of granularity as described in part (b1) above); and
b2B) determine whether the instruction set requires data dependent addressing (the examiner notes that applicant has not explained what data-dependent addressing is.  In FIG.2B, dependencies between commands are shown.  Instructions access/address registers (for inputs and outputs) (e.g. paragraph [0019]).  Thus, any addressing that dependent commands 3, 5, and 6 perform is dependent on notification data (e.g. for CMD3 to proceed, notification data corresponding to NOTIFY 1 must be received)); and
c) a scheduler (at least FIG.2B) communicably coupled to the plurality of graphics processing units and the thread control circuit, the scheduler to select between a synchronized execution environment for the at least two graphics processing units and an unsynchronized execution environment for the at least two graphics processing units based at least in part on the determination whether the instruction set requires data dependent addressing, wherein the synchronized execution environment is selected in response to the determination that the instruction set does not require data dependent addressing, and wherein the unsynchronized execution environment is selected in response to the determination that the instruction set requires data dependent addressing; and in response to selection of the synchronized execution environment, synchronize instruction transmission by sending a first instruction in the instruction set to the at least two graphics processing units for synchronous execution in parallel by different processing elements of the plurality of processing elements, wherein the different processing elements are located in different graphics processing units of the at least two graphics processing units (see FIG.2B and paragraphs [0045]-[0051].  As instructions 1, 2, and 4 are independent (require no data-dependent addressing), a synchronized environment is selected in which they are executed in parallel, inherently by different processing elements in different graphics processing units.  As the graphics instruction set includes a SIMD instruction set (paragraph [0100]), executing multiple SIMD instructions in parallel requires multiple different processing elements in different graphics processing units (as interpreted above).  However, for commands 3, 6, and 5, which include data-dependent addressing, an unsynchronized environment is selected that involves non-parallel transmission/execution.
For example, instructions 3 and 6 are transmitted/executed after 1, 2, and 4.  And, instruction 5 is transmitted/executed after instructions 3 and 6).
d) Lanka has not taught that the plurality of graphics processing cores are communicably coupled to an instruction cache to receive the stream of instructions.  However, Paul, in FIG.1, has taught graphics cores 108 to 110 that not only include their own cache 112 to 114, but also include a shared L2 cache 116 for storing instructions (see paragraph [0024]).  A memory hierarchy has known advantages in the art, including to balance access speed of instructions and cost (memories closer to the cores are usually smaller and more expensive).  Therefore, by including a shared L2 instruction cache in Lanka, the system would be able to more quickly access instructions that cannot fit into the individual caches (e.g. FIG.9, 906).  This is beneficial because without the shared L2 cache, those instructions would be retrieved much more slowly from slower main memory.  As a result, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Lanka such that the plurality of graphics processing cores are communicably coupled to an instruction cache to receive the stream of instructions.
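The ordering of commands relied upon in part (c) above (commands 1, 2, and 4 together, then 3 and 6, then 5) may be sketched as follows.  This is a minimal illustration under stated assumptions and is not taken from Lanka: the function name issue_waves is hypothetical, and the dependency table is assumed for illustration.  Only the dependence of CMD3 on NOTIFY 1 and the overall ordering are drawn from the discussion above; the prerequisites listed for commands 5 and 6 are assumptions chosen to reproduce that ordering.

```python
def issue_waves(deps):
    """Group commands into waves of mutually independent commands.

    deps maps each command number to the commands whose notifications
    it must wait for. Commands in one wave may issue in parallel;
    successive waves issue serially.
    """
    done, waves = set(), []
    remaining = dict(deps)
    while remaining:
        # A command is ready once every prerequisite has already issued.
        ready = sorted(c for c, pre in remaining.items() if set(pre) <= done)
        if not ready:
            raise ValueError("cyclic dependency among commands")
        waves.append(ready)
        done.update(ready)
        for c in ready:
            del remaining[c]
    return waves


# Assumed dependency table (hypothetical except CMD3 waiting on CMD1).
deps = {1: [], 2: [], 4: [], 3: [1], 6: [2], 5: [3, 6]}
waves = issue_waves(deps)
```

Under these assumptions, the first wave corresponds to the synchronized environment (independent commands issued in parallel), and the later waves correspond to the unsynchronized environment (commands held back until their notification data arrives).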
Referring to claim 2, Lanka, as modified, has taught the graphics processor of claim 1, the thread control circuit to:
a) determine that the instruction set does not require data dependent addressing (such is the case for instructions 1, 2, and 4 in FIG.2B); and
b) wherein the scheduler, in response to the determination that the instruction set does not require data dependent addressing, is to: synchronize transmission of the instruction set on a data bus coupled to the at least two graphics processing execution units (again, the instructions are transmitted/executed in parallel per paragraph [0049]).
Claim 4 is rejected for reasoning set forth in the rejection of claim 1.  That is, if data-dependent addressing is required, then instructions will execute serially in the unsynchronized environment.
Claim 5 is rejected for reasoning set forth in the rejection of claim 1.  Again, CMDs 3 and 6 are executed in parallel, but serially with respect to CMDs 1, 2, 4, and 5.
Claims 6-7 and 9-10 are respectively rejected for similar reasons as claims 1-2 and 4-5.

Claim Rejections - 35 USC § 102
Claims 11-12 and 14-15 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Lanka.
Claims 11-12 and 14-15 are respectively rejected for similar reasoning set forth in the rejections of claims 1-2 and 4-5.  However, as claims 11-12 and 14-15 do not require the “instruction cache” limitations of claim 1, claims 11-12 and 14-15 are anticipated by Lanka (not obvious in view of Lanka and Paul).

Response to Arguments
On page 10 of applicant’s response, applicant argues that Fung does not teach the processing elements being located in different processing units.
The examiner respectfully disagrees.  Recall that the graphics processing units in Fung correspond to the four pipelines shown in a shader core in FIG.2.  Since there are four pipelines, four threads may execute the same instruction in parallel (FIG.9).  Thus, from FIG.9, when two threads go down the same branch, e.g. to code B, then the same instruction in code B will be sent to two of the pipelines (graphics processing units) for execution.  This would correspond to warp W0-B, for instance, in FIG.9.

The examiner notes that applicant has not addressed the 103 rejection based on Lanka.  As such, this rejection is maintained.

Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to David J. Huisman whose telephone number is 571-272-4168.  The examiner can normally be reached on Monday-Friday, 9:00 am-5:30 pm.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jyoti Mehta, can be reached at 571-270-3995.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov.  Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).  If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/David J. Huisman/
Primary Examiner, Art Unit 2183