DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Terminal Disclaimer
The terminal disclaimer filed on 10/06/2022 disclaiming the terminal portion of any patent granted on this application which would extend beyond the expiration date of US Patent No. 9727944 has been reviewed and is accepted.  The terminal disclaimer has been recorded.

Response to Amendment
 	This is in response to applicant’s amendment/response filed on 10/06/2022, which has been entered and made of record.    Claims 21-40 are pending in the application. 

Response to Arguments
 	Applicant's arguments filed on 10/06/2022 have been fully considered but they are not persuasive.
Applicants state that “White is silent regarding storage 800, reorder buffer 32, or instruction cache 16 storing any hazard information. Applicant submits that White therefore cannot teach or suggest "instruction cache circuitry configured to store ... hazard information" as recited in claim 21”.

	The examiner disagrees. White teaches instruction cache circuitry (Fig 2, reorder buffer) configured to store: decoded instructions; and hazard information generated by the hazard circuitry (abstract, “in a superscalar processor may be a reorder buffer which stores information corresponding to concurrently dispatched instructions”, col 2:55-67 and col 3: 1-7, “the reservation stations are coupled to receive and store operand information and the above mentioned indication.  In addition, the reservation stations are configured to update the stored operand information with a tag of the above mentioned first instruction, in response to detecting the indication of a dependency …. A plurality of instructions are concurrently decoded and operand request information is obtained from the decoded instructions.  A dependency of a second instruction of the plurality of instructions on a first instruction of the plurality of instructions is detected, where the second instruction is subsequent to the first instruction in program order.  Operand information corresponding to instructions prior to the plurality of instructions is stored for the second instruction in a reservation station.  The stored operand information is updated with a tag of the first instruction in response to detecting the above dependency”; Fig 2, col 10:25-53, “Also illustrated in FIG. 2 is the storage of instruction information by reorder buffer 32.  Instruction information storage 800 may comprise lines of storage 320 where each line is configured to store information pertaining to up to three simultaneously issued instructions, I0-I2.  In addition, instruction information storage may store shared information, SH, which is associated with the line of storage.  
When instructions are issued from decode units 20A-20C, instruction information is conveyed to instruction information storage 800 and operand requests for the instructions are conveyed to future file 140 and intra-line dependency checking logic 330”).
Applicants state that “(A) Luick's I-lines do not include hazard/dependency information and (B) Luick's issue groups are not cached”.


The examiner disagrees. Luick teaches I-lines that are stored in an instruction cache, and issue groups that are formed from those I-lines with dependencies between the instructions taken into account (see Figs 1-2, par 0034, “The L1 cache 116 depicted in FIG. 1 may be divided into two parts, an L1 instruction cache 222 (I-cache 222) for storing I-lines as well as an L1 data cache 224 (D-cache 224) for storing D-lines”, par 0038, “The issue and dispatch circuitry 234 may also include circuitry to rotate and merge instructions in the I-line and thereby form an appropriate instruction group. Formation of issue groups may take into account several considerations, such as dependencies between the instructions in an issue group as well as optimizations which may be achieved from the ordering of instructions as described in greater detail below”).
Applicants state that “Applicant also submits that the references cannot teach or suggest "execution circuitry configured to perform the following operations multiple times while a set of instructions is stored in the cache circuitry: access the set of instructions from the cache circuitry; receive at least a portion of the hazard information from the cache circuitry, wherein the portion corresponds to the set of instructions; and execute the set of instructions according to the received hazard information" as recited in claim 21”.
The examiner disagrees. Applicant has not presented any specific argument or evidence in support of this conclusion. The Examiner directs Applicant to the claim rejections below for detailed analysis.

Claim Rejections - 35 USC § 103
 	The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

 	Claims 21, 24, 31-32, 34, and 36 are rejected under 35 U.S.C. 103 as being unpatentable over U.S. Patent 6542986 to White in view of U.S. PGPub 2008/0141253 to Luick.


	Regarding claim 21, White teaches an apparatus (abstract), comprising: 
decode circuitry (Figs 1-2, col 3:56-67, col 6:7-26, decode units) configured to decode instructions (Figs 1-2, col 3:56-67, col 6:7-26, col 9:56-67 and col 10:1-10, “Decode units 20A-20C decode instructions and convey operand requests to future file 140 and intra-line dependency checking logic 330 on buses 50A-50C respectively”); 
hazard circuitry (Fig 2, col 9:56-64, col 9:25-53, “intra-line dependency checking logic 330”) configured to generate hazard information that specifies dependencies between instructions of the decoded instructions (col 2:42-60, “The decode units are configured to concurrently decode instructions and are configured to convey operand request information.  The reorder buffer is configured to receive operand request information and convey operand information responsive to a dependency check on instructions prior in program order to the concurrently decoded instructions.  The reorder buffer is further configured to detect and convey an indication of a dependency of a second instruction of concurrently received instructions on a first instruction of concurrently received instructions, where the second instruction is subsequent to the first instruction in program order.  Finally, the reservation stations are coupled to receive and store operand information and the above mentioned indication.  In addition, the reservation stations are configured to update the stored operand information with a tag of the above mentioned first instruction, in response to detecting the indication of a dependency”; Fig 2, col 10:25-53, “If intra-line dependency checking logic 330 detects another instruction within the same dispatch line has a destination operand which is the same as the requested source operand and the other instruction is prior in program order to the instruction requesting the source operand, intra-line dependency checking logic 330 conveys an indication upon bus 310 that an intra-line dependency exists and the position within the dispatch line of the dependency” ….. White discloses dependency checking logic (hazard detection) that detects the dependency of a second instruction of concurrently received instructions on a first instruction of concurrently received instructions); 
instruction cache circuitry (Fig 2, reorder buffer) configured to store: decoded instructions; and hazard information generated by the hazard circuitry (col 2:55-67 and col 3: 1-7, “the reservation stations are coupled to receive and store operand information and the above mentioned indication.  In addition, the reservation stations are configured to update the stored operand information with a tag of the above mentioned first instruction, in response to detecting the indication of a dependency …. A plurality of instructions are concurrently decoded and operand request information is obtained from the decoded instructions.  A dependency of a second instruction of the plurality of instructions on a first instruction of the plurality of instructions is detected, where the second instruction is subsequent to the first instruction in program order.  Operand information corresponding to instructions prior to the plurality of instructions is stored for the second instruction in a reservation station.  The stored operand information is updated with a tag of the first instruction in response to detecting the above dependency”; Fig 2, col 10:25-53, “Also illustrated in FIG. 2 is the storage of instruction information by reorder buffer 32.  Instruction information storage 800 may comprise lines of storage 320 where each line is configured to store information pertaining to up to three simultaneously issued instructions, I0-I2.  In addition, instruction information storage may store shared information, SH, which is associated with the line of storage.  When instructions are issued from decode units 20A-20C, instruction information is conveyed to instruction information storage 800 and operand requests for the instructions are conveyed to future file 140 and intra-line dependency checking logic 330”); 
execution circuitry (Fig 2, function units) configured to perform the following operations multiple times while a set of instructions is stored in the cache circuitry (col 4:7-22, “Each decode unit 20A-20C is coupled to load/store unit 26 and to respective reservation stations 22A-22C. Reservation stations 22A-22C are further coupled to respective functional units 24A-24C. Additionally, decode units 20 and reservation stations 22 are coupled to register file 30 and reorder buffer 32. Functional units 24 are coupled to load/store unit 26, register file 30, and reorder buffer 32 as well”, col 6:50-67, “Instructions aligned and dispatched to reservation station 22A are executed by functional unit 24A. Similarly, issue position 1 is formed by reservation station 22B and functional unit 24B; and issue position 2 is formed by reservation station 22C and functional unit 24C “, col 8:3-57, “each of the functional units 24 is configured to perform integer arithmetic operations of addition and subtraction, as well as shifts, rotates, logical operations, and branch operations. The operations are performed in response to the control values decoded for a particular instruction by decode units 20. It is noted that a floating point unit (not shown) may also be employed to accommodate floating point operations. The floating point unit may be operated as a coprocessor, receiving instructions from MROM unit 34 or reorder buffer 32 and subsequently communicating with reorder buffer 32 to complete the instructions. Additionally, functional units 24 may be configured to perform address generation for load and store memory operations performed by load/store unit 26. In one particular embodiment, each functional unit 24 may comprise an address generation unit for generating addresses and an execute unit for performing the remaining functions”, also see Figs 2 and 3, perform functions multiple times): 
access the set of instructions from the cache circuitry (col 10:53-67 and col 11:1-11, “ the reorder buffer tag locating the instruction within reorder buffer 32 is conveyed upon operand data/tag bus 308. One operand data value and one operand tag are provided for each operand of the instruction upon operand data/tag bus 308. As above, validity indicators may be asserted for each data and tag value by reorder buffer 32, such that reservation stations 22 may discern which is being provided for a particular operand (e.g. data or reorder buffer tag). Intra-line dependency checking logic 330 provides an indication of intra-line dependencies upon bus 310 as discussed above”); 
receive at least a portion of the hazard information from the cache circuitry, wherein the portion corresponds to the set of instructions (Fig 2, col 10:25-53, “Also illustrated in FIG. 2 is the storage of instruction information by reorder buffer 32.  Instruction information storage 800 may comprise lines of storage 320 where each line is configured to store information pertaining to up to three simultaneously issued instructions, I0-I2.  In addition, instruction information storage may store shared information, SH, which is associated with the line of storage.  When instructions are issued from decode units 20A-20C, instruction information is conveyed to instruction information storage 800 and operand requests for the instructions are conveyed to future file 140 and intra-line dependency checking logic 330”); 
White, however, is silent as to executing the set of instructions according to the received hazard information.


In related endeavor, Luick teaches execution circuitry (Figs 2-3, core 114) configured to perform the following operations multiple times while a set of instructions is stored in the cache circuitry (par 0025, “executing instructions in a pipelined manner that may reduce stalls that occur when executing dependent instructions. Stalls may be reduced by utilizing a cascaded arrangement of pipelines with execution units that are delayed with respect to each other. This cascaded delayed arrangement allows dependent instructions to be issued within a common issue group by scheduling them for execution in different pipelines to execute at different times”, Figs 2-3, par 0038-0042, “ the issue and dispatch circuitry 234 may be used to form instruction groups and issue the formed instruction groups to the core 114 … The execution unit may also read data from a register file, calculate addresses, perform integer arithmetic functions (e.g., using an arithmetic logic unit, or ALU), perform floating point arithmetic functions, execute instruction branches, perform data access functions (e.g., loads and stores from memory), and store data back to registers (e.g., in the register file 240). In some cases, the core 114 may utilize instruction fetching circuitry 236, the register file 240, cache load and store circuitry 250, and write-back circuitry, as well as any other circuitry, to perform these functions”, par 0063-0064, “The concepts of cascaded, delayed, execution pipeline units presented herein, wherein the execution of one more instructions in an issue group is delayed relative to the execution of another instruction in the same group, may be applied in a variety of different configurations utilizing a variety of different types of functional units. Further, for some embodiments, multiple different configurations of cascaded, delayed, execution pipeline units may be included in the same system and/or on the same chip. 
The particular configuration or set of configurations included with a particular device or system may depend on the intended use. The fixed point execution pipeline units described above allow issue groups containing relatively simple operations that take only a few cycles to complete, such as load, store, and basic ALU operations to be executed without stalls, despite dependencies within the issue group”): 
access the set of instructions from the cache circuitry (Figs 2-3, par 0037-0038, “the issue and dispatch circuitry 234 may be used to form instruction groups and issue the formed instruction groups to the core 114. The issue and dispatch circuitry 234 may also include circuitry to rotate and merge instructions in the I-line and thereby form an appropriate instruction group”); 
receive at least a portion of the hazard information from the cache circuitry, wherein the portion corresponds to the set of instructions (par 0038, “The issue and dispatch circuitry 234 may also include circuitry to rotate and merge instructions in the I-line and thereby form an appropriate instruction group. Formation of issue groups may take into account several considerations, such as dependencies between the instructions in an issue group as well as optimizations which may be achieved from the ordering of instructions as described in greater detail below. Once an issue group is formed, the issue group may be dispatched in parallel to the processor core 114. In some cases, an instruction group may contain one instruction for each pipeline in the core 114. Optionally, the instruction group may a smaller number of instructions”); and 
execute the set of instructions according to the received hazard information (Fig 3, par 0038-0041, “one or more processor cores 114 may utilize a cascaded, delayed execution pipeline configuration. In the example depicted in FIG. 3, the core 114 contains four pipelines in a cascaded configuration…. each pipeline (P0, P1, P2, P3) in the cascaded, delayed execution pipeline configuration may contain an execution unit 310. The execution unit 310 may contain several pipeline stages which perform one or more functions for a given pipeline. For example, the execution unit 310 may perform all or a portion of the fetching and decoding of an instruction. The decoding performed by the execution unit may be shared with a predecoder and scheduler 220 which is shared among multiple cores 114 or, optionally, which is utilized by a single core 114. The execution unit may also read data from a register file, calculate addresses, perform integer arithmetic functions (e.g., using an arithmetic logic unit, or ALU), perform floating point arithmetic functions, execute instruction branches, perform data access functions (e.g., loads and stores from memory), and store data back to registers (e.g., in the register file 240)…each execution unit 310 may perform the same functions. Optionally, each execution unit 310 (or different groups of execution units) may perform different sets of functions. Also, in some cases the execution units 310 in each core 114 may be the same or different from execution units 310 provided in other cores”, par 0045, “some execution units 310 may be delayed with respect to each other while other execution units 310 are not delayed with respect to each other. Where execution of a second instruction is dependent on the execution of a first instruction, forwarding paths 312 may be used to forward the result from the first instruction to the second instruction”).
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify White to execute the set of instructions according to the received hazard information, as taught by Luick, in order to improve instruction pipelining, preferably by reducing stalls.
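
For illustration only, the intra-line dependency check described in the White passage quoted above (col 10:25-53) can be sketched in Python. This is a minimal sketch and not White's implementation: the names `Instr` and `intra_line_dependencies` are hypothetical, registers are represented as plain strings, and reporting the nearest prior producer when several prior instructions write the same register is an assumption that the quoted text does not state.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    """Hypothetical record for one instruction in a dispatch line."""
    dest: Optional[str]        # destination register, or None if there is none
    sources: Tuple[str, ...]   # requested source operands


def intra_line_dependencies(line: List[Instr]) -> List[Tuple[int, str, int]]:
    """Return (consumer position, operand, producer position) triples.

    A dependency is reported when an instruction earlier in program order,
    within the same dispatch line, has a destination equal to a requested
    source operand. Assumption: the nearest earlier producer is reported.
    """
    deps = []
    for pos, instr in enumerate(line):
        for src in instr.sources:
            # Scan earlier positions in the same line, nearest first.
            for prior in range(pos - 1, -1, -1):
                if line[prior].dest == src:
                    deps.append((pos, src, prior))
                    break
    return deps


# A three-instruction dispatch line (I0-I2), as in the quoted passage.
example_line = [
    Instr("r1", ("r2", "r3")),
    Instr("r4", ("r1", "r5")),
    Instr("r1", ("r4", "r1")),
]
example_deps = intra_line_dependencies(example_line)
```

Each triple in `example_deps` corresponds to the kind of indication the quoted passage describes: that a dependency exists, and the position within the dispatch line of the producing instruction.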

Regarding claim 24, White as modified by Luick teaches all the limitations of claim 21, and Luick further teaches wherein instructions of the set of instructions operate on different input data for different executions of the set of instructions (Figs 1-2, 4B, 5, 8, 9A-9D, abstract, par 0025, par 0029-0034, par 0040-0045, par 0052-0053, par 0054-0055, par 0063-0068, par 0073-0075, “This cascaded delayed arrangement allows dependent instructions to be issued within a common issue group by scheduling them for execution in different pipelines to execute at different times”, “instructions may be fetched from the L2 cache 112 in groups, referred to as I-lines”, “each execution unit 310 may perform the same functions.  Optionally, each execution unit 310 (or different groups of execution units) may perform different sets of functions.  Also, in some cases the execution units 310 in each core 114 may be the same or different from execution units 310 provided in other cores”, “the actual scheduling operations may be performed in a predecoder/scheduler circuit shared between multiple processor cores (each having a cascaded-delayed execution pipeline unit), while dispatching/issuing instructions may be performed by separate circuitry within a processor core …. a group of instructions to be issued is received, with the group including a second instruction dependent on a first instruction.  At step 504, the first instruction is scheduled to issue in a first pipeline having a first execution unit.  
At step 506, the second instruction is scheduled to issue in a second pipeline having a second execution unit that is delayed relative to the first execution unit ….The exact manner in which instructions are scheduled to different pipelines may vary with different embodiments and may depend, at least in part, on the exact configuration of the corresponding cascaded-delayed pipeline unit”, “The concepts of cascaded, delayed, execution pipeline units presented herein, wherein the execution of one more instructions in an issue group is delayed relative to the execution of another instruction in the same group, may be applied in a variety of different configurations utilizing a variety of different types of functional units. The fixed point execution pipeline units described above allow issue groups containing relatively simple operations that take only a few cycles to complete, such as load, store, and basic ALU operations to be executed without stalls, despite dependencies within the issue group”). This would be obvious for the same reason given in the rejection for claim 21.
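
For illustration only, the scheduling steps quoted from Luick above (a first instruction issued in a first pipeline, and a dependent second instruction issued in a pipeline that is delayed relative to the first) can be sketched in Python. This is a sketch under stated assumptions, not Luick's scheduler: the function name `schedule_issue_group` is hypothetical, a higher pipeline index is assumed to mean a later-executing pipeline, each pipeline is assumed to accept one instruction per issue group, and the greedy lowest-free-pipeline policy is an assumption, since the quoted text states that the exact manner of scheduling may vary.

```python
from typing import List, Optional


def schedule_issue_group(producers: List[Optional[int]],
                         num_pipelines: int = 4) -> List[int]:
    """Assign each instruction of one issue group to a pipeline index.

    producers[i] is the index of the earlier instruction in the group that
    instruction i depends on, or None if it has no in-group dependency.
    Assumption: pipeline k+1 is delayed relative to pipeline k, so a
    dependent instruction must land in a higher-indexed pipeline than
    its producer.
    """
    assignment: List[int] = []
    used = set()
    for i, producer in enumerate(producers):
        if producer is not None and not 0 <= producer < i:
            raise ValueError("producer must be an earlier instruction in the group")
        # Independent instructions may start at pipeline 0; dependent ones
        # must start after the producer's pipeline.
        lowest = 0 if producer is None else assignment[producer] + 1
        pipe = next((p for p in range(lowest, num_pipelines) if p not in used), None)
        if pipe is None:
            raise ValueError("instruction does not fit in this issue group")
        used.add(pipe)
        assignment.append(pipe)
    return assignment


# Instruction 1 depends on instruction 0; instruction 3 depends on instruction 2.
example_assignment = schedule_issue_group([None, 0, None, 2])
```

In this sketch a chain of dependent instructions is spread across progressively more delayed pipelines, which is the property the quoted passage relies on to issue dependent instructions within a common issue group.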

Regarding claim 31, White as modified by Luick teaches all the limitations of claim 21, and White further teaches wherein the apparatus is a computing device that includes: a processor that includes the hazard circuitry and the instruction cache circuitry; and a display unit (Fig 5, col 14:13-31, “Graphics controller 208 is provided to control the rendering of text and images on a display 226. Graphics controller 208 may embody a typical graphics accelerator generally known in the art to render three-dimensional data structures which can be effectively shifted into and from main memory 204”).

Regarding claim 32, method claim 32 is similar in scope to claim 21 and is rejected under the same rationale.

Regarding claim 34, White as modified by Luick teaches all the limitations of claim 32; claim 34 is similar in scope to claim 24 and is rejected under the same rationale.

Regarding claim 36, claim 36 is similar in scope to claim 21 and is rejected under the same rationale.

Claims 23, 25-26, 29-30, 35, 37, and 40 are rejected under 35 U.S.C. 103 as being unpatentable over U.S. Patent 6542986 to White in view of U.S. PGPub 2008/0141253 to Luick, further in view of U.S. PGPub 2013/0159628 to Choquette et al.

Regarding claim 23, White as modified by Luick teaches all the limitations of claim 21, but does not explicitly teach wherein the execution circuitry is single-instruction multiple-data (SIMD) execution circuitry configured to execute the set of instructions for multiple threads in parallel.
In a related endeavor, Choquette et al. teach wherein the execution circuitry is single-instruction multiple-data (SIMD) execution circuitry configured to execute the set of instructions for multiple threads in parallel (Fig 3B, par 0004, par 0041, “SIMD (single instruction, multiple data) architecture processors execute the same instruction on each of the multiple cores where each core executes on different input data”, Figs 2 and 3C, par 0025-0026, par 0059, par 0063-0064, par 0068, disclosing a parallel processing system that includes a parallel processing subsystem able to execute instructions in parallel using execution units that implement a datapath).
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify White as modified by Luick to include wherein the execution circuitry is single-instruction multiple-data (SIMD) execution circuitry configured to execute the set of instructions for multiple threads in parallel, as taught by Choquette et al., so that two or more threads can execute substantially simultaneously using the resources of a single processing core, with the same instruction executed on each of multiple cores on different input data, and to improve the technique for loading values from the register file into the inputs of a datapath of a processor core, thereby reducing latencies and improving overall processing efficiency.
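
For illustration only, the SIMD behavior quoted from Choquette et al. above (the same instruction executed on each lane, with each lane operating on different input data) can be sketched in Python. This is a sequential model of the concept and not Choquette's hardware: the function name `simd_execute` and the list-of-lanes representation are hypothetical, and the lanes are evaluated one after another here rather than in parallel.

```python
import operator
from typing import Callable, List, Sequence


def simd_execute(op: Callable[[int, int], int],
                 lanes_a: Sequence[int],
                 lanes_b: Sequence[int]) -> List[int]:
    """Apply one instruction (op) to every lane's own pair of operands."""
    if len(lanes_a) != len(lanes_b):
        raise ValueError("both operand vectors must have one entry per lane")
    # Same operation for every lane; only the input data differs per lane.
    return [op(a, b) for a, b in zip(lanes_a, lanes_b)]


# One "add" instruction applied across four lanes with different inputs.
example_sums = simd_execute(operator.add, [1, 2, 3, 4], [10, 20, 30, 40])
```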

Regarding claim 25, White as modified by Luick teaches all the limitations of claim 21, but is silent as to wherein the set of instructions is a clause of instructions that is assigned a clause identifier, wherein once a clause is invoked for execution, the execution circuitry is configured to execute all instructions in the clause.
In a related endeavor, Choquette et al. teach wherein the set of instructions is a clause of instructions that is assigned a clause identifier, wherein once a clause is invoked for execution, the execution circuitry is configured to execute all instructions in the clause (Fig 3C, par 0004, par 0041-0048, “each GPC 208 includes a number M of SMs 310, where M ≥ 1, each SM 310 configured to process one or more thread groups.  Also, each SM 310 advantageously includes an identical set of functional execution units (e.g., execution units and load-store units--shown as Exec units 302 and LSUs 303 in FIG. 3C) that may be pipelined, allowing a new instruction to be issued before a previous instruction has finished, as is known in the art.  Any combination of functional execution units may be provided …. Each SM 310 contains a level one (L1) cache (shown in FIG. 3C) or uses space in a corresponding L1 cache outside of the SM 310 that is used to perform load and store operations”; par 0051-0054, “FIG. 3C is a block diagram of the SM 310 of FIG. 3B, according to one embodiment of the present disclosure.  The SM 310 includes an instruction L1 cache 370 that is configured to receive instructions and constants from memory via L1.5 cache 335.  A warp scheduler and instruction unit 312 receives instructions and constants from the instruction L1 cache 370 and controls local register file 304 and SM 310 functional units according to the instructions and constants.  The SM 310 functional units include N exec (execution or processing) units 302 and P load-store units (LSU) 303. …. Each thread in the thread array is assigned a unique thread identifier ("thread ID") that is accessible to the thread during the thread's execution.  The thread ID, which can be defined as a one-dimensional or multi-dimensional numerical value controls various aspects of the thread's processing behavior.  
For instance, a thread ID may be used to determine which portion of the input data set a thread is to process and/or to determine which portion of an output data set a thread is to produce or write …. A warp scheduler and instruction unit 312 receives instructions and constants from the instruction L1 cache 370 and controls local register file 304 and SM 310 functional units according to the instructions and constants”; Fig 6, par 0074-0076, “At step 616, SM 310 updates the cache table to reflect the register indices corresponding to register values stored in operand collector 440.  At step 618, SM 310 configures ALU 452 to execute the instruction using the operands stored in the operand collector 440 as inputs to ALU 452“).
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify White as modified by Luick to include wherein the set of instructions is a clause of instructions that is assigned a clause identifier, wherein once a clause is invoked for execution, the execution circuitry is configured to execute all instructions in the clause, as taught by Choquette et al., so that two or more threads can execute substantially simultaneously using the resources of a single processing core, with the same instruction executed on each of multiple cores on different input data, and to improve the technique for loading values from the register file into the inputs of a datapath of a processor core, thereby reducing latencies and improving overall processing efficiency.
Regarding claim 26, White as modified by Luick teaches all the limitations of claim 21, but is silent as to wherein the execution circuitry includes a plurality of data path units configured to receive instructions from the cache circuitry and execute received instructions in parallel with other data path units; and wherein respective ones of the plurality of data path units are configured to execute received instructions in parallel within the data path unit using a plurality of execute modules.
In a related endeavor, Choquette et al. teach wherein the execution circuitry includes a plurality of data path units configured to receive instructions from the cache circuitry (Fig 4, par 0062-0063, “Execution unit 302 implements a datapath for performing an operation.  The datapath includes the operand collector 440, the arithmetic logic unit (ALU) 452, and a result FIFO 454”  …. Choquette et al. disclose that the execution unit receives instructions from memory) and execute received instructions in parallel with other data path units (Figs 2 and 3C, par 0025-0026, par 0059, par 0063-0064, par 0068, disclosing a parallel processing system that includes a parallel processing subsystem able to execute instructions in parallel using execution units that implement a datapath); and wherein respective ones of the plurality of data path units are configured to execute received instructions in parallel within the data path unit using a plurality of execute modules (Figs 2 and 3C, par 0025-0026, par 0059, par 0063-0064, par 0068, disclosing that a parallel processing subsystem can execute instructions in parallel using execution units that implement a datapath).
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify White as modified by Luick to include wherein the execution circuitry includes a plurality of data path units configured to receive instructions from the cache circuitry, as taught by Choquette et al., so that two or more threads can execute substantially simultaneously using the resources of a single processing core, with the same instruction executed on each of multiple cores on different input data, and to improve the technique for loading values from the register file into the inputs of a datapath of a processor core, thereby reducing latencies and improving overall processing efficiency.

Regarding claim 29, White as modified by Luick teaches all the limitation of claim 21, but keeps silent for teaching wherein the instruction cache circuitry is configured to store decoded instructions using at least one of the following circuit types: flip-flop circuitry and latch circuitry.
In related endeavor, Choquette et al. further teach wherein the instruction cache circuitry is configured to store decoded instructions using at least one of the following circuit types: flip-flop circuitry and latch circuitry (par 0062-0063, “Operand collector 440 includes a number of storage elements 441-446 that may be coupled to the inputs of the ALU 452.  Each storage element 441-446 may be a flip-flop, latch, or any other technically feasible circuit component capable of temporarily storing a value to supply as an input to the  ALU 452”).
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify White as modified by Luick to include wherein the instruction cache circuitry is configured to store decoded instructions using at least one of the following circuit types: flip-flop circuitry and latch circuitry as taught by Choquette et al., in order to enable two or more threads to execute substantially simultaneously using the resources of a single processing core, to execute the same instruction on each of multiple cores where each core operates on different input data, and to improve the technique for loading values from the register file into the inputs of a datapath of a processor core, thereby reducing latencies and improving overall processing efficiency.

Regarding claim 30, White as modified by Luick teaches all the limitations of claim 21, but is silent as to wherein the apparatus is a graphics processor configured to perform pixel rendering tasks and compute tasks.
In related endeavor, Choquette et al. further teach wherein the apparatus is a graphics processor configured to perform pixel rendering tasks and compute tasks (par 0026, “Referring again to FIG. 1 as well as FIG. 2, in some embodiments, some or all of PPUs 202 in parallel processing subsystem 112 are graphics processors with rendering pipelines that can be configured to perform various operations related to generating pixel data from graphics data supplied by CPU 102 and/or system memory 104 via memory bridge 105 and the second communication path 113, interacting with local parallel processing memory 204 (which can be used as graphics memory including, e.g., a conventional frame buffer) to store and update pixel data, delivering pixel data to display device 110, and the like “, par 0031-0032, “GPCs 208 receive processing tasks to be executed from a work distribution unit within a task/work unit 207. The work distribution unit receives pointers to processing tasks that are encoded as task metadata (TMD) and stored in memory  … Render targets, such as frame buffers or texture maps may be stored across DRAMs 220, allowing partition units 215 to write portions of each render target in parallel to efficiently use the available bandwidth of parallel processing memory 204 “, par 0034, “GPCs 208 can be programmed to execute processing tasks relating to a wide variety of applications, including but not limited to, linear and nonlinear data transforms, filtering of video and/or audio data, modeling operations (e.g., applying laws of physics to determine position, velocity and other attributes of objects), image rendering operations (e.g., tessellation shader, vertex shader, geometry shader, and/or pixel shader programs), and so on. 
PPUs 202 may transfer data from system memory 104 and/or local parallel processing memories 204 into internal (on-chip) memory, process the data, and write result data back to system memory 104 and/or local parallel processing memories 204, where such data can be accessed by other system components, including CPU 102 or another parallel processing subsystem 112”, par 0038-0040, “FIG. 3A is a block diagram of the task/work unit 207 of FIG. 2, according to one embodiment of the present disclosure. The task/work unit 207 includes a task management unit 300 and the work distribution unit 340. The task management unit 300 organizes tasks to be scheduled based on execution priority levels”). 
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify White as modified by Luick to include wherein the apparatus is a graphics processor configured to perform pixel rendering tasks and compute tasks as taught by Choquette et al., in order to enable two or more threads to execute substantially simultaneously using the resources of a single processing core, to execute the same instruction on each of multiple cores where each core operates on different input data, and to efficiently use the available bandwidth of parallel processing memory when performing graphics rendering tasks for display.

Regarding claim 35, White as modified by Luick teaches all the limitations of claim 32. Claim 35 is similar in scope to claim 25 and is rejected under the same rationale.

Regarding claims 37 and 40, White as modified by Luick teaches all the limitations of claim 36. Claims 37 and 40 are similar in scope to claims 25 and 29 and are rejected under the same rationale.

Claims 22, 27, 33, and 38 are rejected under 35 U.S.C. 103 as being unpatentable over U.S. Patent 6542986 to White in view of U.S. PGPubs 2008/0141253 to Luick, further in view of U.S. PGPubs 2013/0173886 to Dockser et al.

Regarding claim 22, White as modified by Luick teaches all the limitations of claim 21, but is silent as to wherein the hazard circuitry is configured to compare one or more operand locations of a received instruction with one or more operand locations of a threshold number of previously-received instructions to generate the hazard information.
In related endeavor, Dockser et al. further teach wherein the hazard circuitry is configured to compare one or more operand locations of a received instruction with one or more operand locations of a threshold number of previously-received instructions to generate the hazard information (par 0025-0026, par 0029, “Both expanded and non-expanded instructions in pipeline stage 112 may be checked for hazard conditions against instructions OOQ 106 using hazard detection logic 114.  A detailed implementation of hazard detection logic 114 has been provided with reference to FIG. 2. As previously described, each of entries 106_0-106_15 of OOQ 106 may comprise instructions comprising a maximum of three (3) operand fields, and the total number of operand fields of instructions in pipeline stage 112 of pipelines VX 116, VL 118, and VS 120 is seven (7) … It will be recognized that these 21 comparisons includes comparisons of all source and destination operand fields of instructions in pipeline stage 112 with each of entries 106_0-106_15” …. comparisons of all source and destination operand fields of instructions to detect the hazard).
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify White as modified by Luick to include wherein the hazard circuitry is configured to compare one or more operand locations of a received instruction with one or more operand locations of a threshold number of previously-received instructions to generate the hazard information as taught by Dockser et al. to provide an efficient technique for detecting data hazards for instructions comprising operands expressed in terms of a range of registers, without requiring expansion.
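For illustration only, and not as a reproduction of any circuit in White, Luick, or Dockser et al., the comparison recited in claim 22 can be sketched in software as follows. The sketch assumes each instruction names at most one destination register and a few source registers, and it keeps a window of the most recent instructions whose size stands in for the claimed "threshold number". The names Instr, detect_hazards, and scan are hypothetical and do not appear in the cited references.

```python
from collections import deque
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record: one destination, zero or more sources."""
    dest: Optional[str]
    srcs: Tuple[str, ...] = ()


def detect_hazards(new: Instr, window) -> List[Tuple[int, str]]:
    """Compare operand locations of `new` against each earlier instruction.

    Returns (window_index, kind) pairs, where kind is RAW, WAR, or WAW.
    """
    hazards = []
    for i, old in enumerate(window):
        if old.dest is not None and old.dest in new.srcs:
            hazards.append((i, "RAW"))  # new reads what old writes
        if new.dest is not None and new.dest in old.srcs:
            hazards.append((i, "WAR"))  # new writes what old reads
        if new.dest is not None and new.dest == old.dest:
            hazards.append((i, "WAW"))  # both write the same location
    return hazards


def scan(program: List[Instr], threshold: int) -> List[List[Tuple[int, str]]]:
    """Generate hazard information for each instruction in program order.

    Only the `threshold` most recently received instructions are compared.
    """
    window = deque(maxlen=threshold)
    info = []
    for ins in program:
        info.append(detect_hazards(ins, window))
        window.append(ins)
    return info


program = [
    Instr("r1", ("r2", "r3")),  # r1 <- r2 op r3
    Instr("r4", ("r1",)),       # r4 <- r1   (reads r1)
    Instr("r1", ("r5",)),       # r1 <- r5   (rewrites r1)
]
hazard_info = scan(program, threshold=3)
```

With a smaller threshold the older instructions fall out of the window, so a dependency on them is no longer reported; that is the practical effect of bounding the comparison to a fixed number of previously received instructions.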

Regarding claim 27, White as modified by Luick teaches all the limitations of claim 21, and further teaches expansion circuitry configured to: receive the set of instructions from the cache circuitry for execution by the execution circuitry (White: col 10:53-67 and col 11:1-11, “ the reorder buffer tag locating the instruction within reorder buffer 32 is conveyed upon operand data/tag bus 308. One operand data value and one operand tag are provided for each operand of the instruction upon operand data/tag bus 308. As above, validity indicators may be asserted for each data and tag value by reorder buffer 32, such that reservation stations 22 may discern which is being provided for a particular operand (e.g. data or reorder buffer tag). Intra-line dependency checking logic 330 provides an indication of intra-line dependencies upon bus 310 as discussed above”, Luick: Figs 2-3, par 0038-0042, “ the issue and dispatch circuitry 234 may be used to form instruction groups and issue the formed instruction groups to the core 114 … The execution unit may also read data from a register file, calculate addresses, perform integer arithmetic functions (e.g., using an arithmetic logic unit, or ALU), perform floating point arithmetic functions, execute instruction branches, perform data access functions (e.g., loads and stores from memory), and store data back to registers (e.g., in the register file 240). In some cases, the core 114 may utilize instruction fetching circuitry 236, the register file 240, cache load and store circuitry 250, and write-back circuitry, as well as any other circuitry, to perform these functions”), but is silent as to expansion circuitry configured to: generate, for a received instruction that the execution circuitry is not configured to perform natively, multiple instructions that are performable by the execution circuitry.
In related endeavor, Dockser et al. further teach expansion circuitry configured to: generate, for a received instruction that the execution circuitry is not configured to perform natively, multiple instructions that are performable by the execution circuitry (par 0007, par 0013, “the methodology may further involve receiving a SIMD instruction calling for processing of 128-bit data and expanding that SIMD instruction to two instructions calling for processing of data of the 64-bit data width.  The method then involves executing the two instructions resulting from the expansion in sequence through the one operational 64-bit ALU”; Fig 1, par 0024, par 0032, “The IQ stage 15 supplies instructions, in sequence, to an instruction expand stage 17.  The SIMD co-processor 11 can provide parallel processing in a number of different data width modes.  Although there may be more modes or variations in the data widths supported in each mode, the example shows a configuration of the co-processor 11 supporting 64-bit operation and 128-bit operation”).
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to modify White as modified by Luick to include expansion circuitry configured to: generate, for a received instruction that the execution circuitry is not configured to perform natively, multiple instructions that are performable by the execution circuitry as taught by Dockser et al. to provide a technique to selectively control power to a parallel element of a SIMD processor or the like, so as to effectively reduce power consumption.
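For illustration only, the expansion described in the quoted passage (a 128-bit SIMD instruction expanded into two instructions of the 64-bit data width) can be sketched as follows. This is a minimal software analogy under stated assumptions, not the circuit of any cited reference: the record type Op, the constant NATIVE_WIDTH, and the function expand are hypothetical names, and the bit-offset field is an assumed way of identifying which half of the operand each resulting instruction processes.

```python
from dataclasses import dataclass
from typing import List

NATIVE_WIDTH = 64  # assumed width, in bits, that the execution unit handles natively


@dataclass(frozen=True)
class Op:
    """Hypothetical instruction record: mnemonic, data width, operand bit offset."""
    name: str
    width: int
    offset: int = 0


def expand(op: Op) -> List[Op]:
    """Split an instruction wider than NATIVE_WIDTH into native-width instructions.

    Instructions already at or below the native width pass through unchanged.
    """
    if op.width <= NATIVE_WIDTH:
        return [op]
    if op.width % NATIVE_WIDTH != 0:
        raise ValueError("width must be a multiple of the native width")
    return [
        Op(op.name, NATIVE_WIDTH, op.offset + off)
        for off in range(0, op.width, NATIVE_WIDTH)
    ]


wide = Op("vadd", 128)
expanded = expand(wide)      # two 64-bit instructions, issued in sequence
unchanged = expand(Op("vadd", 64))
```

The list returned by expand preserves order, so the resulting instructions can be issued in sequence through a single native-width unit, consistent with the quoted description.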

Regarding claim 33, White as modified by Luick teaches all the limitations of claim 32. Claim 33 is similar in scope to claim 22 and is rejected under the same rationale.

Regarding claim 38, White as modified by Luick teaches all the limitations of claim 36. Claim 38 is similar in scope to claim 27 and is rejected under the same rationale.

Allowable Subject Matter
 	Claims 28 and 39 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
The following is a statement of reasons for the indication of allowable subject matter: The cited prior art fails to teach the combination of elements recited in claims 28 and 39, including "wherein the expansion circuitry is further configured to notify the execution circuitry when an instruction will be expanded to multiple instructions; and wherein the execution circuitry is configured to identify first and last instructions of the set of instructions based on the notification from the expansion circuitry".

Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Jin Ge whose telephone number is (571)272-5556. The examiner can normally be reached 8:00 to 5:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jennifer Mehmood can be reached on (571)272-2976. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

JIN GE
Examiner
Art Unit 2616



/JIN GE/Primary Examiner, Art Unit 2612