DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA. Claims 1-2 and 4-20 are pending under this Office action.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 01/13/2021 has been entered.


Response to Amendment
Applicant's arguments filed on January 13, 2021, have been fully considered.
Applicant argues that independent claims 1, 10, and 17 are amended with the new limitation of "a load-store unit (LSU) coupled to the first PE, the LSU accessing the register file through the first PE and the LSU unable to access the second PE" (emphasis added). Applicant argues that the prior art of record does not disclose or suggest the features as claimed in independent claim 1.
Examiner replies that the newly added limitations may overcome the current rejection. However, new art has been found, and the new art Benthin et al. (US 20180293784 

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-2 and 4-20 are rejected under 35 U.S.C. 103 as being unpatentable over Alsup et al. (US 20150324198 A1) in view of Lauritzen et al. (US 20180286112 A1), and further in view of Benthin et al. (US 20180293784 A1).
Regarding claim 1, Alsup teaches that a shader core (See Alsup: Figs. 4 and 16, and [0178], "FIG. 16 shows an example block diagram of a graphics pipeline 1600 for a graphics processor or GPU, according to an embodiment. Without loss of generality, in one embodiment a shader core comprises of 4 processing element 400 (FIG. 4) pairs and a number of fixed function units"), comprising:
a first processing element (PE) comprising a first predetermined number of execution units (See Alsup: Figs. 2-4 and 16, and [0056], "In one embodiment, a number of work units 305 are bound into a single hardware thread and then a number of those threads are bound together to execute a shader program into a structure referred to as a WARP. A WARP binds a multiplicity of work units 305 into a single point of control. Without loss of generality, the WARP may contain up to 32 hardware threads, and a compiler of a GPU (e.g., part of the GPU module 129, FIG. 2) may pack up to 4 units of work 305 (e.g., braid -4 330) into a single hardware thread. Without loss of generality, a processing element 400 (FIG. 4) may manage up to 8 WARPs");
a second PE comprising a second predetermined number of execution units, the second predetermined number of execution units being less than the first predetermined number of execution units (See Alsup: Figs. 4 and 16, and [0187], "In one embodiment, a graphics processing slice consists of eight processing elements 400 (FIG. 4), a number of fixed function units, and an interface to the GPU network"; and [0178], "In one embodiment, some of the  
a register file shared by the first PE and the second PE (See Alsup: Figs. 4-5, and [0058], "Without loss of generality, the register file 420 (FIG. 4) contains 32KBytes of storage, which may be allocated to various WARPs. Without loss of generality, when the shader program uses 32 or fewer registers per thread, all 8 WARPs may be active simultaneously. In many embodiments, WARPs from different shaders may have different sized Register Files. Without loss of generality, the size of a given register file 420 is found in the shader header 610 (FIG. 5)"; and Fig. 5, and [0082], "In one embodiment, when the Fixed Function Specifier bit 613 (F) is set, the first trace 620-621 in a shader 600 (i.e., trace number 0 or Trace 0) contains instructions for fixed function units. These instructions run autonomously and potentially concurrently with WARP execution. If the F bit 613 is not set, then trace 0 is the first trace 620-621 to be executed by the shader program");
a load-store unit (LSU) coupled to the first PE, the LSU accessing the register file through the first PE and the LSU unable to access the second PE; and
a warp sequencer unit (WSQ) coupled to the first PE and to the second PE (See Alsup: Fig. 4, and [0072], "Without loss of generality, in one embodiment the Instruction Store 410 contains the instruction decoder and the instruction sequencer"), the WSQ scheduling an instruction trace to execute on the first PE or the second PE based on information contained in a trace header of the instruction trace (See Alsup: Fig. 6, and [0063], "In one example embodiment, a trace 650 (FIG. 6) is a shader program fragment and consists of a trace header 
However, Alsup fails to explicitly disclose a first processing element (PE) comprising a first predetermined number of execution units; and a load-store unit (LSU) coupled to the first PE, the LSU accessing the register file through the first PE and the LSU unable to access the second PE.
However, Lauritzen teaches that a first processing element (PE) comprising a first predetermined number of execution units (See Lauritzen: Fig. 5, and [0115], "For example, a shader unit (e.g., graphics multiprocessor 234 of FIG. 3) may be configured to perform the functions of one or more of a vertex processing unit 504, a tessellation control processing unit 508, a tessellation evaluation processing unit 512, a geometry processing unit 516, and a 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Alsup to have a first processing element (PE) comprising a first predetermined number of execution units as taught by Lauritzen in order to support a wider variety of operations for processing vertex and fragment data (See Lauritzen: [0002], "Current parallel graphics data processing includes systems and methods developed to perform specific operations on graphics data such as, for example, linear interpolation, tessellation, rasterization, texture mapping, depth testing, etc. Traditionally, graphics processors used fixed function computational units to process graphics data; however, more recently, portions of graphics processors have been made programmable, enabling such 
However, Alsup, modified by Lauritzen, fails to explicitly disclose that a load-store unit (LSU) coupled to the first PE, the LSU accessing the register file through the first PE and the LSU unable to access the second PE. 
However, Benthin teaches that a load-store unit (LSU) coupled to the first PE, the LSU accessing the register file through the first PE and the LSU unable to access the second PE (See Benthin: Figs. 23A-D, and [0189], "The GPGPU cores 2362 and load/store units 2366 are coupled with cache memory 2372 and shared memory 2370 via a memory and cache interconnect 2368"; and [0191], "The register file 2358 provides a set of registers for the functional units of the graphics multiprocessor 2424. The register file 2358 provides temporary storage for operands connected to the data paths of the functional units (e.g., GPGPU cores 2362, load/store units 2366) of the graphics multiprocessor 2424. In one embodiment, the register file 2358 is divided between each of the functional units such that each functional unit is allocated a dedicated portion of the register file 2358. In one embodiment, the register file 2358 is divided between the different warps being executed by the graphics multiprocessor 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Alsup to have a load-store unit (LSU) coupled to the first PE, the LSU accessing the register file through the first PE and the LSU unable to access the second PE as taught by Benthin in order to enable rapid preemption and context switching of threads executing on the processing array (See Benthin: Figs. 23A-D, and [0168], "When the host interface 2306 receives a command buffer via the I/O unit 2304, the host interface 2306 can direct work operations to perform those commands to a front end 2308. In one embodiment the front end 2308 couples with a scheduler 2310, which is configured to distribute commands or other work items to a processing cluster array 2312. In one embodiment the scheduler 2310 ensures that the processing cluster array 2312 is properly configured and in a valid state before tasks are distributed to the processing clusters of the processing cluster array 2312. In one embodiment the scheduler 2310 is implemented via firmware logic executing on a microcontroller. The microcontroller implemented scheduler 2310 is configurable to perform complex scheduling and work distribution operations at coarse and fine granularity, enabling rapid preemption and context switching of threads executing on the processing array 2312. In one embodiment, the host software can prove workloads for scheduling on the processing array 2312 via one of multiple graphics processing doorbells. The workloads can then be automatically distributed across the processing array 2312 by the 
Regarding claim 2, Alsup, Lauritzen, and Benthin teach all the features with respect to claim 1 as outlined above. Further, Alsup teaches that the shader core of claim 1, wherein the information contained in the trace header indicates whether the instruction trace is executable on the second PE (See Alsup: Fig. 5, and [0082], "Without loss of generality, in one embodiment the shader header 610 contains a trace count 611 of the number of traces 620-621 in the shader program, the register count 612 of the number of registers per thread, group control information 615, and a Fixed Function bit 613. Without loss of generality, in one embodiment immediately following the shader header 610 is the Active Search Table 616 that includes the same number of bits as there are traces in the shader program").
Regarding claim 4, Alsup, Lauritzen, and Benthin teach all the features with respect to claim 1 as outlined above. Further, Alsup and Lauritzen teach that the shader core of claim 1, wherein the first predetermined number of execution units comprises a third predetermined 
wherein the second predetermined number of execution units comprises a fourth predetermined number of types of execution units, the fourth predetermined number of types of execution units being less than the third predetermined number of types of execution units (See Alsup: Figs. 4 and 16, and [0188], "In one embodiment, a register address presented to 
Regarding claim 5, Alsup, Lauritzen, and Benthin teach all the features with respect to claim 4 as outlined above. Further, Alsup teaches that the shader core of claim 4, wherein the fourth predetermined number of types of execution units includes a floating-point-type of execution unit and an integer-processing-type of execution unit (See Alsup: Fig. 4, and [0075], "Without loss of generality, the FMAD units perform single precision floating point arithmetic instructions. Without loss of generality, the Integer unit performs most integer arithmetic, logic operations, and memory address calculations. Without loss of generality, the BIT manipulation unit performs shifting and bit manipulation operations").
Regarding claim 6, Alsup, Lauritzen, and Benthin teach all the features with respect to claim 4 as outlined above. Further, Lauritzen teaches that the shader core of claim 4, wherein the third predetermined number of types of execution units includes at least one of a floating-point-type of execution unit, an integer-processing-type of execution unit, a sine-function-type of execution unit, a cosine-function-type of execution number, a reciprocal-function-type of execution unit, a square-root-function-type of execution unit, and a format-conversion-type execution unit (See Lauritzen: Fig. 2C, and [0052], "The functional execution logic supports a variety of operations including integer and floating point arithmetic, comparison operations, Boolean operations, bit-shifting, and computation of various algebraic functions").
Regarding claim 7, Alsup, Lauritzen, and Benthin teach all the features with respect to claim 1 as outlined above. Further, Alsup and Lauritzen teach that the shader core of claim 1, wherein the register file comprises a vector register file (See Lauritzen: Fig. 21, and [0238], "For 
Regarding claim 8, Alsup, Lauritzen, and Benthin teach all the features with respect to claim 7 as outlined above. Further, Alsup teaches that the shader core of claim 7, wherein the vector register file comprises two read ports and two write ports, and wherein the scalar register file comprises two read ports and two write ports (See Alsup: Fig. 10, and [0073], "Without loss of generality, a set of flip-flops known as a collector is used to sequence values out of and in to the SRAM based register file. The SRAM instance is read and written twice as wide as the desired operand or result. Over a 2 cycle period, one pair of operands is read then a successive pair of operands is read. Then over a second 2 cycle period, first one value of a pair and then the other value of the pair is delivered to an operand bus or received from the result bus by the collectors. By this means, the register file appears to have 2 ports while the SRAM has but 1 port").
Regarding claim 9, Alsup, Lauritzen, and Benthin teach all the features with respect to claim 1 as outlined above. Further, Alsup teaches that the shader core of claim 1, wherein the shader core is part of a graphics processing unit (GPU) that comprises at least one shader core (See Alsup: Fig. 4, and [0085], "Without loss of generality, constant scratch is shared across 4 
Regarding claim 10, Alsup, Lauritzen, and Benthin teach all the features with respect to claim 1 as outlined above. Further, Alsup, Lauritzen, and Benthin teach that a graphics processor unit (GPU) (See Alsup: Figs. 4 and 16, and [0178], "FIG. 16 shows an example block diagram of a graphics pipeline 1600 for a graphics processor or GPU, according to an embodiment. Without loss of generality, in one embodiment a shader core comprises of 4 processing element 400 (FIG. 4) pairs and a number of fixed function units"), comprising:
at least one shader core (See Alsup: Fig. 4, and [0085], "Without loss of generality, constant scratch is shared across 4 processing elements 400 (FIG. 4) in a Shader core of a GPU"), the shader core comprising:
a first processing element (PE) (See Alsup: Figs. 2-4 and 16, and [0056], "In one embodiment, a number of work units 305 are bound into a single hardware thread and then a number of those threads are bound together to execute a shader program into a structure referred to as a WARP. A WARP binds a multiplicity of work units 305 into a single point of control. Without loss of generality, the WARP may contain up to 32 hardware threads, and a compiler of a GPU (e.g., part of the GPU module 129, FIG. 2) may pack up to 4 units of work 305 (e.g., braid -4 330) into a single hardware thread. Without loss of generality, a processing element 400 (FIG. 4) may manage up to 8 WARPs") comprising a first predetermined number of execution units (See Lauritzen: Fig. 5, and [0115], "For example, a shader unit (e.g., graphics 
a second PE comprising a second predetermined number of execution units, the second predetermined number of execution units being less than the first predetermined number of execution units (See Alsup: Figs. 4 and 16, and [0187], "In one embodiment, a graphics processing slice consists of eight processing elements 400 (FIG. 4), a number of fixed function units, and an interface to the GPU network"; and [0178], "In one embodiment, some of the fixed function units (e.g., the Load Store) are distributed with the processing element 400 pairs, 
a register file shared by the first PE and the second PE (See Alsup: Figs. 4-5, and [0058], "Without loss of generality, the register file 420 (FIG. 4) contains 32KBytes of storage, which may be allocated to various WARPs. Without loss of generality, when the shader program uses 32 or fewer registers per thread, all 8 WARPs may be active simultaneously. In many embodiments, WARPs from different shaders may have different sized Register Files. Without loss of generality, the size of a given register file 420 is found in the shader header 610 (FIG. 5)"; and Fig. 5, and [0082], "In one embodiment, when the Fixed Function Specifier bit 613 (F) is set, the first trace 620-621 in a shader 600 (i.e., trace number 0 or Trace 0) contains instructions for fixed function units. These instructions run autonomously and potentially concurrently with WARP execution. If the F bit 613 is not set, then trace 0 is the first trace 620-621 to be executed by the shader program");
a load-store unit (LSU) coupled to the first PE, the LSU accessing the register file through the first PE and the LSU unable to access the second PE (See Benthin: Figs. 23A-D, and [0189], "The GPGPU cores 2362 and load/store units 2366 are coupled with cache memory 2372 and shared memory 2370 via a memory and cache interconnect 2368"; and [0191], "The register file 2358 provides a set of registers for the functional units of the graphics multiprocessor 2424. The register file 2358 provides temporary storage for operands connected to the data paths of the functional units (e.g., GPGPU cores 2362, load/store units 2366) of the graphics multiprocessor 2424. In one embodiment, the register file 2358 is divided between each of the functional units such that each functional unit is allocated a dedicated portion of the register 
a warp sequencer unit (WSQ) coupled to the first PE and to the second PE (See Alsup: Fig. 4, and [0072], "Without loss of generality, in one embodiment the Instruction Store 410 contains the instruction decoder and the instruction sequencer"), the WSQ scheduling an instruction trace to execute on the first PE or the second PE based on information contained in a trace header of the instruction trace (See Alsup: Fig. 6, and [0063], "In one example embodiment, a trace 650 (FIG. 6) is a shader program fragment and consists of a trace header 670 and a number of instructions 660-661. Without loss of generality, in one embodiment the trace header 670 specifies a set of resources that must be available prior to running the instructions 660-661 with the trace 650 and a set of outstanding request that must have been performed prior to scheduling this WARP back into execution. The WARP scheduler uses this information in deciding which WARP to schedule"; and [0087], "Without loss of generality, in one embodiment the trace header 670 contains fields used to represent the outstanding events that must occur prior to this WARP being <re>scheduled. Without loss of generality, in one embodiment the trace header 670 includes fields for up to 8-outstanding memory references 679, up to 8-outstanding texture references 678, and up to 8 outstanding Interpolation references (IPA) 677 simultaneously. In one example embodiment, another field 
Regarding claim 11, Alsup, Lauritzen, and Benthin teach all the features with respect to claim 10 as outlined above. Further, Alsup teaches that the GPU of claim 10, wherein the information contained in the trace header indicates whether the instruction trace is executable on the second PE (See Alsup: Fig. 5, and [0082], "Without loss of generality, in one embodiment the shader header 610 contains a trace count 611 of the number of traces 620-621 in the shader program, the register count 612 of the number of registers per thread, group control information 615, and a Fixed Function bit 613. Without loss of generality, in one embodiment immediately following the shader header 610 is the Active Search Table 616 that includes the same number of bits as there are traces in the shader program").
Regarding claim 12, Alsup, Lauritzen, and Benthin teach all the features with respect to claim 10 as outlined above. Further, Alsup and Lauritzen teach that the GPU of claim 10, wherein the first predetermined number of execution units comprises a third predetermined number of types of execution units (See Lauritzen: Fig. 5, and [0115], "For example, a shader unit (e.g., graphics multiprocessor 234 of FIG. 3) may be configured to perform the functions of one or more of a vertex processing unit 504, a tessellation control processing unit 508, a tessellation evaluation processing unit 512, a geometry processing unit 516, and a fragment/pixel processing unit 524. The functions of data assembler 502, primitive assemblers 506, 514, 518, tessellation unit 510, rasterizer 522, and raster operations unit 526 may also be performed by other processing engines within a processing cluster (e.g., processing cluster 214 
wherein the second predetermined number of execution units comprises a fourth predetermined number of types of execution units, the fourth predetermined number of types of execution units being less than the third predetermined number of types of execution units (See Alsup: Figs. 4 and 16, and [0188], "In one embodiment, a register address presented to centralized fixed function units contains a Processing Element Number, a Register File bit, and a relocated register address").
Regarding claim 13, Alsup, Lauritzen, and Benthin teach all the features with respect to claim 12 as outlined above. Further, Alsup teaches that the GPU of claim 12, wherein the fourth predetermined number of types of execution units includes a floating-point-type of execution unit and an integer-processing-type of execution unit (See Alsup: Fig. 4, and [0075], "Without loss of generality, the FMAD units perform single precision floating point arithmetic 
Regarding claim 14, Alsup, Lauritzen, and Benthin teach all the features with respect to claim 12 as outlined above. Further, Lauritzen teaches that the GPU of claim 12, wherein the third predetermined number of types of execution units includes at least one of a floating-point-type of execution unit, an integer-processing-type of execution unit, a sine-function-type of execution unit, a cosine-function-type of execution number, a reciprocal-function-type of execution unit, a square-root-function-type of execution unit, and a format-conversion-type execution unit (See Lauritzen: Fig. 2C, and [0052], "The functional execution logic supports a variety of operations including integer and floating point arithmetic, comparison operations, Boolean operations, bit-shifting, and computation of various algebraic functions").
Regarding claim 15, Alsup, Lauritzen, and Benthin teach all the features with respect to claim 10 as outlined above. Further, Alsup and Lauritzen teach that the GPU of claim 10, wherein the register file comprises a vector register file (See Lauritzen: Fig. 21, and [0238], "For example, when operating on a 256-bit wide vector, the 256 bits of the vector are stored in a register and the execution unit operates on the vector as four separate 64-bit packed data elements (Quad-Word (QW) size data elements), eight separate 32-bit packed data elements (Double Word (DW) size data elements), sixteen separate 16-bit packed data elements (Word (W) size data elements), or thirty-two separate 8-bit data elements (byte (B) size data elements)") and a scalar register file (See Alsup: Fig. 4, and [0057], "Without loss of generality, each WARP is associated with 64-registers in the scalar register file").
Regarding claim 16, Alsup, Lauritzen, and Benthin teach all the features with respect to claim 15 as outlined above. Further, Alsup teaches that the GPU of claim 15, wherein the vector register file comprises two read ports and two write ports, and wherein the scalar register file comprises two read ports and two write ports (See Alsup: Fig. 10, and [0073], "Without loss of generality, a set of flip-flops known as a collector is used to sequence values out of and in to the SRAM based register file. The SRAM instance is read and written twice as wide as the desired operand or result. Over a 2 cycle period, one pair of operands is read then a successive pair of operands is read. Then over a second 2 cycle period, first one value of a pair and then the other value of the pair is delivered to an operand bus or received from the result bus by the collectors. By this means, the register file appears to have 2 ports while the SRAM has but 1 port").
Regarding claim 17, Alsup, Lauritzen, and Benthin teach all the features with respect to claim 1 as outlined above. Further, Alsup, Lauritzen, and Benthin teach that a graphics processor unit (GPU) (See Alsup: Figs. 4 and 16, and [0178], "FIG. 16 shows an example block diagram of a graphics pipeline 1600 for a graphics processor or GPU, according to an embodiment. Without loss of generality, in one embodiment a shader core comprises of 4 processing element 400 (FIG. 4) pairs and a number of fixed function units"), comprising:
at least one shader core (See Alsup: Fig. 4, and [0085], "Without loss of generality, constant scratch is shared across 4 processing elements 400 (FIG. 4) in a Shader core of a GPU"), the shader core comprising: 
a first processing element (PE) (See Alsup: Figs. 2-4 and 16, and [0056], "In one embodiment, a number of work units 305 are bound into a single hardware thread and then a 
a second PE comprising a second predetermined number of types of execution units, the second predetermined number of types of execution units being less than the first predetermined number of types of execution units (See Alsup: Figs. 4 and 16, and [0187], "In one embodiment, a graphics processing slice consists of eight processing elements 400 (FIG. 4), a number of fixed function units, and an interface to the GPU network"; and [0178], "In one embodiment, some of the fixed function units (e.g., the Load Store) are distributed with the processing element 400 pairs, while others such as Texture and Interpolation are centralized". The fixed function units may be corresponding to the second PE);
a register file shared by the first PE and the second PE (See Alsup: Figs. 4-5, and [0058], "Without loss of generality, the register file 420 (FIG. 4) contains 32KBytes of storage, which may be allocated to various WARPs. Without loss of generality, when the shader program uses 32 or fewer registers per thread, all 8 WARPs may be active simultaneously. In many embodiments, WARPs from different shaders may have different sized Register Files. Without loss of generality, the size of a given register file 420 is found in the shader header 610 (FIG. 5)"; and Fig. 5, and [0082], "In one embodiment, when the Fixed Function Specifier bit 613 (F) is set, the first trace 620-621 in a shader 600 (i.e., trace number 0 or Trace 0) contains instructions for fixed function units. These instructions run autonomously and potentially concurrently with WARP execution. If the F bit 613 is not set, then trace 0 is the first trace 620-621 to be executed by the shader program");

a warp sequencer unit (WSQ) coupled to the first PE and to the second PE (See Alsup: Fig. 4, and [0072], "Without loss of generality, in one embodiment the Instruction Store 410 contains the instruction decoder and the instruction sequencer"), the WSQ scheduling an instruction trace to execute on the first PE or the second PE based on information contained in a trace header of the instruction trace, the information contained in the trace header indicating whether the instruction trace is executable on the second PE (See Alsup: Fig. 6, and [0063], "In one example embodiment, a trace 650 (FIG. 6) is a shader program fragment and consists of a 
Regarding claim 18, Alsup, Lauritzen, and Benthin teach all the features with respect to claim 17 as outlined above. Further, Alsup teaches that the GPU of claim 17, wherein the second predetermined number of types of execution units includes a floating-point-type of execution unit and an integer-processing-type of execution unit (See Alsup: Fig. 4, and [0075], "Without loss of generality, the FMAD units perform single precision floating point arithmetic instructions. Without loss of generality, the Integer unit performs most integer arithmetic, logic operations, and memory address calculations. Without loss of generality, the BIT manipulation unit performs shifting and bit manipulation operations").
Regarding claim 19, Alsup, Lauritzen, and Benthin teach all the features with respect to claim 17 as outlined above. Further, Lauritzen teaches that the GPU of claim 17, wherein the first predetermined number of types of execution units includes at least one of a floating-point-type of execution unit, an integer-processing-type of execution unit, a sine-function-type of execution unit, a cosine-function-type of execution number, a reciprocal-function-type of execution unit, a square-root-function-type of execution unit, and a format-conversion-type execution unit (See Lauritzen: Fig. 2C, and [0052], "The functional execution logic supports a variety of operations including integer and floating point arithmetic, comparison operations, Boolean operations, bit-shifting, and computation of various algebraic functions").
Regarding claim 20, Alsup, Lauritzen, and Benthin teach all the features with respect to claim 17 as outlined above. Further, Alsup and Lauritzen teach that the GPU core of claim 17, wherein the register file comprises a vector register file (See Lauritzen: Fig. 21, and [0238], "For example, when operating on a 256-bit wide vector, the 256 bits of the vector are stored in a register and the execution unit operates on the vector as four separate 64-bit packed data elements (Quad-Word (QW) size data elements), eight separate 32-bit packed data elements (Double Word (DW) size data elements), sixteen separate 16-bit packed data elements (Word (W) size data elements), or thirty-two separate 8-bit data elements (byte (B) size data elements)") and a scalar register file (See Alsup: Fig. 4, and [0057], "Without loss of generality, each WARP is associated with 64-registers in the scalar register file"),
wherein the vector register file comprises two read ports and two write ports, and wherein the scalar register file comprises two read ports and two write ports (See Alsup: Fig. 10, and [0073], "Without loss of generality, a set of flip-flops known as a collector is used to 




Any inquiry concerning this communication or earlier communications from the examiner should be directed to GORDON G LIU whose telephone number is (571)270-0382.  The examiner can normally be reached on Monday - Friday 8:00-5:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Jennifer Mehmood, can be reached at 571-272-2976. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.







/GORDON G LIU/Primary Examiner, Art Unit 2612