Notice of Pre-AIA  or AIA  Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA .

DETAILED ACTION
This is in response to applicant’s amendment/response filed on 03/10/2021, which has been entered and made of record. Claim 9 is amended. Claims 1-20 are pending in the application.
		
Response to Arguments
Applicant’s arguments regarding the claim rejections under 35 U.S.C. 103 have been considered but are not persuasive.
	Applicant argues:

    [Image: media_image1.png]

	Examiner disagrees: First, the limitation recites “work items within a block of work items are re-ordered based on the validity of the work items”; second, the limitation does not recite “reordering of the pixels.” As Applicant explained above, the blocks of quads given in 

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 4-7, 11-13, 16-17, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Mejdrich et al. (US 2009/0150647 A1, hereinafter Mejdrich) in view of Teruyama et al. (US 2007/0182750 A1, hereinafter Teruyama).
Regarding claim 1, Mejdrich teaches:
A processing unit configured to process a plurality of tasks which each include one or more work items, wherein the work items of a task are arranged for executing instructions on respective data items, ([0004], “Another popular technique for increasing performance is to use a single instruction multiple data (SIMD) architecture, which is also referred to as `vectorizing` the data. In this manner, operations are performed on multiple data elements at the same time, and in response to the same SIMD instruction. A vector execution unit typically includes multiple processing lanes that handle different datapoints in a vector and perform similar operations on all of the datapoints at the same time. For example, for an architecture that relies on quad(4)word vectors, a vector execution unit may include four processing lanes that perform the identical operations on the four words in each vector.”) 
the data items being arranged into blocks of data items, ([0004], “For example, for an architecture that relies on quad(4)word vectors, a vector execution unit may include four processing lanes that perform the identical operations on the four words in each vector.”)
wherein blocks of work items within a task relate to respective blocks of data items, wherein if one or more of the data items within a block of data items is to be processed then all of the data items within that block of data items are scheduled for processing by the processing unit, ([0004], “For example, for an architecture that relies on quad(4)word vectors, a vector execution unit may include four processing lanes that perform the identical operations on the four words in each vector.”) and 
the processing unit comprising: 
a group of processing lanes configured to execute instructions of work items of a particular task over a plurality of processing cycles, wherein each of the processing lanes of the group is configured to execute instructions of a respective block of work items over a plurality of consecutive processing cycles; ([0010], “processing lanes can be selectively grouped together to operate as different types of vector execution units”; [0011], “a vectorizable execution unit may be provided with multiple processing lanes that in one mode, are grouped together into the same logical execution unit such that the processing lanes operate collectively as a single vector or SIMD execution unit”; [0023], “A logical execution unit, in this regard, constitutes one or more physical processing lanes defined in an execution unit, where a physical processing lane typically incorporates execution logic configured to perform one or more data processing operations, in one or more stages, responsive to an instruction provided thereto. A logical execution unit, in addition, typically is capable of receiving up to one instruction (typically a vector or scalar instruction) per cycle, although if the processing lanes incorporated in a logical execution unit are pipelined, multiple instructions may be at different stages of execution in a logical execution unit at any given time. Where a given mode of a vectorizable execution unit organizes the processing lanes into multiple logical execution units, those units are typically capable of being operated independently and in parallel with one another”)
However, Mejdrich does not teach:
wherein one or more of the blocks of work items include at least one invalid work item,
a control module configured to assemble the work items into the tasks so that work items within a block of work items are re-ordered based on the validity of the work items.
	On the other hand, Teruyama teaches:
wherein one or more of the blocks of work items include at least one invalid work item, ([0138], “The quad merge operation is such a process as described below with reference to FIG. 13. FIG. 13 is a conceptual drawing of a quad merge operation. The quad merge operation involves merging two temporally successive stamps with the same XY coordinates into one stamp. By the quad merge, valid quads in two stamps can be compounded into one stamp and can be processed at a time. Thus, the amount of data to be subjected to the rendering process can be compressed.”)
a control module configured to assemble the work items into the tasks so that work items within a block of work items are re-ordered based on the validity of the work items. ([0138]-[0139], “The quad merge operation is such a process as described below with reference to FIG. 13. FIG. 13 is a conceptual drawing of a quad merge operation. The quad merge operation involves merging two temporally successive stamps with the same XY coordinates into one stamp. By the quad merge, valid quads in two stamps can be compounded into one stamp and can be processed at a time. Thus, the amount of data to be subjected to the rendering process can be compressed. As shown in FIG. 13, the four quads contained in one stamp are hereinafter referred to as quads Q0 to Q3. It is assumed that first, the stamp 1 in which the quads Q0 and Q2 are valid, whereas the quads Q1 and Q3 are invalid is input to the instruction control unit and that the stamp 2 in which the quads Q1 and Q2 are valid, whereas the quads Q0 and Q3 are invalid is subsequently input to the instruction control unit. In this case, the two stamps 1 and 2 are merged to generate a new stamp containing the quads Q0 and Q2 of the stamp 1 and the quads Q1 and Q2 of the stamp 2. This process is the quad merge 
	Mejdrich teaches a processing unit that groups work items on data items and executes them together in a processing cycle. However, Mejdrich does not explicitly consider the situation in which some work items may have invalid data. Teruyama teaches a processing unit that handles this situation by merging the valid data items together and disabling the execution of work items on invalid data.
	It would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to have combined the teaching of Mejdrich with the data item merging teaching of Teruyama. The benefit would be to avoid executing work items on invalid data, thereby improving system response time.
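
For illustration only, the quad merge quoted from Teruyama [0138]-[0139] can be sketched as follows. This is a minimal sketch, not Teruyama's implementation: the function name quad_merge, the (stamp, quad) tuple representation, and the packing order of the merged stamp are assumptions made here; only the four-quad stamp size and the example validity pattern come from the quoted passage.

```python
from typing import List, Optional, Tuple

# (source stamp number, quad index within that stamp); a hypothetical representation
Quad = Tuple[int, int]

STAMP_CAPACITY = 4  # the quoted passage describes four quads (Q0 to Q3) per stamp


def quad_merge(stamp1_valid: List[bool], stamp2_valid: List[bool]) -> Optional[List[Quad]]:
    """Pack the valid quads of two successive stamps into one stamp.

    Returns the list of valid quads in the merged stamp, or None when the
    valid quads of both stamps do not fit into a single stamp.
    """
    merged = [(1, q) for q, ok in enumerate(stamp1_valid) if ok]
    merged += [(2, q) for q, ok in enumerate(stamp2_valid) if ok]
    if len(merged) > STAMP_CAPACITY:
        return None  # no merge possible; the stamps stay separate
    return merged


# Stamp 1: Q0 and Q2 valid. Stamp 2: Q1 and Q2 valid (the example in [0139]).
print(quad_merge([True, False, True, False], [False, True, True, False]))
# [(1, 0), (1, 2), (2, 1), (2, 2)]
```

In this sketch the merged stamp carries only valid quads, so no processing slot is spent on an invalid quad.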

Regarding claim 4, Mejdrich in view of Teruyama teaches:
The processing unit of claim 1, wherein the control module is configured to set indicators to indicate how the work items have been assembled into the tasks. (Teruyama, [0117], “For example, in FIG. 6, a stamp with STID=15 is inside the triangle. Accordingly, all the pixels contained in this stamp need to be drawn. However, for example, for a stamp with STID=7, pixels with PIXIDs=0 to 8, 12, 13, and 15 are outside the triangle and need not be drawn. Only the pixels with PIXIDs=9 to 11 and 14 need to be drawn. Thus, pixels that need to be drawn are hereinafter referred to as "valid" pixels, whereas pixels that need not be drawn are hereinafter referred to as "invalid" pixels.”; [0138]-[0139], “The quad merge operation is 

Regarding claim 5, Mejdrich in view of Teruyama teaches:
The processing unit of claim 4, further comprising: a store configured to store the processed data items output from the group of processing lanes; ([0034], “A given GPCs 208 may process data to be written to any of the DRAMs 220 within PP memory 204.”) and storing logic configured to determine addresses for storing the processed data items in the store based on the indicators. ([0043], “Each GPC 208 may have an associated memory management unit (MMU) 320 that is configured to map virtual addresses into physical addresses. In various embodiments, MMU 320 may reside either within GPC 208 or within the memory interface 214. The MMU 320 includes a set of page table entries (PTEs) used to map a virtual address to a physical address of a tile or memory page and optionally a cache line index. The MMU 320 may include address translation lookaside buffers (TLB) or caches that may reside within SMs 310, within one or more L1 caches, or within GPC 208.” Teruyama teaches using an indicator to indicate whether a data item is valid and grouping data based on validity. The combination rationale of claim 1 is applied here. It is well known in the art that similar items are grouped and saved together to improve processing speed. It would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to have combined the teaching of Mejdrich in view of Teruyama with this well-known practice to save data based on its indicator. The benefit would be to improve processing speed.)

Regarding claim 6, Mejdrich in view of Teruyama teaches:
The processing unit of claim 1, wherein the control module is configured to assemble the work items into the tasks such that work items of a block of work items are grouped together into the same task. (Teruyama, [0117], “For example, in FIG. 6, a stamp with STID=15 is inside the triangle. Accordingly, all the pixels contained in this stamp need to be drawn. However, for example, for a stamp with STID=7, pixels with PIXIDs=0 to 8, 12, 13, and 15 are outside the triangle and need not be drawn. Only the pixels with PIXIDs=9 to 11 and 14 need to be drawn. Thus, pixels that need to be drawn are hereinafter referred to as "valid" pixels, 

Regarding claim 7, Mejdrich in view of Teruyama teaches:
The processing unit of claim 1, wherein the control module is configured to assemble the work items into the tasks so that work items within a block of work items are re-ordered to thereby align the invalid work items from different blocks of work items within a task. 

Regarding claim 11, Mejdrich in view of Teruyama teaches:
The processing unit of claim 1, wherein the control module is configured to assemble the work items into the tasks based on the validity of the work items such that invalid work items of the particular task are temporally aligned across the group of processing lanes. (Teruyama, [0138]-[0139], “The quad merge operation is such a process as described below 

Regarding claim 12, Mejdrich in view of Teruyama teaches:
The processing unit of claim 1, wherein there are more than two levels of validity for the work items, and wherein the control module is configured to assemble the work items into the tasks, based on the validity of the work items, so that work items of the particular task which have the same level of validity are temporally aligned across the group of processing lanes. (Teruyama, [0138]-[0139], “The quad merge operation is such a process as 


Regarding claim 13, Mejdrich in view of Teruyama teaches:
The processing unit of claim 1, wherein the data items are pixel values, and wherein the blocks of data items are pixel quads.( Mejdrich, [0004], “For example, for an architecture that relies on quad(4)word vectors, a vector execution unit may include four processing lanes that perform the identical operations on the four words in each vector.” [0007], “One such 

Regarding claim 16, Mejdrich in view of Teruyama teaches:
The processing unit of claim 1, wherein invalid work items relate to invalid data items. (Teruyama, [0117], “For example, in FIG. 6, a stamp with STID=15 is inside the triangle. Accordingly, all the pixels contained in this stamp need to be drawn. However, for example, for a stamp with STID=7, pixels with PIXIDs=0 to 8, 12, 13, and 15 are outside the triangle and need not be drawn. Only the pixels with PIXIDs=9 to 11 and 14 need to be drawn. Thus, pixels that need to be drawn are hereinafter referred to as "valid" pixels, whereas pixels that need not be drawn are hereinafter referred to as "invalid" pixels.” The combination rationale of claim 1 is incorporated here.)

	Claim 17 recites limitations similar to those of claim 1, in the form of a method, and is thus rejected under the same rationale.

Claim 19 recites limitations similar to those of claim 11, in the form of a method, and is thus rejected under the same rationale.

Regarding claim 20, Mejdrich in view of Teruyama teaches:
A non-transitory computer readable storage medium having stored thereon processor executable instructions that when executed cause at least one integrated circuit manufacturing system to generate a processing unit which is configured to  (Mejdrich, claim 14, “A program product comprising a computer readable medium and logic definition program code resident on the computer readable medium and defining the circuit arrangement of claim 1.”)
The rest of claim 20 recites limitations similar to those of claim 1 and is thus rejected under the same rationale.

Claims 2-3 and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Mejdrich in view of Teruyama and further in view of Clery, III (US 6079008, hereinafter Clery).
Regarding claim 2, Mejdrich in view of Teruyama teaches:
The processing unit of claim 1, further comprising a logic module coupled to the group of processing lanes (Mejdrich, [0023], “Where a given mode of a vectorizable execution unit organizes the processing lanes into multiple logical execution units, those units are typically capable of being operated independently and in parallel with one another”) configured to 
However, Mejdrich in view of Teruyama does not teach:
cause the group of processing lanes to skip the execution of a set of invalid work items if the set of invalid work items are the only work items scheduled for execution over the group of processing lanes in a processing cycle.
On the other hand, Clery teaches:
cause the group of processing lanes to skip the execution of a set of invalid work items if the set of invalid work items are the only work items scheduled for execution over the group of processing lanes in a processing cycle. (Clery, column 23, 1st para, “As a thread repeatedly cycles, it determines whether or not the direct memory access (DMA) operation has provided any new data. If no sampled data is available in a processing unit 14 for that cycle of the thread, the processing unit skips execution on that cycle.”)
Mejdrich in view of Teruyama teaches that, in order to improve performance, a quadword is loaded into a vector execution unit, which includes four processing lanes that perform identical operations on the four words in each vector. Mejdrich in view of Teruyama also teaches grouping invalid work items together. Clery teaches that, in order to improve performance, a processing lane can skip a particular processing cycle if only invalid work items are scheduled in that particular processing cycle.
It would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to have combined the SIMD processing unit of Mejdrich in view of Teruyama with the cycle-skipping method of Clery, so that a logic module coupled to the groups of processing lanes is configured to cause a particular group of processing lanes to skip a particular processing cycle if only invalid work items are scheduled for execution in any of the processing lanes of the particular group in the particular processing cycle. The motivation is to improve the overall performance of the SIMD processing unit.
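
For illustration only, the cycle-skipping condition discussed above can be sketched as follows. This is a minimal sketch under stated assumptions: the function name cycles_to_execute and the validity[lane][cycle] table are hypothetical and appear in none of the cited references; the sketch only shows that a cycle is skipped when no lane of the group holds a valid work item in that cycle.

```python
from typing import List


def cycles_to_execute(validity: List[List[bool]]) -> List[int]:
    """Return the processing cycles that a group of lanes must execute.

    validity[lane][cycle] is True when that lane holds a valid work item in
    that cycle (a hypothetical layout). A cycle is skipped only when every
    lane of the group holds an invalid work item in it.
    """
    if not validity:
        return []
    num_cycles = len(validity[0])
    return [c for c in range(num_cycles) if any(lane[c] for lane in validity)]


# Two lanes, four cycles: the invalid work items line up only in cycle 3.
schedule = [
    [True, True, True, False],
    [True, True, False, False],
]
print(cycles_to_execute(schedule))  # [0, 1, 2]
```

Cycle 2 still executes because one lane holds a valid work item, which is why aligning invalid work items across the lanes matters for the skip to apply.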

Regarding claim 3, Mejdrich in view of Teruyama teaches:
The processing unit of claim 1, further comprising a logic module coupled to the group of processing lanes (Mejdrich, [0023], “Where a given mode of a vectorizable execution unit organizes the processing lanes into multiple logical execution units, those units are typically capable of being operated independently and in parallel with one another”) configured to 
However, Mejdrich in view of Teruyama does not teach:
cause the group of processing lanes to skip a processing cycle if there are no valid work items scheduled for execution over the group of processing lanes in the processing cycle.
On the other hand, Clery teaches:
cause the group of processing lanes to skip a processing cycle if there are no valid work items scheduled for execution over the group of processing lanes in the processing cycle. (Clery, column 23, 1st para, “As a thread repeatedly cycles, it determines whether or not the direct memory access (DMA) operation has provided any new data. If no sampled data is available in a processing unit 14 for that cycle of the thread, the processing unit skips execution on that cycle.”)
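Mejdrich in view of Teruyama teaches that, in order to improve performance, a quadword is loaded into a vector execution unit, which includes four processing lanes that perform identical operations on the four words in each vector. Mejdrich in view of Teruyama also teaches grouping invalid work items together. Clery teaches that, in order to improve performance, a processing lane can skip a particular processing cycle if there is no valid work item scheduled in that particular processing cycle.
It would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to have combined the SIMD processing unit of Mejdrich in view of Teruyama with the cycle-skipping method of Clery, so that a logic module coupled to the groups of processing lanes is configured to cause a particular group of processing lanes to skip a particular processing cycle if there are no valid work items scheduled for execution in any of the processing lanes of the particular group in the particular processing cycle. The motivation is to improve the overall performance of the SIMD processing unit.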

Claim 18 recites limitations similar to those of claim 3, in the form of a method, and is thus rejected under the same rationale.

Claim 8 is rejected under 35 U.S.C. 103 as being unpatentable over Mejdrich in view of Teruyama and further in view of Gschwind et al. (US 2007/0186077 A1, hereinafter Gschwind).
Regarding claim 8, Mejdrich in view of Teruyama teaches:
The processing unit of claim 1, 
However, Mejdrich in view of Teruyama does not teach:
wherein the control module is configured to re-order work items within a block of work items by performing at least one of a rotation operation and a swapping operation of the work items within the block of work items.
On the other hand, Gschwind teaches:
wherein the control module is configured to re-order work items within a block of work items by performing at least one of a rotation operation and a swapping operation of the work items within the block of work items. ([0174], “In accordance with a preferred code generation method, a compiler rotates or shifts scalar data items in a common slot position. In one embodiment, this is the preferred slot. In accordance with another embodiment, the preferred slot is chosen to be the leftmost word slot, allowing the ability to rotate words into the preferred slot with a single quadword rotate instruction using low-order address bits (stored in the preferred slot of a vector register) to specify the rotate count for word data. Those skilled in the art will appreciate the ability to adapt concepts of the preferred slot to other locations within a vector, and appropriate alignment rotate or shift sequences accordingly.” Abstract: “A processor architecture uses a vector register file, a shared data path, and instruction execution logic to process both single instruction multiple data (SIMD) instruction and scalar instructions. The processor architecture divides a vector into four "slots," each including four bytes, and locates scalar data in "preferred slots" to ensure proper positioning. Instructions using the preferred slot mechanism include 1) shift and rotate instructions operating across an entire quad-word that specify a shift amount, 2) memory load and store instructions that require an address, and 3) branch instructions that use the preferred slot for branch conditions (conditional branches) and branch addresses (register-indirect branches). As a result, the processor architecture eliminates the requirement for separate issue slots, separate pipelines, and the control complexity for separate scalar units.”)
Mejdrich in view of Teruyama teaches a processing unit that groups work items related to data items. Gschwind teaches a specific way (rotation) to group work items.
It would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to have combined the teaching of Mejdrich in view of .
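
For illustration only, a rotation of the items within a four-item block can be sketched as follows. This is a minimal sketch under stated assumptions: the function name rotate_block, the parallel items/valid lists, and the choice to move the first invalid item to a target slot are hypothetical and are not taken from Gschwind or from the claims; collections.deque.rotate is used simply as a convenient rotation primitive.

```python
from collections import deque
from typing import List, Sequence, Tuple


def rotate_block(items: Sequence[str], valid: Sequence[bool],
                 target_slot: int) -> Tuple[List[str], List[bool]]:
    """Rotate a block so that its first invalid item lands in target_slot.

    The block is returned unchanged when every item is valid.
    """
    if all(valid):
        return list(items), list(valid)
    first_invalid = list(valid).index(False)
    shift = target_slot - first_invalid  # positive rotates right, negative left
    rotated_items, rotated_valid = deque(items), deque(valid)
    rotated_items.rotate(shift)
    rotated_valid.rotate(shift)
    return list(rotated_items), list(rotated_valid)


# Block "abcd" with item "b" invalid; move the invalid item to slot 3.
print(rotate_block(["a", "b", "c", "d"], [True, False, True, True], 3))
# (['c', 'd', 'a', 'b'], [True, True, True, False])
```

Applying the same target slot to every block would place the invalid items of different blocks in the same position, which is the kind of alignment the rotation is meant to achieve in this sketch.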

Claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over Mejdrich in view of Teruyama and further in view of Wildman (US 2008/0209164 A1).
Regarding claim 15, Mejdrich in view of Teruyama teaches:
The processing unit of claim 1, wherein the processing unit is a single instruction multiple data (SIMD) processing unit, and wherein the work items of a task are arranged for executing a common sequence of instructions on respective data items. (Mejdrich, page 1 [0004], “Another popular technique for increasing performance is to use a single instruction multiple data (SIMD) architecture, which is also referred to as `vectorizing` the data. In this manner, operations are performed on multiple data elements at the same time, and in response to the same SIMD instruction. A vector execution unit typically includes multiple processing lanes that handle different datapoints in a vector and perform similar operations on all of the datapoints at the same time. For example, for an architecture that relies on quad(4)word vectors, a vector execution unit may include four processing lanes that perform the identical operations on the four words in each vector.”) and 
However, Mejdrich does not teach:
wherein each of the plurality of tasks includes up to a predetermined maximum number of work items,
On the other hand, Wildman teaches:
wherein each of the plurality of tasks includes up to a predetermined maximum number of work items, (Wildman, page 2 [0040], “FIG. 4 illustrates a register file 42 for use in a processing element which includes an execution unit which operates on data stored in the register file, and which is able to process multiple instruction threads. The register file can also be used in the serial processor 10. Such a register file embodies another aspect of the invention. The parallel processor 15 can process a predetermined maximum number of instruction streams (threads). The register file 42 is provided with a set of registers for each such thread.”)
In combining the teachings of Mejdrich with those of Wildman, since in SIMD a task may be processed in a processor, the task in Mejdrich is replaced by the processor in Wildman.
It would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to have combined the SIMD processing unit of Mejdrich with the predetermined maximum number of work items of Wildman, so as to limit the number of work items in a vector execution unit to no more than the predetermined maximum number. The motivation is to consider and accommodate the hardware limitation when implementing the SIMD processing unit. 
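
For illustration only, capping each task at a predetermined maximum number of work items can be sketched as follows. This is a minimal sketch under stated assumptions: the function name assemble_tasks and the simple in-order chunking are hypothetical and are not taken from Mejdrich or Wildman.

```python
from typing import List, Sequence


def assemble_tasks(work_items: Sequence[int], max_items: int) -> List[List[int]]:
    """Split work items, in order, into tasks of at most max_items each."""
    if max_items < 1:
        raise ValueError("max_items must be at least 1")
    return [list(work_items[i:i + max_items])
            for i in range(0, len(work_items), max_items)]


# Ten work items with a maximum of four per task: the last task is not full.
print(assemble_tasks(list(range(10)), 4))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```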


Allowable Subject Matter
Claims 9-10 and 14 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
The following is a statement of reasons for the indication of allowable subject matter:  
	None of the references of record teaches or renders obvious the limitations of “control module is configured to set indicators to indicate how the work items have been assembled into the tasks, wherein the control module is configured to set a respective indicator for a plurality of the blocks of work items to indicate the order of the work items within the plurality of blocks of work items,” recited in claim 9.
	None of the references of record teaches or renders obvious the limitations of “wherein some of the tasks comprise fewer than a predetermined maximum number of work items, said processing unit further comprising: a plurality of groups of processing lanes, each group being configured to execute instructions of work items of a respective task in parallel over a plurality of processing cycles; and a logic module coupled to the groups of processing lanes configured to cause a particular group of processing lanes to skip a particular processing cycle, independently of the other groups of processing lanes, if there are no work items scheduled for execution in any of the processing lanes of the particular group in the particular processing cycle,” recited in claim 14.

Comments
The obviousness-type double patenting rejections over patents 10679319, 10311539, 9250961, and 9513963 are overcome in view of the Terminal Disclaimer filed and approved on 11/20/2020 and 11/24/2020.
Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to YANNA WU whose telephone number is (571)270-0725.  The examiner can normally be reached on Monday-Thursday 8:00-5:30 ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/YANNA WU/Primary Examiner, Art Unit 2611