DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Amendment

The amendment filed on February 8, 2022 has been entered. Claims 21-40 are now pending in the application. Applicant's amendments have addressed all informalities previously set forth in the non-final Office action mailed on September 8, 2021.
Response to Arguments
Applicant’s arguments, see page 7, filed February 8, 2022, with respect to the claim objections have been fully considered and are persuasive. The claim objections have been withdrawn in view of the current claim amendments.
Applicant’s arguments, see page 7, filed February 8, 2022, with respect to the 35 U.S.C. 112(b) rejections have been fully considered and are persuasive. The 35 U.S.C. 112(b) rejections have been withdrawn in view of the current claim amendments.
Applicant’s arguments, see page 7, filed February 8, 2022, with respect to the obviousness-type double patenting rejections have been fully considered and are persuasive. The obviousness-type double patenting rejections have been withdrawn in view of the current claim amendments.
Applicant’s arguments, see pages 7-14, filed February 8, 2022, with respect to the rejections of previous claims 1-20 under 35 U.S.C. 103 have been fully considered and are persuasive. Therefore, the rejection has been withdrawn. However, upon further consideration, a new ground of rejection is made in view of Rogers (US 2012/0139930 A1).

Regarding independent claim 21, McCrary was previously cited because it discloses apparatus and methods for scheduling and executing commands issued by a first processor, such as a CPU, on a second processor, such as a GPU. The GPU may also receive a priority ordering of the one or more buffers, where the selecting is further based on the received priority ordering. By performing prioritization and scheduling of commands in the GPU, system performance is enhanced (see abstract).
Hartog was previously cited because it describes a method including receiving a command to schedule a first process and selecting a command queue associated with the first process (see abstract).
Soliman was previously cited because it discloses a new processor architecture, called Mat-Core, based on the use of a multi-level ISA to explicitly communicate data parallelism to a processor in a compact way, instead of dynamic extraction using complex hardware or static extraction using sophisticated compiler techniques. The performance of element-wise vector-vector addition, vector-matrix multiplication, and matrix-matrix multiplication is estimated on the decoupled Mat-Core processor. On a Multi-Mat-Core processor, performance would be improved by processing threads of code in parallel using multi-threading techniques (see abstract).
Regarding Applicant's arguments on pages 8-10 that the cited McCrary, Hartog, and Soliman references do not disclose the amended limitation “wherein, based at least in part on a determination that a procedure is ready for processing, the scalar unit is configured to set up, independent of any external processor, the SIMD unit to execute the procedure by the scalar unit executing a persistent compute kernel until receiving a notification to stop executing the persistent compute kernel”, the Examiner agrees. However, Rogers has now been cited because it teaches a method of processing commands that includes holding commands in queues and executing the commands in an order based on their respective priority (see abstract).
With regard to the amended limitation, Rogers discloses that a command processor 310 retrieves commands from ring buffers 304-308 and sends them to processing core 312 for execution. RLC 311, under the direction of kernel mode driver 214, controls command processor 310 to switch between ring buffers of ring buffers 304-308…. In alternate embodiments, RLC 311 controls command processor 310 to retrieve commands from ring buffers 304-308 according to other schemes. For example, command processor 310 can retrieve all commands from a higher priority buffer before moving on to retrieve commands from a lower priority buffer. The commands sent to the higher priority queue are interpreted as messages sent to the persistent queue and mapped to persistent sub-tasks targeting a GPU coprocessor (i.e., SIMD), because command processor 310 sends the commands for execution within the processing core (see paragraphs [0040]-[0041]), as further detailed in the rejections below.
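For illustration only, the retrieval scheme quoted above from Rogers paragraph [0040] (all commands are retrieved from a higher priority buffer before any are retrieved from a lower priority buffer) can be modeled as a strict-priority drain of the ring buffers. The sketch below is an assumption-laden model, not Rogers' implementation: each ring buffer is represented as a Python deque, and ring buffer 304 is treated as the high priority buffer (consistent with paragraph [0041]), while treating 306 and 308 as the mid and low priority buffers, and the function name drain_by_priority, are hypothetical.

```python
from collections import deque


def drain_by_priority(ring_buffers):
    """Return the order in which commands would be sent to the processing core.

    ring_buffers: list of deques ordered from highest to lowest priority
    (hypothetical stand-ins for ring buffers 304, 306, and 308).
    """
    dispatched = []
    for buffer in ring_buffers:  # visit the highest priority buffer first
        while buffer:  # retrieve every command before moving to the next buffer
            dispatched.append(buffer.popleft())
    return dispatched


high = deque(["compute-1", "compute-2"])
mid = deque(["render-1"])
low = deque(["background-1"])
print(drain_by_priority([high, mid, low]))
# ['compute-1', 'compute-2', 'render-1', 'background-1']
```

This static model does not consider commands that arrive after retrieval begins; that case is addressed by the preemption described in Rogers paragraph [0041].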
Regarding dependent claim 23, with respect to Applicant's arguments on pages 10-11 that the cited McCrary and Hartog references do not disclose the limitation “wherein the scalar unit is further configured, independent of any external processor by executing the persistent compute kernel, to select the procedure from a plurality of procedures for execution on the SIMD unit responsive to determining that: a message is stored in the persistent queue, wherein the persistent queue is further configured to store messages and the message maps to an event that specifies executing the procedure wherein the procedure comprises one or more sub-tasks selected for execution based on a mapping of the message to the event”, the Examiner agrees. However, Rogers has been cited because it teaches that command processor 310 can retrieve all commands from a higher priority buffer before moving on to retrieve commands from a lower priority buffer…. command processor 310 and RLC 311 form a multithreaded system that can monitor the status of more than one ring buffer. Command processor 310 can be further configured to preempt command buffers being executed on processing core 312. Rogers further details that command processor 310 allows a command that is currently being executed to be completed, but prevents the execution of the next command in the command buffer from starting so that processor core 312 can execute the newly received high priority command (or command buffer). This is interpreted as the execution of the commands, or sub-tasks, of the high priority queue by the GPU coprocessor or SIMD unit (see paragraphs [0040]-[0041]), as further detailed in the rejections below.
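The preemption behavior quoted above from Rogers paragraph [0041] can likewise be sketched under stated assumptions. In this hypothetical model (the function run_with_boundary_preemption and its arguments are illustrative names, not taken from Rogers), the command already executing is always allowed to finish, and a newly received high priority command is dispatched before the next command of the mid priority command buffer.

```python
from collections import deque


def run_with_boundary_preemption(current_buffer, arrivals):
    """Return the order in which commands reach the processing core.

    current_buffer: deque of commands from the mid priority command buffer
        that is currently being executed.
    arrivals: dict mapping a dispatch step to a high priority command that is
        entered into the high priority ring buffer just before that step,
        i.e. while the previous command is still executing.
    Arrivals are only checked while work remains in either buffer.
    """
    high = deque()
    executed = []
    step = 0
    while current_buffer or high:
        if step in arrivals:
            high.append(arrivals[step])
        if high:
            # The previous command has completed; the next mid priority
            # command is held back so the high priority command runs first.
            executed.append(high.popleft())
        else:
            executed.append(current_buffer.popleft())
        step += 1
    return executed


order = run_with_boundary_preemption(deque(["mid-1", "mid-2", "mid-3"]), {1: "high-1"})
print(order)  # ['mid-1', 'high-1', 'mid-2', 'mid-3']
```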
Regarding dependent claim 27, with respect to Applicant's arguments on pages 11-12 that the cited McCrary and Hartog references do not disclose the limitation “wherein the scalar unit is further configured to issue more than one instruction in a single clock cycle to the SIMD unit”, the Examiner agrees. However, Soliman has been cited because it teaches that only three elements per register bank are available (two for reading and one for writing) during each clock cycle, so that a register bank can be used concurrently by at most three instructions, as the destination for one and the source of the other two. The concurrent processing of instructions within each clock cycle is interpreted as issuing more than one instruction in a single clock cycle, as performed in the Mat-Core processing architecture (see the section “The Microarchitecture of the MAT-Core Processor”, continued on page 97), as further detailed in the rejections below.
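To make the quoted Soliman constraint concrete, the check below models the stated per-cycle limit of two reads and one write per register bank for a group of instructions issued in the same clock cycle. It is a hypothetical sketch: the tuple encoding of instructions, the bank names, and the function name fits_register_bank_ports are assumptions, and it does not model any other issue constraint of the Mat-Core microarchitecture.

```python
from collections import Counter

# Per-bank limits taken from the quoted Soliman passage: two elements can be
# read and one element written per register bank in each clock cycle.
MAX_READS_PER_BANK = 2
MAX_WRITES_PER_BANK = 1


def fits_register_bank_ports(instructions):
    """Return True if instructions issued in the same clock cycle stay within
    the per-bank read and write limits.

    Each instruction is a (dest_bank, [src_bank, ...]) tuple.
    """
    reads = Counter()
    writes = Counter()
    for dest_bank, src_banks in instructions:
        writes[dest_bank] += 1
        reads.update(src_banks)  # each source operand uses one read port
    return (all(n <= MAX_READS_PER_BANK for n in reads.values())
            and all(n <= MAX_WRITES_PER_BANK for n in writes.values()))


# Bank A is the destination of one instruction and a source of the other two.
print(fits_register_bank_ports([("A", ["B"]), ("C", ["A"]), ("D", ["A"])]))  # True
# A third reader of bank A exceeds the two read ports.
print(fits_register_bank_ports([("B", ["A"]), ("C", ["A"]), ("D", ["A"])]))  # False
```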
Regarding independent claims 28 and 34, these claims recite limitations similar in scope to those of claim 21, and therefore remain rejected under the same rationale as provided above and further detailed in the rejections below.
Regarding dependent claims 22, 24-26, 29-33, and 35-40, these claims depend from rejected base claims 21, 28, and 34, and therefore remain rejected under the same rationale as provided above and further detailed in the rejections below.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains.  Patentability shall not be negated by the manner in which the invention was made.


The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 21, 22, 26, 28, 29, 33-35, and 39 are rejected under 35 U.S.C. 103 as being unpatentable over McCrary (US 2011/0050713 A1, hereinafter referenced “McCrary”) in view of Rogers (US 2012/0139930 A1, hereinafter referenced “Rogers”).


1-20. (Cancelled).  
  
Regarding claim 21 (New), McCrary discloses an apparatus (McCrary, Abstract) comprising:
-a single instruction multiple data (SIMD) unit comprising a plurality of processing elements configured to concurrently execute respective data items (McCrary, paragraphs [0004] and [0031]; Reference at paragraph [0031] discloses use of a SIMD processor that works together with RLC 140, the SIMD containing ALUs and processing units for performing geometry and vertex calculations (interpreted as a plurality of processing units for concurrent data execution));
-and a scalar unit (McCrary, paragraphs [0004] and [0031]; Reference at paragraph [0031] discloses use of a SIMD processor that works together with RLC 140, the SIMD containing ALUs and processing units for performing geometry and vertex calculations (interpreted as a scalar unit));

McCrary does not explicitly disclose, but Rogers teaches
-wherein, based at least in part on a determination that a procedure is ready for processing, the scalar unit is configured to set up, independent of any external processor, the SIMD unit to execute the procedure by the scalar unit executing a persistent compute kernel until receiving a notification to stop executing the persistent compute kernel (Rogers, paragraphs [0040]-[0041]; Reference discloses that a command processor 310 (i.e. internal processor) retrieves commands from ring buffers 304-308 and sends them to processing core 312 for execution. RLC 311, under the direction of kernel mode driver 214 (i.e. persistent compute kernel), controls command processor 310 to switch between ring buffers of ring buffers 304-308 (i.e. procedure is ready)…. In alternate embodiments, RLC 311 controls command processor 310 to retrieve commands from ring buffers 304-308 according to other schemes. For example, command processor 310 can retrieve all commands from a higher priority buffer before moving on to retrieve commands from a lower priority buffer (interpreted as the concept of a persistent compute kernel being executed)).
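For illustration, the claimed arrangement can be read as a loop that the scalar unit runs until it is told to stop: it waits for a message in the persistent queue, maps the message to the sub-tasks of a procedure, and sends each instruction to the SIMD unit without involving an external processor. The following is a hypothetical sketch of that reading of the claim language only; it is not Applicant's implementation and is not taken from McCrary or Rogers, and the names persistent_kernel, event_table, simd_execute, and STOP are assumptions. A standard library queue.Queue stands in for the persistent queue, and its blocking get() stands in for monitoring the queue.

```python
import queue

STOP = "stop"  # hypothetical notification to stop executing the kernel


def persistent_kernel(persistent_queue, event_table, simd_execute):
    """Run until a STOP message is received.

    persistent_queue: queue.Queue of messages (stand-in for the claimed
        persistent queue).
    event_table: dict mapping a message to a list of sub-tasks, each a list
        of instructions (stand-in for the message-to-event mapping).
    simd_execute: callable standing in for sending one instruction to the
        SIMD unit.
    """
    while True:
        message = persistent_queue.get()  # block until a message is stored
        if message == STOP:
            break
        # Messages with no mapped event are ignored in this sketch.
        for sub_task in event_table.get(message, []):
            for instruction in sub_task:  # fetch the procedure's instructions
                simd_execute(instruction)  # and send each one to the SIMD unit


issued = []
q = queue.Queue()
q.put("filter")
q.put(STOP)
persistent_kernel(q, {"filter": [["load", "mul"], ["store"]]}, issued.append)
print(issued)  # ['load', 'mul', 'store']
```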

Regarding claim 22 (New), McCrary in view of Rogers teaches the apparatus as recited in claim 21.
McCrary does not explicitly disclose, but Rogers teaches
-wherein, by executing the persistent compute kernel, the scalar unit is further configured to monitor a persistent queue (Rogers, paragraph [0040]; Reference discloses that command processor 310 can retrieve all commands from a higher priority buffer before moving on to retrieve commands from a lower priority buffer…. command processor 310 and RLC 311 (i.e. a scalar unit) form a multithreaded system that can monitor the status of more than one ring buffer) configured to store at least data for execution of procedures by the SIMD unit (Rogers, paragraphs [0010], [0034], and [0039]; Reference at paragraph [0010] discloses a processing device having multiple queues, in which a command processor retrieves commands from the set of queues, including a high priority queue holding high priority commands. Paragraphs [0034] and [0039] describe the use of a kernel mode driver for retrieving commands of the virtual processing devices (interpreted as implementing a compute kernel), and paragraph [0039] details: “For example, high priority commands (e.g., computational commands), mid priority commands (e.g., rendering commands), and low priority commands (e.g., background commands) can be held in queues of virtual devices 208, 210, and 212, respectively. Unlike GPU 204, however, GPU 302 includes multiple ring buffers (i.e. queues) to receive commands from CPU 202. In an embodiment, GPU 302 includes a ring buffer for each priority type” (i.e. persistent queue, as the commands are executed based on priority)).
McCrary and Rogers are combinable because they are in the same field of endeavor, namely processor task management. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the hardware-based GPU scheduling method of McCrary to include the priority-based command execution method of Rogers. McCrary provides a method for scheduling and executing commands issued by a first processor, such as a CPU, on a second processor, such as a GPU, and enhances system performance through prioritization and scheduling of commands by the GPU. Incorporating the priority-based command execution of Rogers would provide a command processor that can access multiple queues in which commands are executed based on high, mid, and low priority, allowing priority commands to be obtained within a desirable latency and thereby reducing latency in graphics processing systems such as those taught in McCrary.

Regarding claim 23 (New), McCrary in view of Rogers teaches the apparatus as recited in claim 22.
McCrary does not explicitly disclose, but Rogers teaches
-wherein the scalar unit is further configured, independent of any external processor by executing the persistent compute kernel, to select the procedure from a plurality of procedures for execution on the SIMD unit responsive to determining that: a message is stored in the persistent queue, wherein the persistent queue is further configured to store messages (Rogers, paragraph [0041]; Reference discloses that command processor 310 can be further configured to preempt command buffers being executed on processing core 312. For example, if processing core 312 is executing commands included in a command buffer having a mid priority and command processor 310 can determine that a high priority command or command buffer has been entered into ring buffer 304, command processor 310 can preempt the command buffer being executed on processing core 312. Determining that the high priority command has been entered into the ring buffer is interpreted as detecting the message in the persistent queue, the commands being an identified first sub-task to be executed by processing core 312, which operates alongside command processor 310);
-and the message maps to an event that specifies executing the procedure (Rogers, paragraph [0040]; Reference discloses that command processor 310 retrieves commands from ring buffers 304-308 and sends them to processing core 312 for execution. RLC 311, under the direction of kernel mode driver 214, controls command processor 310 to switch between ring buffers of ring buffers 304-308…. In alternate embodiments, RLC 311 controls command processor 310 to retrieve commands from ring buffers 304-308 according to other schemes. For example, command processor 310 can retrieve all commands from a higher priority buffer before moving on to retrieve commands from a lower priority buffer. The commands sent to the higher priority queue are interpreted as messages sent to the persistent queue and mapped to persistent sub-tasks or events targeting the GPU coprocessor, because command processor 310 sends the commands for execution within the processing core), wherein the procedure comprises one or more sub-tasks selected for execution based on a mapping of the message to the event (Rogers, paragraph [0041]; Reference discloses: “Specifically, command processor 310 allows a command that is currently being executed to be completed, but prevents the execution of the next command in the command buffer from starting so that processor core 312 can execute the newly received high priority command (or command buffer)” (interpreted as the execution of the commands or sub-tasks regarding an event with respect to the high priority queue by the GPU coprocessor)).
McCrary and Rogers are combinable because they are in the same field of endeavor, namely processor task management. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the hardware-based GPU scheduling method of McCrary to include the priority-based command execution method of Rogers. McCrary provides a method for scheduling and executing commands issued by a first processor, such as a CPU, on a second processor, such as a GPU, and enhances system performance through prioritization and scheduling of commands by the GPU. Incorporating the priority-based command execution of Rogers would provide a command processor that can access multiple queues in which commands are executed based on high, mid, and low priority, allowing priority commands to be obtained within a desirable latency and thereby reducing latency in graphics processing systems such as those taught in McCrary.
Regarding claim 26 (New), McCrary in view of Rogers teaches the apparatus as recited in claim 22.
McCrary further discloses
-wherein to set up the SIMD unit, the scalar unit is further configured, independent of any external processor by executing the persistent compute kernel, to: fetch a sequence of instructions of the procedure; and send the instructions to the SIMD unit for execution (McCrary, paragraphs [0031] and [0032]; Reference at paragraph [0031] discloses Ring list controller (RLC) 140 and a command processor 150, RLC 140 working together with command processor 150 for processing the ring buffers in the GPU 104, and paragraph [0032] discloses the command processor generating commands to be executed in the GPU 104).

Regarding claim 28 (New), McCrary discloses a method (McCrary, Abstract) comprising:
-setting up a single time, by a host processor, a graphics processing unit (GPU) coprocessor (McCrary, paragraphs [0031] and [0032]; Reference at paragraph [0031] discloses Ring list controller (RLC) 140 and a command processor 150, interpreted as a GPU coprocessor) comprises a single instruction multiple data (SIMD) unit comprising a plurality of processing elements configured to concurrently execute respective data items (McCrary, paragraphs [0004] and [0031]; Reference at paragraph [0031] discloses use of a SIMD processor that works together with RLC 140, the SIMD containing ALUs and processing units for performing geometry and vertex calculations (interpreted as a plurality of processing units for concurrent data execution));
McCrary does not explicitly disclose, but Rogers teaches
-wherein, in response to determining that a procedure is ready for processing, setting up, independent of the host processor, the SIMD unit to execute the procedure by the GPU coprocessor executing a persistent compute kernel until receiving a notification to stop executing the persistent compute kernel (Rogers, paragraphs [0040]-[0041]; Reference discloses that a command processor 310 (i.e. internal processor) retrieves commands from ring buffers 304-308 and sends them to processing core 312 for execution. RLC 311, under the direction of kernel mode driver 214 (i.e. persistent compute kernel), controls command processor 310 to switch between ring buffers of ring buffers 304-308 (i.e. procedure is ready)…. In alternate embodiments, RLC 311 controls command processor 310 to retrieve commands from ring buffers 304-308 according to other schemes. For example, command processor 310 can retrieve all commands from a higher priority buffer before moving on to retrieve commands from a lower priority buffer (interpreted as the concept of a persistent compute kernel being executed)).
McCrary and Rogers are combinable because they are in the same field of endeavor, namely processor task management. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the hardware-based GPU scheduling method of McCrary to include the priority-based command execution method of Rogers. McCrary provides a method for scheduling and executing commands issued by a first processor, such as a CPU, on a second processor, such as a GPU, and enhances system performance through prioritization and scheduling of commands by the GPU. Incorporating the priority-based command execution of Rogers would provide a command processor that can access multiple queues in which commands are executed based on high, mid, and low priority, allowing priority commands to be obtained within a desirable latency and thereby reducing latency in graphics processing systems such as those taught in McCrary.

Regarding claim 29 (New), McCrary in view of Rogers teaches the method as recited in claim 28.
McCrary does not explicitly disclose, but Rogers teaches
-further comprising monitoring, by the GPU coprocessor executing the persistent compute kernel, a persistent queue (Rogers, paragraph [0040]; Reference discloses that command processor 310 can retrieve all commands from a higher priority buffer before moving on to retrieve commands from a lower priority buffer…. command processor 310 and RLC 311 (i.e. a scalar unit) form a multithreaded system that can monitor the status of more than one ring buffer) configured to store at least data for execution of procedures by the SIMD unit (Rogers, paragraphs [0010], [0034], and [0039]; Reference at paragraph [0010] discloses a processing device having multiple queues, in which a command processor retrieves commands from the set of queues, including a high priority queue holding high priority commands. Paragraphs [0034] and [0039] describe the use of a kernel mode driver for retrieving commands of the virtual processing devices (interpreted as implementing a compute kernel), and paragraph [0039] details: “For example, high priority commands (e.g., computational commands), mid priority commands (e.g., rendering commands), and low priority commands (e.g., background commands) can be held in queues of virtual devices 208, 210, and 212, respectively. Unlike GPU 204, however, GPU 302 includes multiple ring buffers (i.e. queues) to receive commands from CPU 202. In an embodiment, GPU 302 includes a ring buffer for each priority type” (i.e. persistent queue, as the commands are executed based on priority)).
McCrary and Rogers are combinable because they are in the same field of endeavor, namely processor task management. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the hardware-based GPU scheduling method of McCrary to include the priority-based command execution method of Rogers. McCrary provides a method for scheduling and executing commands issued by a first processor, such as a CPU, on a second processor, such as a GPU, and enhances system performance through prioritization and scheduling of commands by the GPU. Incorporating the priority-based command execution of Rogers would provide a command processor that can access multiple queues in which commands are executed based on high, mid, and low priority, allowing priority commands to be obtained within a desirable latency and thereby reducing latency in graphics processing systems such as those taught in McCrary.

Regarding claim 30 (New), McCrary in view of Rogers teaches the method as recited in claim 29.
McCrary does not explicitly disclose, but Rogers teaches
-further comprising selecting, by the GPU coprocessor executing the persistent compute kernel, the procedure from a plurality of procedures for execution on the SIMD unit responsive to determining that: a message is stored in the persistent queue, wherein the persistent queue is further configured to store messages (Rogers, paragraph [0041]; Reference discloses that command processor 310 can be further configured to preempt command buffers being executed on processing core 312. For example, if processing core 312 is executing commands included in a command buffer having a mid priority and command processor 310 can determine that a high priority command or command buffer has been entered into ring buffer 304, command processor 310 can preempt the command buffer being executed on processing core 312. Determining that the high priority command has been entered into the ring buffer is interpreted as detecting the message in the persistent queue, the commands being an identified first sub-task to be executed by processing core 312, which operates alongside command processor 310);
-and the message maps to an event that specifies executing the procedure (Rogers, paragraph [0040]; Reference discloses that command processor 310 retrieves commands from ring buffers 304-308 and sends them to processing core 312 for execution. RLC 311, under the direction of kernel mode driver 214, controls command processor 310 to switch between ring buffers of ring buffers 304-308…. In alternate embodiments, RLC 311 controls command processor 310 to retrieve commands from ring buffers 304-308 according to other schemes. For example, command processor 310 can retrieve all commands from a higher priority buffer before moving on to retrieve commands from a lower priority buffer. The commands sent to the higher priority queue are interpreted as messages sent to the persistent queue and mapped to persistent sub-tasks or events targeting the GPU coprocessor, because command processor 310 sends the commands for execution within the processing core), wherein the procedure comprises one or more sub-tasks selected for execution based on a mapping of the message to the event (Rogers, paragraph [0041]; Reference discloses: “Specifically, command processor 310 allows a command that is currently being executed to be completed, but prevents the execution of the next command in the command buffer from starting so that processor core 312 can execute the newly received high priority command (or command buffer)” (interpreted as the execution of the commands or sub-tasks regarding an event with respect to the high priority queue by the GPU coprocessor)).
McCrary and Rogers are combinable because they are in the same field of endeavor, namely processor task management. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the hardware-based GPU scheduling method of McCrary to include the priority-based command execution method of Rogers. McCrary provides a method for scheduling and executing commands issued by a first processor, such as a CPU, on a second processor, such as a GPU, and enhances system performance through prioritization and scheduling of commands by the GPU. Incorporating the priority-based command execution of Rogers would provide a command processor that can access multiple queues in which commands are executed based on high, mid, and low priority, allowing priority commands to be obtained within a desirable latency and thereby reducing latency in graphics processing systems such as those taught in McCrary.

Regarding claim 33 (New), McCrary in view of Rogers teaches the method as recited in claim 29.
McCrary further discloses
-wherein to set up the SIMD unit independent of any external processor, the method further comprises: fetching, by the GPU coprocessor executing the persistent compute kernel, a sequence of instructions of the procedure; and sending, by the GPU coprocessor executing the persistent compute kernel, the instructions to the SIMD unit for execution (McCrary, paragraphs [0031] and [0032]; Reference at paragraph [0031] discloses Ring list controller (RLC) 140 and a command processor 150, RLC 140 working together with command processor 150 for processing the ring buffers in the GPU 104, and paragraph [0032] discloses the command processor generating commands to be executed in the GPU 104).

Regarding claim 34 (New), McCrary discloses a system (McCrary, Abstract) comprising:
-a graphics processing unit (GPU) coprocessor comprising a single instruction multiple data (SIMD) unit comprising a plurality of processing elements configured to concurrently execute respective data items (McCrary, paragraphs [0004] and [0031]; Reference at paragraph [0031] discloses use of a SIMD processor that works together with RLC 140, the SIMD containing ALUs and processing units for performing geometry and vertex calculations (interpreted as a plurality of processing units for concurrent data execution));
McCrary does not explicitly disclose, but Rogers teaches
-and a host processor configured to set up a single time the GPU coprocessor with a persistent compute kernel (Rogers, paragraphs [0040]-[0041]; Reference discloses that a command processor 310 (i.e. host processor) retrieves commands from ring buffers 304-308 and sends them to processing core 312 for execution. RLC 311, under the direction of kernel mode driver 214 (i.e. persistent compute kernel), controls command processor 310 to switch between ring buffers of ring buffers 304-308…. In alternate embodiments, RLC 311 controls command processor 310 to retrieve commands from ring buffers 304-308 according to other schemes. For example, command processor 310 can retrieve all commands from a higher priority buffer before moving on to retrieve commands from a lower priority buffer (interpreted as the concept of a persistent compute kernel being executed a single time));
-wherein, in response to determining that a procedure is ready for processing, the GPU coprocessor is configured to set up, independent of the host processor, the SIMD unit to execute the procedure by the GPU coprocessor executing a persistent compute kernel until receiving a notification to stop executing the persistent compute kernel (Rogers, paragraphs [0040]-[0041]; Reference discloses that a command processor 310 (i.e. internal processor) retrieves commands from ring buffers 304-308 and sends them to processing core 312 for execution. RLC 311, under the direction of kernel mode driver 214 (i.e. persistent compute kernel), controls command processor 310 to switch between ring buffers of ring buffers 304-308 (i.e. procedure is ready)…. In alternate embodiments, RLC 311 controls command processor 310 to retrieve commands from ring buffers 304-308 according to other schemes. For example, command processor 310 can retrieve all commands from a higher priority buffer before moving on to retrieve commands from a lower priority buffer (interpreted as the concept of a persistent compute kernel being executed)).
McCrary and Rogers are combinable because they are in the same field of endeavor, namely processor task management. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the hardware-based GPU scheduling method of McCrary to include the priority-based command execution method of Rogers. McCrary provides a method for scheduling and executing commands issued by a first processor, such as a CPU, on a second processor, such as a GPU, and enhances system performance through prioritization and scheduling of commands by the GPU. Incorporating the priority-based command execution of Rogers would provide a command processor that can access multiple queues in which commands are executed based on high, mid, and low priority, allowing priority commands to be obtained within a desirable latency and thereby reducing latency in graphics processing systems such as those taught in McCrary.

In regards to claim 35. (New) McCrary in view of Rogers teaches the system as recited in claim 34.
McCrary does not explicitly disclose, but Rogers teaches
-wherein the system further comprises a persistent queue (Rogers, paragraph [0040]; Reference discloses command processor 310 can retrieve all commands from a higher priority buffer before moving on to retrieve commands from a lower priority buffer…. Command processor 310 and RLC 311 (i.e. a scalar unit) form a multithreaded system that can monitor the status of more than one ring buffer) configured to store at least data for execution of procedures by the SIMD unit (Rogers, paragraphs [0010], [0034], and [0039]; Reference at paragraph [0010] discloses a processing device having multiple queues, in which a command processor retrieves commands from a set of queues that includes a high priority queue holding high priority commands. Paragraphs [0034] and [0039] describe the use of a kernel mode driver for retrieving commands of the virtual processing devices (interpreted as implementing a compute kernel), and paragraph [0039] details “For example, high priority commands (e.g., computational commands), mid priority commands (e.g., rendering commands), and low priority commands (e.g., background commands) can be held in queues of virtual devices 208, 210, and 212, respectively. Unlike GPU 204, however, GPU 302 includes multiple ring buffers (i.e. queues) to receive commands from CPU 202. In an embodiment, GPU 302 includes a ring buffer for each priority type” (i.e. persistent queue, as the commands are executed based on priority)).
McCrary and Rogers are combinable because they are in the same field of endeavor regarding processor task management. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention for the hardware-based GPU scheduling method of McCrary to include the priority-based command execution method of Rogers in order to provide the user with a method for scheduling and executing commands issued by a first processor, such as a CPU, on a second processor, such as a GPU, enhancing system performance through prioritization and scheduling of commands by the GPU as taught by McCrary, while incorporating the priority command execution taught by Rogers, which uses a command processor that can access multiple queues whose commands are executed based on high, mid, and low priority. This allows priority commands to be obtained within a desirable latency, which is applicable to reducing latency within graphics processing systems such as those taught in McCrary.

In regards to claim 36. (New) McCrary in view of Rogers teaches the system as recited in claim 35.
McCrary does not explicitly disclose, but Rogers teaches
-wherein: the persistent queue is further configured to store messages; and the GPU coprocessor is further configured, independent of the host processor by executing the persistent compute kernel, to select the procedure from a plurality of procedures for execution on the SIMD unit responsive to determining that: a message is stored in the persistent queue (Rogers, paragraph [0041]; Reference discloses the command processor 310 can be further configured to preempt command buffers being executed on processing core 312. For example, if processing core 312 is executing commands included in a command buffer having a mid priority and command processor 310 can determine that a high priority command or command buffer has been entered into ring buffer 304, command processor 310 can preempt the command buffer being executed on processing core 312. The determination that the high priority command has been entered into the ring buffer is interpreted as the detection of the message in the persistent queue, as the commands are an identified first sub-task to be executed by the processing core 312, which operates alongside the command processor 310); and the message maps to an event that specifies executing the procedure (Rogers, paragraph [0040]; Reference discloses the command processor 310 retrieves commands from ring buffers 304-308 and sends them to processing core 312 for execution. RLC 311, under the direction of kernel mode driver 214, controls command processor 310 to switch between ring buffers of ring buffers 304-308…. In alternate embodiments, RLC 311 controls command processor 310 to retrieve commands from ring buffers 304-308 according to other schemes. For example, command processor 310 can retrieve all commands from a higher priority buffer before moving on to retrieve commands from a lower priority buffer. The commands sent to the higher priority queue are interpreted as messages sent to the persistent queue that are mapped to persistent sub-tasks or events targeting the GPU coprocessor, as the command processor 310 sends the commands for execution within the processing core), wherein the procedure comprises one or more sub-tasks selected for execution based on a mapping of the message to the event (Rogers, paragraph [0041]; Reference discloses “specifically, command processor 310 allows a command that is currently being executed to be completed, but prevents the execution of the next command in the command buffer from starting so that processor core 312 can execute the newly received high priority command (or command buffer)” (interpreted as the execution of the commands or sub-tasks regarding an event with respect to the high priority queue by the GPU coprocessor)).
McCrary and Rogers are combinable because they are in the same field of endeavor regarding processor task management. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention for the hardware-based GPU scheduling method of McCrary to include the priority-based command execution method of Rogers in order to provide the user with a method for scheduling and executing commands issued by a first processor, such as a CPU, on a second processor, such as a GPU, enhancing system performance through prioritization and scheduling of commands by the GPU as taught by McCrary, while incorporating the priority command execution taught by Rogers, which uses a command processor that can access multiple queues whose commands are executed based on high, mid, and low priority. This allows priority commands to be obtained within a desirable latency, which is applicable to reducing latency within graphics processing systems such as those taught in McCrary.

In regards to claim 39. (New) McCrary in view of Rogers teaches the system as recited in claim 35.
McCrary further discloses
-wherein to set up the SIMD unit, the GPU coprocessor is further configured, independent of the host processor by executing the persistent compute kernel, to: fetch a sequence of instructions of the procedure; and send the instructions to the SIMD unit for execution (McCrary, paragraphs [0031] and [0032]; Reference at paragraph [0031] discloses Ring list controller (RLC) 140 and a command processor 150, in which the RLC 140 works together with the command processor 150 to process the ring buffers in the GPU 104, and paragraph [0032] discloses the command processor generating commands to be executed in the GPU 104).



Claims 24, 25, 27, 31, 32, 37, 38, and 40 are rejected under 35 U.S.C. 103 as being unpatentable over McCrary (US 2011/0050713 A1) in view of Rogers (US 2012/0139930 A1) as applied to claims 22, 29, and 35 above, and further in view of Soliman (2011, “Mat-Core: A Decoupled Matrix Core Extension for General-Purpose Processors,” hereinafter referenced “Soliman”).

In regards to claim 24. (New) McCrary in view of Rogers teaches the apparatus as recited in claim 22.
McCrary and Rogers do not explicitly disclose, but Soliman teaches
-wherein to set up the SIMD unit, the scalar unit is further configured, independent of any external processor by executing the persistent compute kernel, to: allocate private vector register space with operands stored in the persistent queue to be used by the SIMD unit when executing the procedure (Soliman, Fig. 2 and “Decoupled Mat-Core Architecture” section continued, pages 100-101; Reference at Fig. 2 discloses processing vector/matrix data on multiple (1-D) execution units (i.e. stored operands), and page 96 discloses a straightforward organization of a matrix unit having multiple read and write ports. Reference at pages 100-101 discloses the use of indexed memory access, interpreted as allocation of private vector register space based on the indexed access).
McCrary and Rogers are combinable because they are in the same field of endeavor regarding processor task management. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention for the hardware-based GPU scheduling method of McCrary to include the priority-based command execution method of Rogers in order to provide the user with a method for scheduling and executing commands issued by a first processor, such as a CPU, on a second processor, such as a GPU, enhancing system performance through prioritization and scheduling of commands by the GPU as taught by McCrary, while incorporating the priority command execution taught by Rogers, which uses a command processor that can access multiple queues whose commands are executed based on high, mid, and low priority. This allows priority commands to be obtained within a desirable latency, which is applicable to reducing latency within graphics processing systems such as those taught in McCrary.
McCrary and Soliman are also combinable because they are in the same field of endeavor regarding processor task management. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention for the hardware-based GPU scheduling method of McCrary, in view of the priority-based command execution method of Rogers, to include the Mat-Core processor architecture of Soliman in order to provide the user with a method for scheduling and executing commands issued by a first processor, such as a CPU, on a second processor, such as a GPU, enhancing system performance through prioritization and scheduling of commands by the GPU as taught by McCrary, while incorporating the priority command execution taught by Rogers, which uses a command processor that can access multiple queues whose commands are executed based on high, mid, and low priority, allowing priority commands to be obtained within a desirable latency. Further incorporating the processor architecture taught by Soliman would allow a multi-level ISA to communicate data parallelism to a processor in a compact way, instead of relying on dynamic extraction using complex hardware or static extraction using sophisticated compiler techniques. By providing more cores in a physical package, performance would be improved through parallel processing of threads of code using multi-threading techniques applicable to general-purpose processors as implemented in systems such as those taught in McCrary and Rogers.

In regards to claim 25. (New) McCrary in view of Rogers teaches the apparatus as recited in claim 24.
McCrary and Rogers do not explicitly disclose, but Soliman teaches
-wherein the scalar unit is further configured, independent of any external processor by executing the persistent compute kernel, to: release the allocated private vector register space responsive to determining the SIMD unit has finished executing the procedure (Soliman, “Decoupled Mat-Core Architecture” section continued, pages 100-101; Reference at pages 100-101 discloses the use of indexed memory access for storing and loading elements per clock cycle, interpreted as allocation of private vector register space based on the indexed access).
McCrary and Soliman are also combinable because they are in the same field of endeavor regarding processor task management. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention for the hardware-based GPU scheduling method of McCrary, in view of the priority-based command execution method of Rogers, to include the Mat-Core processor architecture of Soliman in order to provide the user with a method for scheduling and executing commands issued by a first processor, such as a CPU, on a second processor, such as a GPU, enhancing system performance through prioritization and scheduling of commands by the GPU as taught by McCrary, while incorporating the priority command execution taught by Rogers, which uses a command processor that can access multiple queues whose commands are executed based on high, mid, and low priority, allowing priority commands to be obtained within a desirable latency. Further incorporating the processor architecture taught by Soliman would allow a multi-level ISA to communicate data parallelism to a processor in a compact way, instead of relying on dynamic extraction using complex hardware or static extraction using sophisticated compiler techniques. By providing more cores in a physical package, performance would be improved through parallel processing of threads of code using multi-threading techniques applicable to general-purpose processors as implemented in systems such as those taught in McCrary and Rogers.

In regards to claim 27. (New) McCrary in view of Rogers teaches the apparatus as recited in claim 26.
McCrary and Rogers do not explicitly disclose, but Soliman teaches
-wherein the scalar unit is further configured to issue more than one instruction in a single clock cycle to the SIMD unit (Soliman, Fig. 3 and “The Microarchitecture of the MAT-Core Processor” section continued, page 97; Reference discloses “Instead, only three elements per register bank are available (two for reading and one for writing) during each clock cycle. A register bank can be used concurrently at most three instructions as the destination for one and the source of the other two.” The concurrent processing of instructions within each clock cycle is interpreted as the issuing of more than one instruction in a single clock cycle, as performed in the Mat-Core processing architecture).
McCrary and Soliman are also combinable because they are in the same field of endeavor regarding processor task management. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention for the hardware-based GPU scheduling method of McCrary, in view of the priority-based command execution method of Rogers, to include the Mat-Core processor architecture of Soliman in order to provide the user with a method for scheduling and executing commands issued by a first processor, such as a CPU, on a second processor, such as a GPU, enhancing system performance through prioritization and scheduling of commands by the GPU as taught by McCrary, while incorporating the priority command execution taught by Rogers, which uses a command processor that can access multiple queues whose commands are executed based on high, mid, and low priority, allowing priority commands to be obtained within a desirable latency. Further incorporating the processor architecture taught by Soliman would allow a multi-level ISA to communicate data parallelism to a processor in a compact way, instead of relying on dynamic extraction using complex hardware or static extraction using sophisticated compiler techniques. By providing more cores in a physical package, performance would be improved through parallel processing of threads of code using multi-threading techniques applicable to general-purpose processors as implemented in systems such as those taught in McCrary and Rogers.

In regards to claim 31. (New) McCrary in view of Rogers teaches the method as recited in claim 29.
McCrary and Rogers do not explicitly disclose, but Soliman teaches
-wherein to set up the SIMD unit independent of any external processor, the method further comprises: allocating, by the GPU coprocessor executing the persistent compute kernel, private vector register space with operands stored in the persistent queue to be used by the SIMD unit when executing the procedure (Soliman, Fig. 2 and “Decoupled Mat-Core Architecture” section continued, pages 100-101; Reference at Fig. 2 discloses processing vector/matrix data on multiple (1-D) execution units (i.e. stored operands), and page 96 discloses a straightforward organization of a matrix unit having multiple read and write ports. Reference at pages 100-101 discloses the use of indexed memory access, interpreted as allocation of private vector register space based on the indexed access).
McCrary and Soliman are also combinable because they are in the same field of endeavor regarding processor task management. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention for the hardware-based GPU scheduling method of McCrary, in view of the priority-based command execution method of Rogers, to include the Mat-Core processor architecture of Soliman in order to provide the user with a method for scheduling and executing commands issued by a first processor, such as a CPU, on a second processor, such as a GPU, enhancing system performance through prioritization and scheduling of commands by the GPU as taught by McCrary, while incorporating the priority command execution taught by Rogers, which uses a command processor that can access multiple queues whose commands are executed based on high, mid, and low priority, allowing priority commands to be obtained within a desirable latency. Further incorporating the processor architecture taught by Soliman would allow a multi-level ISA to communicate data parallelism to a processor in a compact way, instead of relying on dynamic extraction using complex hardware or static extraction using sophisticated compiler techniques. By providing more cores in a physical package, performance would be improved through parallel processing of threads of code using multi-threading techniques applicable to general-purpose processors as implemented in systems such as those taught in McCrary and Rogers.

In regards to claim 32. (New) McCrary in view of Rogers, and further in view of Soliman, teaches the method as recited in claim 31.
McCrary and Rogers do not explicitly disclose, but Soliman teaches
-further comprising: releasing, by the GPU coprocessor executing the persistent compute kernel, the allocated private vector register space responsive to determining the SIMD unit has finished executing the procedure (Soliman, “Decoupled Mat-Core Architecture” section continued, pages 100-101; Reference at pages 100-101 discloses the use of indexed memory access for storing and loading elements per clock cycle, interpreted as allocation of private vector register space based on the indexed access).
McCrary and Soliman are also combinable because they are in the same field of endeavor regarding processor task management. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention for the hardware-based GPU scheduling method of McCrary, in view of the priority-based command execution method of Rogers, to include the Mat-Core processor architecture of Soliman in order to provide the user with a method for scheduling and executing commands issued by a first processor, such as a CPU, on a second processor, such as a GPU, enhancing system performance through prioritization and scheduling of commands by the GPU as taught by McCrary, while incorporating the priority command execution taught by Rogers, which uses a command processor that can access multiple queues whose commands are executed based on high, mid, and low priority, allowing priority commands to be obtained within a desirable latency. Further incorporating the processor architecture taught by Soliman would allow a multi-level ISA to communicate data parallelism to a processor in a compact way, instead of relying on dynamic extraction using complex hardware or static extraction using sophisticated compiler techniques. By providing more cores in a physical package, performance would be improved through parallel processing of threads of code using multi-threading techniques applicable to general-purpose processors as implemented in systems such as those taught in McCrary and Rogers.

In regards to claim 37. (New) McCrary in view of Rogers teaches the system as recited in claim 35.
McCrary and Rogers do not explicitly disclose, but Soliman teaches
-wherein to set up the SIMD unit, the GPU coprocessor is further configured, independent of the host processor by executing the persistent compute kernel, to: allocate private vector register space with operands stored in the persistent queue to be used by the SIMD unit when executing the procedure (Soliman, Fig. 2 and “Decoupled Mat-Core Architecture” section continued, pages 100-101; Reference at Fig. 2 discloses processing vector/matrix data on multiple (1-D) execution units (i.e. stored operands), and page 96 discloses a straightforward organization of a matrix unit having multiple read and write ports. Reference at pages 100-101 discloses the use of indexed memory access, interpreted as allocation of private vector register space based on the indexed access).
McCrary and Soliman are also combinable because they are in the same field of endeavor regarding processor task management. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention for the hardware-based GPU scheduling method of McCrary, in view of the priority-based command execution method of Rogers, to include the Mat-Core processor architecture of Soliman in order to provide the user with a method for scheduling and executing commands issued by a first processor, such as a CPU, on a second processor, such as a GPU, enhancing system performance through prioritization and scheduling of commands by the GPU as taught by McCrary, while incorporating the priority command execution taught by Rogers, which uses a command processor that can access multiple queues whose commands are executed based on high, mid, and low priority, allowing priority commands to be obtained within a desirable latency. Further incorporating the processor architecture taught by Soliman would allow a multi-level ISA to communicate data parallelism to a processor in a compact way, instead of relying on dynamic extraction using complex hardware or static extraction using sophisticated compiler techniques. By providing more cores in a physical package, performance would be improved through parallel processing of threads of code using multi-threading techniques applicable to general-purpose processors as implemented in systems such as those taught in McCrary and Rogers.

In regards to claim 38. (New) McCrary in view of Rogers, and further in view of Soliman, teaches the system as recited in claim 37.
McCrary and Rogers do not explicitly disclose, but Soliman teaches
-wherein the GPU coprocessor is further configured, independent of the host processor by executing the persistent compute kernel, to: release the allocated private vector register space responsive to determining the SIMD unit has finished executing the procedure (Soliman, “Decoupled Mat-Core Architecture” section continued, pages 100-101; Reference at pages 100-101 discloses the use of indexed memory access for storing and loading elements per clock cycle, interpreted as allocation of private vector register space based on the indexed access).
McCrary and Soliman are also combinable because they are in the same field of endeavor regarding processor task management. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention for the hardware-based GPU scheduling method of McCrary, in view of the priority-based command execution method of Rogers, to include the Mat-Core processor architecture of Soliman in order to provide the user with a method for scheduling and executing commands issued by a first processor, such as a CPU, on a second processor, such as a GPU, enhancing system performance through prioritization and scheduling of commands by the GPU as taught by McCrary, while incorporating the priority command execution taught by Rogers, which uses a command processor that can access multiple queues whose commands are executed based on high, mid, and low priority, allowing priority commands to be obtained within a desirable latency. Further incorporating the processor architecture taught by Soliman would allow a multi-level ISA to communicate data parallelism to a processor in a compact way, instead of relying on dynamic extraction using complex hardware or static extraction using sophisticated compiler techniques. By providing more cores in a physical package, performance would be improved through parallel processing of threads of code using multi-threading techniques applicable to general-purpose processors as implemented in systems such as those taught in McCrary and Rogers.

In regards to claim 40. (New) McCrary in view of Rogers teaches the system as recited in claim 39.
McCrary and Rogers do not explicitly disclose, but Soliman teaches
-wherein the GPU coprocessor is further configured to issue more than one instruction in a single clock cycle to the SIMD unit (Soliman, Fig. 3 and “The Microarchitecture of the MAT-Core Processor” section continued, page 97; Reference discloses “Instead, only three elements per register bank are available (two for reading and one for writing) during each clock cycle. A register bank can be used concurrently at most three instructions as the destination for one and the source of the other two.” The concurrent processing of instructions within each clock cycle is interpreted as the issuing of more than one instruction in a single clock cycle, as performed in the Mat-Core processing architecture).
McCrary and Soliman are also combinable because they are in the same field of endeavor regarding processor task management. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention for the hardware-based GPU scheduling method of McCrary, in view of the priority-based command execution method of Rogers, to include the Mat-Core processor architecture of Soliman in order to provide the user with a method for scheduling and executing commands issued by a first processor, such as a CPU, on a second processor, such as a GPU, enhancing system performance through prioritization and scheduling of commands by the GPU as taught by McCrary, while incorporating the priority command execution taught by Rogers, which uses a command processor that can access multiple queues whose commands are executed based on high, mid, and low priority, allowing priority commands to be obtained within a desirable latency. Further incorporating the processor architecture taught by Soliman would allow a multi-level ISA to communicate data parallelism to a processor in a compact way, instead of relying on dynamic extraction using complex hardware or static extraction using sophisticated compiler techniques. By providing more cores in a physical package, performance would be improved through parallel processing of threads of code using multi-threading techniques applicable to general-purpose processors as implemented in systems such as those taught in McCrary and Rogers.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: See the Notice of References Cited (PTO-892).
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to TERRELL M ROBINSON whose telephone number is (571)270-3526. The examiner can normally be reached 8am-5pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kent Chang can be reached on 571-272-7667. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/TERRELL M ROBINSON/Examiner, Art Unit 2619