DETAILED ACTION
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This Office Action is in response to the application filed on 10/12/2021, wherein claims 1-12 and 14-21 are pending.  Claims 1-4, 6, 8, 11, 14-15, and 17 are amended, claim 13 is cancelled, and claim 21 is newly added.

Priority
Applicant’s claim for the benefit of a prior-filed application under 35 U.S.C. 119(e) or under 35 U.S.C. 120, 121, 365(c), or 386(c) is acknowledged. Applicant has not complied with one or more conditions for receiving the benefit of an earlier filing date under 35 U.S.C. 120 as follows:
The later-filed application must be an application for a patent for an invention which is also disclosed in the prior application (the parent or original nonprovisional application or provisional application). The disclosure of the invention in the parent application and in the later-filed application must be sufficient to comply with the requirements of 35 U.S.C. 112(a) or the first paragraph of pre-AIA 35 U.S.C. 112, except for the best mode requirement.  See Transco Products, Inc. v. Performance Contracting, Inc., 38 F.3d 551, 32 USPQ2d 1077 (Fed. Cir. 1994).
The disclosure of the prior-filed application, Application No. 15164848, fails to provide adequate support or enablement in the manner provided by 35 U.S.C. 112(a) or pre-AIA 35 U.S.C. 112, first paragraph for one or more claims of this application.  In particular, it does not provide support for “a scheduler comprising plurality of stages, …”.
In addition, the disclosure of the prior-filed application, Application No. 16/270766, fails to provide adequate support or enablement in the manner provided by 35 U.S.C. 112(a) or pre-AIA 35 U.S.C. 112, first paragraph for one or more claims of this application.  In particular, it does not provide support for “…the output command buffer allocator and initializer further comprise an output command buffer write pointer update…”.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-3, 6-12, 14, and 17-20 are rejected under 35 U.S.C. 103 as being unpatentable over Duluk, Jr. et al. (US PGPUB 2014/0176588) (hereafter Duluk), in view of Burke (US PGPUB 2018/0300933).

As for claim 1, Duluk teaches a graph stream processing system (Abstract, “…graphics processing unit…”), comprising:
a plurality of graph streaming processors operative to process a plurality of threads (paragraph 41, “…one or more…SM…each SM…configured to process one or more thread groups…”), wherein each of the plurality of threads include a set of instructions operating on the plurality of graph streaming processors and operating on a set of input data and producing output data (paragraph 41, “…receive instructions...[to execute on] SM 310…” and paragraph 44, “…the input data set a thread is to process…an output dataset a thread is to produce or write…”); and
a scheduler comprising plurality of stages (Fig. 4 and paragraphs 40-41 and 56.  Fig. 4 depicts a graphics processing pipeline that includes a sequence of processing stages.  Stages can be implemented within GPU 208 utilizing SMs 310 and other processing engines (paragraph 56).  “operations of GPC 208 is advantageously controlled via a pipeline manager…that distributes processing tasks to one or more streaming multiprocessors (SMs) 310” and “Each SM 310 includes…a warp scheduler…receives instructions….” (paragraphs 40-41).  Thus, the graphics processing pipeline is implemented with a pipeline manager that distributes the tasks for each stage to the SMs 310 that implement that stage.  Each SM has a warp scheduler which is constructively understood as the scheduler component for the 
wherein each of the stages is coupled to an input command buffer and an output command buffer, wherein the input command buffer of each stage holds commands [graphics primitives] for the stage, and the output command buffer of each stage holds commands [graphics primitives] for a next stage of the plurality of stages (paragraphs 39, 58-59, 63, 65 and 72 in view of paragraph 35.  First, while applicant claims an output buffer and an input buffer, as understood in the specification, they are the same buffer, labeled differently based on the point of view of the output generating stage or the subsequent stage.  Moreover, the current application states “…input command buffers 120, 124 store the index pointing to the data buffer, index to the first input command buffer (120) connected to the compiler is provided by the compiler, subsequent indices are written by the graphic streaming processor array.  Stage 112 reads the command buffers 120 and schedules a thread…the index to the data for execution of code by the processor array 106 is stored in command buffer 120…” (Specification, paragraph 71).  Thus, under the broadest reasonable interpretation, a command of a command buffer can be understood as an object that triggers tasks to further process the object, including graphical objects.  Here, “GPCs…processing task may generate one or more ‘child’ processing tasks during execution…” (paragraph 39).  In a specific embodiment, “…Primitive assembler 420…constructs graphics primitives for processing by geometry processing unit 425… geometry processing unit 425 may be programmed to subdivide the graphics primitives into one or more new graphics primitives and calculate parameters…generate additional graphics primitives or one or more geometry objects 
wherein each command generates at least some of the plurality of threads (paragraph 40-41.  “…a thread refers to an instance of a particular program executing on a particular set of input data…”  and “…via a pipeline manager that distributes processing tasks to one or more streaming multiprocessors (SMs) 310, where each SM 310 configured to process one or more thread groups…” in view of paragraphs 56 and 83),
each stage including physical hardware implemented using digital logic gates, operative to schedule each of the threads, each stage comprising of a command parser, a thread generator and a thread scheduler (Fig. 3 – Warp Scheduler and Instruction Unit 312, and paragraph 41-42, Current application’s parser, generator, and scheduler are functional units of an overall scheduler for a stage (See, Fig. 1B), and are not separate executables, thus, they are understood to include functions performed within an overall scheduler unit.  Here, “…warp scheduler and instruction unit 312 receives 
the command parser operative to interpret commands within a corresponding input command buffer (paragraph 41),
the thread scheduler, coupled to the thread generator operative to dispatch the plurality of threads for operating on the plurality of graph streaming processors, with one or more threads running one or more code blocks on different input data and producing different output data (paragraph 42, “…warp…executing the same program on different input data…”).
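For illustration only, the buffer arrangement as interpreted above, in which the output command buffer of one stage is the same buffer as the input command buffer of the next stage, can be sketched as follows.  This is a minimal sketch under stated assumptions and is not taken from Duluk, Burke, or the present application: the names Stage, build_pipeline, code_block, and the two example code blocks are hypothetical, and threads are run one at a time rather than in parallel on hardware.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List, Optional, Tuple


@dataclass
class Stage:
    """Hypothetical scheduler stage: parse a command, generate a thread, dispatch it."""
    name: str
    code_block: Callable[[int], List[int]]  # assumed: returns commands for the next stage
    input_buffer: Deque[int] = field(default_factory=deque)
    output_buffer: Optional[Deque[int]] = None  # None for the final stage

    def run(self) -> List[int]:
        produced: List[int] = []
        while self.input_buffer:
            command = self.input_buffer.popleft()          # command parser
            thread = lambda c=command: self.code_block(c)  # thread generator
            result = thread()                              # thread scheduler dispatch
            produced.extend(result)
            if self.output_buffer is not None:
                self.output_buffer.extend(result)          # commands for the next stage
        return produced


def build_pipeline(blocks: List[Tuple[str, Callable[[int], List[int]]]]) -> List[Stage]:
    stages = [Stage(name, block) for name, block in blocks]
    for current, following in zip(stages, stages[1:]):
        # One buffer object, two labels: output of `current`, input of `following`.
        current.output_buffer = following.input_buffer
    return stages


stages = build_pipeline([
    ("subdivide", lambda x: [x, x + 1]),  # one command yields two child commands
    ("scale", lambda x: [2 * x]),
])
stages[0].input_buffer.append(10)  # initial command, e.g. supplied by a compiler
final_output: List[int] = []
for stage in stages:
    final_output = stage.run()
```

In this sketch, the single initial command produces two commands for the second stage, each of which generates its own thread there, so final_output ends as [20, 22].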

Duluk teaches receiving instructions which are then executed as threads on the SMs; thus, it would have been obvious that Duluk would have a functional unit coupled to the command parser and operative to generate the plurality of threads to execute on the SM, because doing so enables execution of threads on hardware from instructions as taught by the prior art.  However, in the interest of compact prosecution, Examiner notes that Duluk does not explicitly state a thread generator, coupled to the command parser, operative to generate the plurality of threads.
However, Burke teaches a known method of instruction execution for graphics including the thread generator coupled to the command parser operative to generate the plurality of threads (paragraph 74, “…the instruction unit 254 can dispatch 
One of ordinary skill in the art before the effective filing date of the application would have recognized that applying the known technique of Burke would have yielded predictable results and resulted in an improved system.  It would have been recognized that applying the technique of Burke to the teachings of Duluk would have yielded predictable results because the level of ordinary skill in the art demonstrated by the references applied shows the ability to incorporate such instruction processing features into similar systems.  Further, applying a thread generator, coupled to the command parser and operative to generate the plurality of threads, to Duluk's command parser and thread scheduler, which dispatch commands from an instruction cache to execute as threads on processing cores, would have been recognized by those of ordinary skill in the art as resulting in an improved system that would allow improved parallel execution of instructions across multiple processors (Burke, paragraph 74).

As for claim 2, Burke also teaches wherein the plurality of graph streaming processors simultaneously operates on a plurality of threads of different stages (paragraph 213.  Output of one stage can be sent to another stage/unit that already exists.  Thus, a plurality of executing threads for different stages clearly coexist, and are not limited to being deployed sequentially.  See, e.g., paragraphs 176 and 215).

As for claim 3, Duluk teaches each stage of the scheduler further comprises an output command buffer allocator and initializer (paragraphs 87, 90-91).
Burke also teaches the scheduler further comprising an output command buffer allocator and initializer to manage output command buffer size [return buffer size] (paragraph 305) and clearing of output command buffer before scheduling a thread for processing by the plurality of graph streaming processors (paragraph 301.  Commands with a pipeline flush necessarily finish pending work and clear the output command buffer before the next work is scheduled).

As for claim 6, Duluk also teaches the plurality of graph streaming processors operating on a thread generate write commands to update the output command buffer of each stage (paragraphs 64-65.  The local vertex buffer is written to, and the data is subsequently transferred.  Thus, it is inherent that there is a trigger to update (i.e., write to) the output command buffer).

As for claim 7, Burke teaches wherein the plurality of graph streaming processors complete operation on at least one thread of a first stage before the thread scheduler can dispatch threads from a second stage for operation, wherein operations on the threads of the second stage start after the operations on the at least one thread of the first stage (paragraph 304.  Pipeline synchronization, in which data from one or more instructions is cleared before a next set of commands (the next stage’s actions) is processed, is understood as waiting for completion of operation on at least one thread of a first stage).

As for claim 8, Duluk also teaches wherein commands to generate threads for the second stage are computed by the plurality of graph streaming processors operating on the at least one of threads of the first stage (paragraph 39.  A task generating one or more child processing tasks is understood as generating tasks that follow the task, which are understood as belonging to a different stage).

As for claim 9, Burke also teaches the graph streaming processor system further comprising a compiler to generate the one or more code blocks for operating on the plurality of graph streaming processors (paragraph 152, “compiler”).

As for claim 10, Burke also teaches the compiler provides input commands to initiate processing of the graph streaming processor system (paragraph 316.  Pre-compilation is understood as providing the instructions that the graph stream processor system processes.  The start of a compiled program is understood as commands to initiate processing).

As for claim 11, Duluk teaches a method of graph stream processing (Abstract, “…graphics processing unit…”), comprising:
processing, by a plurality of graph streaming processors, a plurality of threads (paragraph 41, “…one or more…SM…each SM…configured to process one or more thread groups…”), wherein each of the plurality of threads include a set of instructions operating on the plurality of graph streaming processors operating on a set of input data 
scheduling the plurality of threads by a scheduler (paragraph 41.  The pipeline managed by the pipeline manager is understood as different stages of a pipeline, wherein each SM processes one or more thread groups using a warp scheduler within the SM.  Both the pipeline manager and each stage’s warp scheduler are part of the scheduler that schedules the plurality of threads),
wherein the scheduler includes a plurality of stages (paragraph 41.  The pipeline managed by the pipeline manager is understood as different stages of a pipeline, wherein each SM processes one or more thread groups using a warp scheduler within the SM.  Thus, the scheduler comprises multiple stages.  See paragraph 4, “graphics processing pipeline that includes a sequence of graphics processing stages”),
wherein each of the stages is coupled to an input command buffer and an output command buffer, wherein the input command buffer of each stage holds commands [graphics primitives] for the stage, and the output command buffer of each stage holds commands [graphics primitives] for a next stage of the plurality of stages (paragraph 58-59, 63, 65 and 72 in view of paragraph 35.  Current application states “…input command buffers 120, 124 store the index pointing to the data buffer, index to the first input command buffer (120) connected to the compiler is provided by the compiler, subsequent indices are written by the graphic streaming processor array.  Stage 112 reads the command buffers 120 and schedules a thread…the index to the data for execution of code by the processor array 106 is stored in command buffer 120…” 
wherein each command generates at least some of the plurality of threads (paragraph 40-41.  “…a thread refers to an instance of a particular program executing on a particular set of input data…”  and “…via a pipeline manager that distributes 
wherein each stage includes physical hardware operative to schedule each of the threads (Fig. 3 – Warp Scheduler and Instruction Unit 312, and paragraph 41-42),
further comprising:
interpreting by the scheduler, commands within a corresponding input command buffer (paragraph 41),
dispatching by the thread scheduler, one or more threads for operating on the plurality of graph streaming processors, with each thread running one or more code blocks on different input data and producing different output data (paragraph 42, “…warp…executing the same program on different input data…”).

Duluk teaches receiving instructions which are then executed as threads on the SMs; thus, it would have been obvious that Duluk would have a functional unit coupled to the command parser and operative to generate the plurality of threads to execute on the SM, because doing so enables execution of threads on hardware from instructions as taught by the prior art.  However, in the interest of compact prosecution, Examiner notes that Duluk does not explicitly state that the scheduler generates the one or more threads.
However, Burke teaches a known method of instruction execution for graphics including generating by a thread scheduler, one or more threads (paragraph 74, “…the instruction unit 254 can dispatch instructions as thread group (e.g., warps), with each thread of the thread group assigned to a different execution unit within GPGPU core 
One of ordinary skill in the art before the effective filing date of the application would have recognized that applying the known technique of Burke would have yielded predictable results and resulted in an improved system.  It would have been recognized that applying the technique of Burke to the teachings of Duluk would have yielded predictable results because the level of ordinary skill in the art demonstrated by the references applied shows the ability to incorporate such instruction processing features into similar systems.  Further, applying the generating, by the thread scheduler, of one or more threads to Duluk's command parser and thread scheduler, which dispatch commands from an instruction cache to execute as threads on processing cores, would have been recognized by those of ordinary skill in the art as resulting in an improved system that would allow improved parallel execution of instructions across multiple processors (Burke, paragraph 74).

As for claim 12, Burke also teaches two or more stages of the plurality of stages operate simultaneously (paragraph 301, “…sequence…will process…in at least partial concurrence…”).

As for claim 13, wherein the plurality of graph streaming processors complete operation on a plurality of threads of a node corresponding to a stage at the same time (paragraph 291, “…entire geometric objects…”  The specification does not explain the 

As for claims 14 and 17-20, they contain similar limitations as claims 3 and 6-9, respectively.  Thus, they are rejected under the same rationales.

Claims 4-5, 15-16, and 21 are rejected under 35 U.S.C. 103 as being unpatentable over Duluk and Burke, further in view of Mizrahi et al. (US PGPUB 2016/0291982).

As for claim 4, Duluk and Burke teach that thread processors return after processing a dispatched task and that data is synchronized between stages; thus, it would have been obvious that the system can perform status reporting and completion indication.  Nevertheless, in the interest of compact prosecution, Examiner notes that they do not explicitly teach a write pointer update and indicating a completion pointer for a next stage.
However, Mizrahi teaches a known method of execution of instruction sequences on processor threads including thread processors [2nd hardware thread] including an output command buffer write pointer update to update a write pointer (WP) during the clearing of output command buffer, further the write pointer indicating a completion pointer for a next stage (Fig. 5 – steps 114-118 and paragraphs 97-99.  “if the instruction is found to be the last write operation…thread signals the last write, at a LWI 
One of ordinary skill in the art before the effective filing date of the application would have recognized that applying the known technique of Mizrahi would have yielded predictable results and resulted in an improved system.  It would have been recognized that applying the technique of Mizrahi to the teachings of Duluk and Burke would have yielded predictable results because the level of ordinary skill in the art demonstrated by the references applied shows the ability to incorporate such instruction processing features into similar systems.  Further, applying write update and write completion indicator of output command buffer to Duluk and Burke with thread execution scheduling based on data dependency between stages and output command buffer accordingly, would have been recognized by those of ordinary skill in the art as resulting in an improved system that would allow improved parallel execution of multiple tasks. (Mizrahi, paragraph 9)

As for claim 5, Duluk teaches a plurality of graph streaming processors (Abstract).  Mizrahi teaches that the processors update the completion pointer after completing operation on a thread (paragraph 98).
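For illustration only, the relationship between a write pointer and a completion pointer discussed for claims 4-5 can be sketched as follows.  This is a minimal sketch under stated assumptions and is not taken from Mizrahi, Duluk, Burke, or the present application: the class OutputCommandBuffer and its methods clear, write, signal_last_write, and readable are hypothetical names, and a single producer and single consumer are assumed.

```python
from typing import List, Optional


class OutputCommandBuffer:
    """Hypothetical fixed-size buffer shared by a producing stage and the next stage."""

    def __init__(self, size: int) -> None:
        self.slots: List[Optional[str]] = [None] * size
        self.write_pointer = 0       # index of the next free slot
        self.completion_pointer = 0  # slots below this index are final

    def clear(self) -> None:
        # Clearing resets both pointers before a new thread is scheduled.
        self.slots = [None] * len(self.slots)
        self.write_pointer = 0
        self.completion_pointer = 0

    def write(self, command: str) -> None:
        if self.write_pointer >= len(self.slots):
            raise OverflowError("output command buffer is full")
        self.slots[self.write_pointer] = command
        self.write_pointer += 1

    def signal_last_write(self) -> None:
        # The producing thread is done: publish everything written so far.
        self.completion_pointer = self.write_pointer

    def readable(self) -> List[Optional[str]]:
        # The next stage only sees commands below the completion pointer.
        return self.slots[: self.completion_pointer]


buffer = OutputCommandBuffer(size=4)
buffer.clear()
buffer.write("cmd_a")
buffer.write("cmd_b")
visible_before_signal = buffer.readable()
buffer.signal_last_write()
visible_after_signal = buffer.readable()
```

In this sketch the next stage sees no commands until the producing thread signals its last write, after which both written commands become visible.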

As for claims 15-16, they contain similar limitations as claims 4-5 above.  Thus, they are rejected under the same rationales.
As for claim 21, Duluk teaches the plurality of threads includes a plurality of nodes organized according to a logical network topology, each node including a code block of the one or more code blocks (paragraphs 40-41, 83, and 87 in view of Fig. 4.  Each stage of Fig. 4 is implemented using the GPC’s SMs, running instructions/code on threads.  Moreover, “logical network topology” was not explicitly disclosed.  Instead, the only mention of topology is at Specification paragraph 76, “topology comprises of nodes, data buffers, command buffers and constants buffers.”  Here, the graphics pipeline clearly comprises nodes, data buffers (vertices data) and command buffers containing indices to graphics primitives.)
Mizrahi also teaches dispatching is triggered by execution of a special instruction embedded in the code block and updating the index to an output command buffer of a stage that dispatched a thread instance of the code block, the index storing location in the data buffer for receiving data by the at least one of the plurality of nodes (Fig. 5 – steps 114-118 and paragraphs 97-99.  “…if the instruction is found to be the last write operation…thread signals the last write, at a LWI signaling step 118...” and “…transfer a pointer that points to a location holding the register value, instead of transferring the value itself…”.  Examiner notes that, while labeled as a “special instruction”, the Specification of the present application merely teaches the instruction “indicates to the scheduler to schedule the next thread for execution…” (Specification, paragraph.  Thus, the BRI includes any instruction that functionally indicates the end of work for a stage to start a thread execution for a next stage/thread/work.  Here, the last write instruction functionally serves as an indicator of the end of the writes for the stage and triggers the LWI signaling.)


Response to Arguments
Applicant's arguments filed on 10/12/2021 have been fully considered but they are not persuasive. 
Applicant argues in the remarks:
Argument I, “The cited references do not…teach…each of the stages is coupled to an input command buffer and an output command buffer…Examiner equates the L1 cache 320 of Duluk with input and output command buffer…as claimed, the input command buffer of each stage holds commands for the stage…the output command buffer of each stage holds commands for a next stage of the plurality of stages…” (App. Arg. Pg. 11).
Argument II, “The cited references do not … teach…each command generates at least some of the plurality of threads….L1 cache 320 of Duluk store shared data and instructions, however the input and output command buffers of claim 1 store (hold) commands…each command generates at least some of the plurality of threads…The stored data and instructions of the L1 cache of Duluk do not generate threads.”  (App. Arg. Pg. 11-12).
Argument III, “Cited references do not…teach…each stage including physical hardware implemented using digital logic gates operative to schedule each of the threads…” (App. Arg. Pg. 12).
Argument IV, “Claim 3…Burke teaches one command sequence at the top of the pipeline but does not teach or suggest a scheduler having a plurality of stages with each stage having an input and output command buffer...” (App. Arg. Pg. 12-13).
Argument V, “Claim 4…Burke teaches one command sequence at the top of the pipeline but does not teach or suggest a scheduler having a plurality of stages with each stage having an input and output command buffer…Mizrahi teaches write instructions and managing the data dependency between two running threads.  None of the cited references teach or suggest updating a write pointer (WP) during the clearing of the output command buffer, further the write pointer indicating a completion pointer for a next stage…” (App. Arg. Pg. 13).
Examiner respectfully disagrees for the following reasons:
As for Argument I, see paragraph 9 above.  In addition, Examiner notes that the prior art teaches GPCs 208 can execute a task, and may generate one or more “child” processing tasks during execution to be scheduled (paragraph 39).  The prior art then teaches a specific exemplary embodiment where the geometry processing unit performs the task of geometry processing, which generates new graphics primitives passed to subsequent stages for performing the specific tasks of the subsequent stage (paragraphs 58-59).  Both “tasks” and “graphics primitives” as disclosed in the prior art clearly perform the function of triggering computations; thus, they are functionally commands that trigger specific computational actions.  Here, Applicant does not specify any particular commands generated, thus, any object 
As for argument II, see paragraph 9 above.  In addition, Examiner notes that the prior art explicitly states receiving instructions (or commands) which are dispatched and executed as threads of a thread group by the warp scheduler (paragraph 41).  Thus, the instructions clearly lead to the generation of threads to execute the instructions that are executed by hardware execution units (i.e., Fig. 3 – item 302).  Applicant’s assertion that somehow the instructions in Duluk do not generate threads seems to imply the command/instruction needs to generate threads itself, which is unknown in the art and not taught in the Specification.  Indeed, similar to the prior art, the present application teaches a parser that interprets the commands, and the parser then schedules the threads to implement the command (paragraph 70).  Indeed, Applicant’s specification teaches functionality that is well known and common to most systems: a task/command is scheduled for implementation in a multi-threaded environment by a scheduler.  Thus, Applicant’s argument is not persuasive.
As for argument III, see paragraph 9 above.  In addition, Examiner notes that the prior art teaches a plurality of graphical processing stages (Fig. 4), each of which can be implemented by SMs within a GPC (paragraph 56), and each SM includes a warp scheduler.  Since the SM implements a stage, the warp scheduler is constructively understood as the stage scheduler, and the SM is clearly physical hardware.  Thus, Applicant’s argument is unpersuasive.
As for argument IV, see paragraph 11 above.  In addition, Examiner notes that Duluk explicitly teaches a plurality of graphical processing stages, that each stage can produce tasks for the next stage to perform, and the existence of a buffer that is used to store the output of one stage (and by extension, the input for the next stage) (paragraphs 35, 39, 58-59, 83, 87 and 99).  Thus, Applicant’s argument focusing on Burke is unpersuasive.
As for argument V, see paragraph 12 above.  In addition, Examiner notes that, regarding the argument on Burke, the same response from item i above applies and the argument is similarly unpersuasive.  With respect to the argument on Mizrahi, Applicant’s argument is not based on claimed features.  In particular, Examiner notes that Applicant’s citation to Wikipedia, stating that a pointer is an object in many programming languages that stores a memory address, is directed to a software programming language pointer concept that is distinct from the current application.  But more importantly, the prior art teaches LWI checking and LWI signaling, to check the changing status of the LWI (paragraph 98).  LWI is explicitly understood as the location of the last write operation (paragraph 34).  Thus, the changing status of the reference to the location of the last write operation is properly understood as a write pointer.  Moreover, the changing status can indicate the completion of a last write.  Thus, Applicant’s argument is unpersuasive.


Conclusion
THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to KEVIN X LU whose telephone number is (571)270-1233.  The examiner can normally be reached on M-F 10am-6pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.  
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Lewis Bullock, can be reached at (571)272-3759.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.






/KEVIN X LU/
Examiner, Art Unit 2199

/LEWIS A BULLOCK JR/
Supervisory Patent Examiner, Art Unit 2199