Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

DETAILED ACTION
This is in response to applicant’s amendment/response filed on 12/23/2021, which has been entered and made of record. Claims 25, 29, 35 and 37 are amended. Claims 28 and 36 are cancelled. Claims 25-27, 29-35 and 37-39 are pending in the application.
		
	Applicant amended the claims and cancelled claims 40-49. The cancelled range should be 40-44, since there are no claims 45-49 in the original claims.

Response to Arguments
Applicant's arguments regarding the claim rejections under 35 U.S.C. 103 have been considered but are not persuasive.
Applicant argues:

    [Image media_image1.png: applicant's argument, reproduced as an image in the original action]

	Examiner disagrees: As shown in FIG. 4 of Duncan, each copy engine 450(i) has an associated DRAM 220(i). Each DRAM 220(i) may hold, for example, a 1 GB block of memory for surface data. During the copy operation, the 1 GB block is divided into a plurality of sub-blocks, each holding, for example, 256 bits of surface data. ([0069], “copy engine 450(0) configures the crossbar 210 to access a particular memory partition (e.g., memory partition 215(0)) for


Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/process/file/efs/guidance/eTD-info-I.jsp.
Below is the correspondence between some of the claims of the instant application and the claims of U.S. Patent No. 10901647.
Instant Application | U.S. Patent No. 10901647
25 | 1+4
26 | 3
27 | 5
35 | 15+18


Claims 25 and 35 are rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1 and 4 of U.S. Patent No. 10901647 in view of Duncan et al. (US 2014/0109102 A1) and further in view of Shazeer et al. (US 2019/0130213 A1).

wherein each surface data sub-block comprises a start location, size; (Duncan [0075], “In contrast, host interface 206 of FIG. 4 implements a hardware pre-processor 410 that subdivides the copy operation into multiple subtasks associated with copy operations for small chunks of the block of memory specified by the initial copy operation. For example, a copy operation may request a large 256 MB block of memory to be copied from PP memory 204 to system memory 104. The pre-processor 410 transmits a copy command to copy engine 450(0) for the first 4 kB of memory of the 256 MB block of memory. The pre-processor 410 then modifies the original copy operation by incrementing the starting memory address included in the copy command by 4 kB and stores the modified copy command until copy engine 450(0) has finished executing the first 4 kB copy operation.” [0073], “In one embodiment, the pre-processor 410 includes logic for tracking the state of a task. For example, pre-processor 410 may include registers for storing the state of a task such as the next memory location to be copied in the block of memory associated with a copy operation. In another embodiment, pre-processor 410 may store the state of a task in PP memory 204.”)
wherein each of the plurality of sub-copy engines comprises a plurality of sub-buffers, and wherein a sub-buffer is assigned to a surface data sub-block upon the surface data sub-block being scheduled to a sub-copy engine. ( Duncan FIG. 4, [0069], “Conventionally, the size of a copy operation, as specified by an application, is limited by the size of the physical memory. For example, an application may transmit a command to PPU 202(0) that specifies that a copy operation entails copying a 1 GB block of memory from PP memory 204 to system memory 104. As the copy engine 450(0) processes the command, copy engine 450(0) configures the crossbar 
Claims 1 and 4 of U.S. Patent No. 10901647 teach a sub-block of memory. Duncan teaches representing a sub-block of memory by a starting point and a size.
It would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to have combined the teachings of claims 1 and 4 of U.S. Patent No. 10901647 with the specific teachings of Duncan to easily access a sub-block of memory.
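The start-location-and-size representation discussed above can be pictured with a minimal Python sketch. It models a pre-processor that subdivides one copy operation into fixed-size chunks and advances the starting address by the chunk size each time, in the manner Duncan [0075] describes. The function name `subdivide_copy`, the tuple layout, and the `chunk_size` parameter are illustrative assumptions, not Duncan's implementation.

```python
from typing import Iterator, Tuple

KIB = 1024
MIB = 1024 * KIB


def subdivide_copy(start: int, total_size: int,
                   chunk_size: int = 4 * KIB) -> Iterator[Tuple[int, int]]:
    """Yield (start_address, size) pairs covering [start, start + total_size).

    Hypothetical helper: each sub-block is described only by a start
    location and a size; the last sub-block may be shorter than chunk_size.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    offset = 0
    while offset < total_size:
        size = min(chunk_size, total_size - offset)
        yield (start + offset, size)
        offset += size  # advance the starting address past the chunk just issued
```

Under these assumptions, a 256 MB copy issued in 4 kB chunks produces 65536 sub-blocks, each fully identified by its start address and size.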
Duncan further teaches that sub-buffers are assigned to sub-blocks, to be used when the sub-blocks are accessed by the copy engines.
It would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to have combined the teachings of claims 1 and 4 of U.S. Patent No. 10901647 with the specific teachings of Duncan to allow sub-blocks to be saved into the assigned sub-buffers. The benefit would be easier management of data access.
However, claims 1 and 4 of U.S. Patent No. 10901647 in view of Duncan do not explicitly teach, but Shazeer teaches:
Size of a memory block can be represented as a width and a height ([0078], “In particular, each query block is a 2-dimensional query block of a size lq specified by height and width lq=wq hq and the corresponding memory block extends the query block to the top, left and right by hm, wm and again wm pixels, respectively. Thus, the memory block for each query 
Duncan teaches that a memory sub-block can be indicated by a size, for example, 4 kB. Shazeer teaches that a memory block size can be represented by a width and a height.
It would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to have replaced the size representation of claims 1 and 4 of U.S. Patent No. 10901647 in view of Duncan with the width and height representation of Shazeer to obtain predictable results.

Claim 35 recites limitations similar to those of claim 25 and is similarly rejected on the ground of nonstatutory double patenting as being unpatentable over claims 15 and 18 of U.S. Patent No. 10901647 in view of Duncan and further in view of Shazeer.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention 

Claims 25-27 and 35 are rejected under 35 U.S.C. 103 as being unpatentable over Duncan et al. (US 2014/0109102 A1) in view of Liang et al. (US 2016/0321774 A1) and further in view of Shazeer et al. (US 2019/0130213 A1).
Regarding claim 25, Duncan teaches:
An apparatus to facilitate copying surface data (FIG. 4, [0067], “Other commands may be related to data transfer tasks (i.e., copy operations), which are transmitted to the one or more copy engines 450.” [0068], “The copy engines 450 are configured to perform copy operations that move data from one memory location to another memory location.”) comprising: 
a central copy engine (FIG. 4, reproduced in the original action as image media_image2.png) including:
 a command processor to receive a command to access surface data from a source location in memory to a destination location in the memory, ([0074], “For example, to illustrate the operation of copy engines 450, a copy operation may be received by host interface 206 as part of the command stream included in a pushbuffer. The copy operation may specify a large block of memory to be copied from PP memory 204 to system memory 104.”)
interpret the commands and generate parameters to perform an operation; ([0074], “In conventional systems, the host interface would decode at least part of the command to determine that the command should be transmitted to one of the available copy engines and the entire copy operation would be transmitted to a first copy engine.” [0075], “In contrast, host interface 206 of FIG. 4 implements a hardware pre-processor 410 that subdivides the copy operation into multiple subtasks associated with copy operations for small chunks of the block of memory specified by the initial copy operation. For example, a copy operation may request a large 256 MB block of memory to be copied from PP memory 204 to system memory 104.”)
a sub-block generator to receive the parameters divide the surface data into a plurality of surface data sub-blocks, wherein each surface data sub-block comprises a start location, size ; [0075], “In contrast, host interface 206 of FIG. 4 implements a hardware pre-processor 410 that subdivides the copy operation into multiple subtasks associated with copy operations for small chunks of the block of memory specified by the initial copy operation. For example, a copy operation may request a large 256 MB block of memory to be copied from PP memory 204 to system memory 104. The pre-processor 410 transmits a copy command to copy engine 450(0) for the first 4 kB of memory of the 256 MB block of memory. The pre-processor 410 then modifies the original copy operation by incrementing the starting memory address 
a scheduler to receive the plurality of surface data sub-blocks from the central copy engine and schedule the plurality of surface data sub-blocks for processing; ([0075], “The pre-processor 410 transmits a copy command to copy engine 450(0) for the first 4 kB of memory of the 256 MB block of memory. The pre-processor 410 then modifies the original copy operation by incrementing the starting memory address included in the copy command by 4 kB and stores the modified copy command until copy engine 450(0) has finished executing the first 4 kB copy operation. Then, the pre-processor 410 transmits a new copy command to copy engine 450(0) for the next 4 kB of memory, and so forth. However, if host interface 206 receives another copy operation from a higher priority application before the entire 256 MB block of memory has been copied, then pre-processor 410 pre-empts the lower priority copy operation by transmitting a first copy command associated with the higher priority copy operation to copy engine 450(0) instead of the next subsequent copy command associated with the lower priority copy operation. Thus, in a relatively few number of clock cycles, a higher priority application may pre-empt execution of a particular task on a given processing engine because the processing engine is never allocated to the particular task for more than a maximum number of clock cycles.”) and 
a plurality of sub-copy engines to … to process the surface data sub-blocks and perform the memory accesses. ([0067], “For example, I/O unit 205 may receive commands related to processing tasks that are directed to host interface 206 or commands related to memory access operations (i.e., read/write operations) that are directed to memory interface 214 via crossbar unit 210. Commands related to processing tasks may specify one or more pushbuffers that include command streams for execution by PPU 202(0). The host interface 206 reads commands from each of the pushbuffers and transmits the command stream stored in the pushbuffers to the appropriate processing engine. For example, some commands may be related to a graphics rendering task, which are transmitted to the processing cluster array 230 via a front end unit 212 and a task/work unit 207 (not explicitly shown). Other commands may be related to data transfer tasks (i.e., copy operations), which are transmitted to the one or more copy engines 450.” [0068], “The copy engines 450 may execute concurrently with the processing cluster array 230. In order to perform copy operations,”)
wherein each of the plurality of sub-copy engines comprises a plurality of sub-buffers, and wherein a sub-buffer is assigned to a surface data sub-block upon the surface data sub-block being scheduled to a sub-copy engine. ( Duncan FIG. 4, [0069], “Conventionally, the size of a copy operation, as specified by an application, is limited by the size of the physical memory. For example, an application may transmit a command to PPU 202(0) that specifies that a copy operation entails copying a 1 GB block of memory from PP memory 204 to system memory 104. As the copy engine 450(0) processes the command, copy engine 450(0) configures the crossbar 210 to access a particular memory partition (e.g., memory partition 215(0)) for communicating with the DRAM storing at least part of the 1 GB block of memory (e.g., DRAM 220(0)). The copy 
However, Duncan does not explicitly teach, but Liang teaches:
 a plurality of sub-copy engines to operate in parallel to process the surface data sub-blocks and perform the memory accesses.([0067], “In the example of FIG. 4, four 2D sub-engines 142 and four caches 144 are depicted. In this example, four sections (or sub-primitives of a surface) may be operated on in parallel by 2D sub-engines 142.” [0066], “In general, the parallel arrangement of 2D sub-engines 142 as depicted in FIG. 4 may be referred to as a single 2D engine (i.e., along with 2D dispatch processor 140 and caches 144). More generically, in the context of this disclosure, each of 2D sub-engines 142 may be referred to as a parallel address scanning engine. The techniques of this disclosure may be applied to a source surface and/or a destination surface. A source surface is the surface on which 2D sub-engines 142 are performing an operation. The destination surface is the surface created by 2D sub-engines through performance of the operation on the source surface. 2D sub-engines 142 may use respective caches 144 to temporarily store pixel data before storing the pixel data in memory system 107.”)
Duncan teaches a plurality of sub-copy engines. Liang explicitly teaches that a plurality of engines can work in parallel based on the instructions of a scheduler.
It would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to have combined the teachings of Duncan with the teachings of Liang to improve system performance.
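The parallel arrangement relied on from Liang can be pictured with a minimal Python sketch in which a pool of worker threads stands in for the sub-engines and each worker copies one disjoint sub-block. The names `copy_sub_block` and `parallel_copy`, and the default of four workers (chosen to echo the four 2D sub-engines of Liang's FIG. 4), are illustrative assumptions. The sketch models only the structure of the parallel dispatch, not hardware timing or Liang's caches.

```python
from concurrent.futures import ThreadPoolExecutor


def copy_sub_block(src: bytes, dst: bytearray, start: int, size: int) -> int:
    """Copy one sub-block; disjoint sub-blocks never touch the same bytes."""
    dst[start:start + size] = src[start:start + size]
    return size


def parallel_copy(src: bytes, dst: bytearray, chunk_size: int,
                  num_engines: int = 4) -> int:
    """Hypothetical model: num_engines workers stand in for sub-copy engines.

    dst must be at least as long as src. Returns the number of bytes copied.
    """
    sub_blocks = [(s, min(chunk_size, len(src) - s))
                  for s in range(0, len(src), chunk_size)]
    with ThreadPoolExecutor(max_workers=num_engines) as pool:
        copied = pool.map(lambda sb: copy_sub_block(src, dst, *sb), sub_blocks)
        return sum(copied)
```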
However, Duncan in view of Liang does not explicitly teach, but Shazeer teaches:
a width and a height ([0078], “In particular, each query block is a 2-dimensional query block of a size lq specified by height and width lq=wq hq and the corresponding memory block extends the query block to the top, left and right by hm, wm and again wm pixels, respectively. Thus, the memory block for each query block extends the query block one or more pixels to the top in the image, to the left in the image, and to the right in the image.”)
Duncan teaches that a memory sub-block can be indicated by a size, for example, 4 kB. Shazeer teaches that a memory block size can be represented by a width and a height.
It would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to have replaced the size representation of Duncan in view of Liang with the width and height representation of Shazeer to obtain predictable results.
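The difference between a linear size and a width-and-height representation can be pictured with a minimal Python sketch that splits a two-dimensional surface into rectangular sub-blocks, each carrying a start location, a width and a height. The `SubBlock` type and the `tile_surface` function are hypothetical names introduced for illustration; they are not taken from Duncan, Liang, Shazeer or the application.

```python
from typing import List, NamedTuple


class SubBlock(NamedTuple):
    x: int       # start location: column of the top-left pixel
    y: int       # start location: row of the top-left pixel
    width: int
    height: int


def tile_surface(surface_w: int, surface_h: int,
                 tile_w: int, tile_h: int) -> List[SubBlock]:
    """Split a surface into tiles of at most tile_w x tile_h pixels.

    Tiles on the right and bottom edges are clipped to the surface.
    """
    if tile_w <= 0 or tile_h <= 0:
        raise ValueError("tile dimensions must be positive")
    tiles = []
    for y in range(0, surface_h, tile_h):
        for x in range(0, surface_w, tile_w):
            tiles.append(SubBlock(x, y,
                                  min(tile_w, surface_w - x),
                                  min(tile_h, surface_h - y)))
    return tiles
```

In this sketch the linear size of a sub-block is recoverable as width times height, which is the sense in which the two representations describe the same quantity.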

Regarding claim 26, Duncan in view of Liang and Shazeer teaches:
The apparatus of claim 25, wherein the central copy engine further comprises a queue to queue the surface data sub-blocks for transmission to the plurality of copy engines. (Duncan [0077], “In one embodiment, pre-processor 410 tracks each of the pending operations in an ordered list 420 arranged according to priority. Pre-processor 410 is configured to schedule the highest priority pending operation. In another embodiment, host interface 206 includes a number of FIFOs (not shown), each FIFO associated with a given priority level. As host interface 206 receives tasks, the tasks are added to the particular FIFO associated with that tasks priority level. Pre-processor 410 then selects the next pending task in the highest priority FIFO that includes at least one pending task to schedule on the available processing engine.”)
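The FIFO embodiment quoted from Duncan [0077] can be pictured with a minimal Python sketch: one FIFO per priority level, with the next task taken from the highest-priority FIFO that holds at least one pending task. The class name `PriorityFifos`, its methods, and the convention that a larger number means a higher priority are illustrative assumptions.

```python
from collections import deque
from typing import Deque, Dict, Optional


class PriorityFifos:
    """Hypothetical model of one FIFO per priority level.

    Assumption: a larger priority number means a higher priority.
    """

    def __init__(self) -> None:
        self._fifos: Dict[int, Deque[str]] = {}

    def add(self, task: str, priority: int) -> None:
        """Append the task to the FIFO for its priority level."""
        self._fifos.setdefault(priority, deque()).append(task)

    def next_task(self) -> Optional[str]:
        """Pop the oldest task from the highest-priority non-empty FIFO."""
        for priority in sorted(self._fifos, reverse=True):
            fifo = self._fifos[priority]
            if fifo:
                return fifo.popleft()
        return None  # nothing pending at any priority level
```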

Regarding claim 27, Duncan in view of Liang and Shazeer teaches:
The apparatus of claim 26, wherein the scheduler schedules the surface data sub-blocks based on a current sub-block processing load at each of the plurality of sub- copy engines. (Duncan [0075], “The pre-processor 410 transmits a copy command to copy engine 450(0) for the first 4 kB of memory of the 256 MB block of memory. The pre-processor 410 then modifies the original copy operation by incrementing the starting memory address included in the copy command by 4 kB and stores the modified copy command until copy engine 450(0) has finished executing the first 4 kB copy operation. Then, the pre-processor 410 transmits a new copy command to copy engine 450(0) for the next 4 kB of memory, and so forth. However, if host interface 206 receives another copy operation from a higher priority application before the entire 256 MB block of memory has been copied, then pre-processor 410 pre-empts the lower priority copy operation by transmitting a first copy command associated with the higher priority copy operation to copy engine 450(0) instead of the next subsequent copy command associated with the lower priority copy operation. Thus, in a relatively few number of clock cycles, a higher priority application may pre-empt execution of a particular task on a given processing engine because the processing engine is never allocated to the particular task for more than a maximum number of clock cycles.”[0027])
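The load-based scheduling recited in claim 27 can be pictured with a minimal Python sketch that assigns each sub-block to whichever engine currently has the least pending work. The function name `schedule_by_load`, the choice to measure load in bytes already assigned, and the tie-breaking rule are all illustrative assumptions; the sketch is not taken from Duncan or from the application.

```python
from typing import Dict, List


def schedule_by_load(sub_block_sizes: List[int],
                     num_engines: int) -> Dict[int, List[int]]:
    """Assign each sub-block (by index) to the least-loaded engine.

    Hypothetical policy: "load" is the total number of bytes already
    assigned to an engine. Ties go to the lowest-numbered engine.
    """
    load = [0] * num_engines
    assignment: Dict[int, List[int]] = {e: [] for e in range(num_engines)}
    for index, size in enumerate(sub_block_sizes):
        engine = min(range(num_engines), key=lambda e: load[e])
        assignment[engine].append(index)
        load[engine] += size
    return assignment
```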

Claims 29-31 and 37-38 are rejected under 35 U.S.C. 103 as being unpatentable over Duncan in view of Liang and further in view of Shazeer and further in view of Zaidi et al. (US 6016540).

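Regarding claim 29, Duncan in view of Liang and further in view of Shazeer teaches:
The apparatus of claim 25, wherein each of the plurality of sub-copy engines (see claim 25) further comprises
However, Duncan in view of Liang and further in view of Shazeer does not explicitly teach, but Zaidi teaches: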
a dependency data structure to identify a sub-buffer on which a current surface data sub-block is dependent.(col. 2, middle: “ In accordance with another aspect of the present invention, there is provided an apparatus for scheduling instructions for dispatch to an execution unit. The apparatus includes a waiting buffer that receives a plurality of instructions. A dependency matrix receives a plurality of dependency vectors associated with the instructions received in the waiting buffer, wherein each dependency vector has a bit set that indicates each instruction on which the instruction is dependent. A zero detect circuit is coupled to the dependency matrix and delivers a wave vector indicative of an absence of bits being set in each dependency vector. The wave vector is delivered to a reset input of the dependency matrix to clear the bits set in each of the dependency vectors indicating a dependency on the instructions identified in the wave vector. Port assignment logic is coupled to the zero detect circuit and receives the wave vector, determines the instructions identified in the wave vector that are currently executable, and provides a dispatch wave vector indicative of the instructions that are currently executable. An incomplete wave detector is coupled to the port assignment logic and compares the wave vector and the dispatch wave vector, controllably delivers the output of the zero detect circuit as the next wave vector in response to detecting a 
Duncan in view of Liang and further in view of Shazeer teaches sub-buffers. Zaidi teaches building a dependency matrix for the instructions stored in each sub-buffer in a waiting buffer.
It would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to have combined the teachings of Duncan in view of Liang and further in view of Shazeer with the specific teachings of Zaidi to preserve the order of memory accesses.
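The dependency-matrix mechanism quoted from Zaidi can be pictured with a minimal Python sketch. Each entry has a dependency vector, stored here as a bitmask. A zero-detect step finds the entries with no bits set, which form a wave, and the wave is then used to clear the matching bits in the remaining vectors. The function name `dispatch_waves` is hypothetical, and the sketch assumes that every ready entry dispatches at once, so Zaidi's port assignment logic and incomplete wave detector are not modeled.

```python
from typing import List


def dispatch_waves(dep_vectors: List[int]) -> List[List[int]]:
    """Return successive waves of entry indices whose dependency vector is zero.

    dep_vectors[i] is a bitmask; bit j set means entry i depends on entry j.
    Hypothetical simplification: every ready entry dispatches immediately.
    """
    deps = list(dep_vectors)
    pending = set(range(len(deps)))
    waves: List[List[int]] = []
    while pending:
        # Zero detect: entries with no dependency bits set are ready.
        wave = sorted(i for i in pending if deps[i] == 0)
        if not wave:
            raise ValueError("cyclic dependency: no entry can be dispatched")
        wave_mask = sum(1 << i for i in wave)
        pending.difference_update(wave)
        # Reset: clear the bits that pointed at the entries just dispatched.
        for i in pending:
            deps[i] &= ~wave_mask
        waves.append(wave)
    return waves
```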

Regarding claim 30, Duncan in view of Liang and further in view of Shazeer and Zaidi teaches:
The apparatus of claim 29 wherein the dependency data structure is generated by setting a dependency enable bit and assigning a dependent sub-buffer identifier that identifies the sub-buffer on which the current surface data sub-block is dependent. ( Zaidi col. 2, middle: “ In accordance with another aspect of the present invention, there is provided an apparatus for scheduling instructions for dispatch to an execution unit. The apparatus includes a waiting buffer that receives a plurality of instructions. A dependency matrix receives a plurality of dependency vectors associated with the instructions received in the waiting buffer, wherein each dependency vector has a bit set that indicates each instruction on which the instruction is dependent. A zero detect circuit is coupled to the dependency matrix and delivers a wave vector indicative of an absence of bits being set in each dependency vector. The wave 

Regarding claim 31, Duncan in view of Liang and further in view of Shazeer and Zaidi teaches:
The apparatus of claim 30, wherein each sub-buffer stores a dependency enable bit and a dependent sub-buffer identifier. ( Zaidi col. 2, middle: “ In accordance with another aspect of the present invention, there is provided an apparatus for scheduling instructions for dispatch to an execution unit. The apparatus includes a waiting buffer that receives a plurality of instructions. A dependency matrix receives a plurality of dependency vectors associated with the instructions received in the waiting buffer, wherein each dependency vector has a bit set that indicates each instruction on which the instruction is dependent. A zero detect circuit is coupled to the dependency matrix and delivers a wave vector indicative of an absence of bits being set in each dependency vector. The wave vector is delivered to a reset input of the 
Claims 37-38 recite limitations similar to those of claims 29-30, respectively, in the form of a system, and are thus rejected using the same rationale.
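The per-sub-buffer dependency fields recited in claims 30 and 31 can be pictured with a minimal Python sketch in which each sub-buffer stores a dependency enable bit and the identifier of the sub-buffer it depends on. The `SubBuffer` type, the `set_dependency` and `is_ready` functions, and the readiness rule are illustrative assumptions based only on the claim language quoted above; they are not drawn from Zaidi or from the application's specification.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class SubBuffer:
    buffer_id: int
    dep_enable: bool = False              # dependency enable bit
    dep_buffer_id: Optional[int] = None   # sub-buffer this one depends on


def set_dependency(sub_buffer: SubBuffer, depends_on: int) -> None:
    """Set the enable bit and record the dependent sub-buffer identifier."""
    sub_buffer.dep_enable = True
    sub_buffer.dep_buffer_id = depends_on


def is_ready(sub_buffer: SubBuffer, completed: Set[int]) -> bool:
    """Hypothetical rule: ready if no dependency is enabled or it has completed."""
    return (not sub_buffer.dep_enable) or (sub_buffer.dep_buffer_id in completed)
```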


Allowable Subject Matter
Claims 32-34 and 39 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
The following is a statement of reasons for the indication of allowable subject matter: None of the references of record, alone or in combination, teaches the limitation “wherein each of the plurality of sub-copy engines broadcasts a sub-buffer identifier that is being handled in a current clock cycle,” as recited in claim 32 and similarly recited in claim 39.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to YANNA WU whose telephone number is (571)270-0725. The examiner can normally be reached Monday-Thursday 8:00-5:30 ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kee Tung, can be reached at 571-272-7794. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.






/YANNA WU/Primary Examiner, Art Unit 2611