Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments
Applicant’s arguments with respect to claims 1-12 and 19-27 have been considered but are moot because the arguments do not apply to any of the references being used in the current rejection.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-12 and 19-27 are rejected under 35 U.S.C. 103 as being unpatentable over Dong (US 2019/0188148, hereinafter Dong) in view of Turner et al. (US 2018/0336133, hereinafter Turner).

Regarding claim 1, Dong discloses 
A method, comprising:
creating a plurality of duplicate memory pages for a plurality of compute units (paragraph [0205]: When the page sharing manager 2213 identifies identical memory pages from different guests (based on equivalent PIs) and/or even from the same guest, the page sharing manager 2213 may merge them into one page, illustrated in FIG. 22 as host shared pages 2213), wherein local memory of each compute unit of the plurality of compute units (paragraph [0204]: a page sharing manager 2213 is running inside the VMM 2210, which may run a separate thread to digest the guest memory pages 2203-2204) stores a respective duplicate memory page of the plurality of duplicate memory pages (paragraph [0205]: When the page sharing manager 2213 identifies identical memory pages from different guests (based on equivalent PIs) and/or even from the same guest, the page sharing manager 2213 may merge them into one page, illustrated in FIG. 22 as host shared pages 2213);
intercepting a memory instruction issued by a particular compute unit of the plurality of compute units (paragraph [0205]: the page sharing manager 2213 compares PIs from different VMs (e.g., comparing PIs 2211 with PIs 2212), and/or compares PIs from the same VM) to the respective duplicate memory page stored in the local memory of the particular compute unit (paragraph [0205]: When the page sharing manager 2213 identifies identical memory pages from different guests (based on equivalent PIs) and/or even from the same guest, the page sharing manager 2213 may merge them into one page, illustrated in FIG. 22 as host shared pages 2213);
identifying that the memory instruction is annotated to indicate that coherence across the plurality of compute units is required (paragraph [0205]: the page sharing manager 2213 compares PIs from different VMs (e.g., comparing PIs 2211 with PIs 2212), and/or compares PIs from the same VM);
responsive to identifying that the memory instruction is annotated to indicate that coherence across the plurality of compute units is required, collapsing the plurality of duplicate memory pages to create a shared memory page in local memory of a select compute unit of the plurality of compute units (paragraph [0205]: When the page sharing manager 2213 identifies identical memory pages from different guests (based on equivalent PIs) and/or even from the same guest, the page sharing manager 2213 may merge them into one page, illustrated in FIG. 22 as host shared pages 2213);
causing the memory instruction to be performed on the shared memory page (paragraph [0210]: the one embodiment of the page sharing manager 2213 applies write-protection to these mappings (e.g., through the EPT or shadow page table)).
Dong does not disclose identifying that the memory instruction is annotated to indicate that coherence across the plurality of compute units, including synchronization of the plurality of duplicate memory pages. Turner discloses identifying that the memory instruction is annotated to indicate that coherence across the plurality of compute units, including synchronization of the plurality of duplicate memory pages (paragraph [0014]: Some aspects may include implementing a synchronization operation for the second cache selected from one of sending, by the first processing device, a page table cache invalidate signal to the second processing device, sending, by the first processing device, an explicit synchronization command to the second processing device, and waiting, by the second processing device, a designated period prior to implementing the synchronization operation; paragraph [0039]: Various processing devices may store copies of the same page table data in respective caches associated with each of the processing devices to realize these performance benefits ... Implementing a page table coherency unit may maintain coherent copies of the page table data in the various caches for use by respective processing devices while improving on the performance lags of the current mechanisms for maintaining coherency between multiple cached copies of the page table data; paragraph [0040]: The page table coherency unit may issue clean and/or invalidate cache commands for the page table data to the processing device to prompt the processing device to execute the cache maintenance commands for its associated cache. The page table coherency unit may stall completion of page table cache synchronization operations until all referenced page table pages are cleaned and/or invalidated in the cache of the processing device; paragraph [0060]: the heterogeneous computing device 300 may include a processing device (e.g., a CPU) 302, a hardware accelerator (e.g., GPU) 306a, a hardware accelerator (e.g., DSP) 306b, and/or a custom hardware accelerator 306c. Each processing device 302, 306a, 306b, 306c may be associated with caches (e.g., private caches 210, 212, 214, 216, and/or shared cache 230 in FIG. 2); paragraph [0064]: The heterogeneous computing device 300 may further include a page table coherency unit 312 configured to manage coherency of the page table data stored in the caches associated with the processing devices 302, 306a, 306b, 306c). It would have been obvious to one of ordinary skill in the art at the time the claimed invention was effectively filed to modify the teaching of Dong by maintaining the same (or coherent copies of) page table data in the respective caches associated with each of the processing devices, such as a processing device (e.g., a CPU), a hardware accelerator (e.g., GPU), a hardware accelerator (e.g., DSP), and/or a custom hardware accelerator, via the synchronization of Turner. The motivation would have been to maintain coherent copies of the page table data in the various caches for use by respective processing devices while improving on the performance lags of the current mechanisms for maintaining coherency between multiple cached copies of the page table data (Turner paragraph [0039]).
Regarding claim 21 referring to claim 1, Dong discloses A system, comprising: a memory storing logic and including a plurality of local memories; a plurality of compute units each having a local memory of the plurality of local memories; and a processor that executes the logic to perform a method comprising: ... (Fig. 1).
Regarding claim 22 referring to claim 1, Dong discloses A non-transitory computer-readable media storing computer instructions that, when executed by a processor, cause the processor to perform a method comprising: ... (paragraph [0231]: As described herein, instructions may refer to specific configurations of hardware such as application specific integrated circuits (ASICs) configured to perform certain operations or having a predetermined functionality or software instructions stored in memory embodied in a non-transitory computer readable medium).
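For illustration only, the page-merging behavior quoted from Dong paragraphs [0204]-[0205] (digesting guest memory pages into page identifiers and treating pages with equivalent PIs as merge candidates) can be sketched as follows. This is a minimal sketch, not Dong's implementation: the function names, the choice of SHA-256 as the hash algorithm, and the byte-for-byte confirmation step are all assumptions introduced here.

```python
import hashlib
from collections import defaultdict


def page_identifier(page: bytes) -> str:
    """Digest page contents into a page identifier (PI).

    SHA-256 is an assumption; the quoted text only says "hash algorithm".
    """
    return hashlib.sha256(page).hexdigest()


def find_mergeable_pages(guest_pages):
    """Group guest pages whose contents are identical.

    guest_pages maps a hypothetical (vm_name, guest_page_number) key to the
    page contents. Returns a sorted list of groups, each group being a sorted
    list of keys that could be backed by one shared host page.
    """
    by_pi = defaultdict(list)
    for key, content in guest_pages.items():
        by_pi[page_identifier(content)].append(key)

    groups = []
    for keys in by_pi.values():
        # Assumed safeguard: confirm byte equality before treating pages
        # with equal PIs as identical.
        first = guest_pages[keys[0]]
        same = [k for k in keys if guest_pages[k] == first]
        if len(same) > 1:
            groups.append(sorted(same))
    return sorted(groups)
```

In this sketch, pages from different guests and pages from the same guest are handled identically, mirroring the "from different guests ... and/or even from the same guest" language of the cited paragraph.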

Regarding claim 2, Dong discloses 
wherein the plurality of compute units include at least one graphics processing unit (GPU) (paragraph [0219]: In the case of a GPU 2300, it can be easily detected through the GPU page table (e.g., GGTT and PPGTT), when the guest modification of the GPU page table is trapped by mediation module 2304. In one embodiment, all guest memory pages are initially considered to be non-DMA pages which may be digested by the page sharing manager 2313. As described above, this may involve comparing PIs from different memory pages to identify equivalent pages and merging these as host shared memory pages 2314).

Regarding claim 3, Dong discloses 
wherein the local memory of the at least one GPU is device memory of the at least one GPU (paragraph [0219]: The example illustrated in FIG. 23 includes service VM 2300 with a GPU mediation module 2304 and GPU driver 2303 and two VMs 2301-2301, each with its own GPU driver 2305-2306 and set of guest memory pages 2307-2308).

Regarding claim 4, Dong does not disclose wherein the plurality of compute units include at least one central processing unit (CPU). Turner discloses wherein the plurality of compute units include at least one central processing unit (CPU) (paragraph [0014]: Some aspects may include implementing a synchronization operation for the second cache selected from one of sending, by the first processing device, a page table cache invalidate signal to the second processing device, sending, by the first processing device, an explicit synchronization command to the second processing device, and waiting, by the second processing device, a designated period prior to implementing the synchronization operation; paragraph [0039]: Various processing devices may store copies of the same page table data in respective caches associated with each of the processing devices to realize these performance benefits ... Implementing a page table coherency unit may maintain coherent copies of the page table data in the various caches for use by respective processing devices while improving on the performance lags of the current mechanisms for maintaining coherency between multiple cached copies of the page table data; paragraph [0040]: The page table coherency unit may issue clean and/or invalidate cache commands for the page table data to the processing device to prompt the processing device to execute the cache maintenance commands for its associated cache. The page table coherency unit may stall completion of page table cache synchronization operations until all referenced page table pages are cleaned and/or invalidated in the cache of the processing device; paragraph [0060]: the heterogeneous computing device 300 may include a processing device (e.g., a CPU) 302, a hardware accelerator (e.g., GPU) 306a, a hardware accelerator (e.g., DSP) 306b, and/or a custom hardware accelerator 306c. Each processing device 302, 306a, 306b, 306c may be associated with caches (e.g., private caches 210, 212, 214, 216, and/or shared cache 230 in FIG. 2); paragraph [0064]: The heterogeneous computing device 300 may further include a page table coherency unit 312 configured to manage coherency of the page table data stored in the caches associated with the processing devices 302, 306a, 306b, 306c). It would have been obvious to one of ordinary skill in the art at the time the claimed invention was effectively filed to modify the teaching of Dong by maintaining the same (or coherent copies of) page table data in the respective caches associated with each of the processing devices, such as a processing device (e.g., a CPU), a hardware accelerator (e.g., GPU), a hardware accelerator (e.g., DSP), and/or a custom hardware accelerator, via the synchronization of Turner. The motivation would have been to maintain coherent copies of the page table data in the various caches for use by respective processing devices while improving on the performance lags of the current mechanisms for maintaining coherency between multiple cached copies of the page table data (Turner paragraph [0039]).

Regarding claim 5, Dong does not disclose wherein the local memory of the at least one CPU is system memory of a system that includes the at least one CPU. Turner discloses wherein the local memory of the at least one CPU is system memory of a system that includes the at least one CPU (paragraph [0014]: Some aspects may include implementing a synchronization operation for the second cache selected from one of sending, by the first processing device, a page table cache invalidate signal to the second processing device, sending, by the first processing device, an explicit synchronization command to the second processing device, and waiting, by the second processing device, a designated period prior to implementing the synchronization operation; paragraph [0039]: Various processing devices may store copies of the same page table data in respective caches associated with each of the processing devices to realize these performance benefits ... Implementing a page table coherency unit may maintain coherent copies of the page table data in the various caches for use by respective processing devices while improving on the performance lags of the current mechanisms for maintaining coherency between multiple cached copies of the page table data; paragraph [0040]: The page table coherency unit may issue clean and/or invalidate cache commands for the page table data to the processing device to prompt the processing device to execute the cache maintenance commands for its associated cache. The page table coherency unit may stall completion of page table cache synchronization operations until all referenced page table pages are cleaned and/or invalidated in the cache of the processing device; paragraph [0060]: the heterogeneous computing device 300 may include a processing device (e.g., a CPU) 302, a hardware accelerator (e.g., GPU) 306a, a hardware accelerator (e.g., DSP) 306b, and/or a custom hardware accelerator 306c. Each processing device 302, 306a, 306b, 306c may be associated with caches (e.g., private caches 210, 212, 214, 216, and/or shared cache 230 in FIG. 2); paragraph [0064]: The heterogeneous computing device 300 may further include a page table coherency unit 312 configured to manage coherency of the page table data stored in the caches associated with the processing devices 302, 306a, 306b, 306c). It would have been obvious to one of ordinary skill in the art at the time the claimed invention was effectively filed to modify the teaching of Dong by maintaining the same (or coherent copies of) page table data in the respective caches associated with each of the processing devices, such as a processing device (e.g., a CPU), a hardware accelerator (e.g., GPU), a hardware accelerator (e.g., DSP), and/or a custom hardware accelerator, via the synchronization of Turner. The motivation would have been to maintain coherent copies of the page table data in the various caches for use by respective processing devices while improving on the performance lags of the current mechanisms for maintaining coherency between multiple cached copies of the page table data (Turner paragraph [0039]).
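For illustration only, the synchronization behavior quoted from Turner paragraphs [0039]-[0040] and [0064] (a page table coherency unit issuing invalidate commands to the caches of several processing devices and treating synchronization as complete only once every cache has been invalidated) can be sketched as follows. This is a simplified, single-threaded sketch under stated assumptions: the class names, method names, and acknowledgement mechanism are hypothetical and are not taken from Turner.

```python
class DeviceCache:
    """Hypothetical per-device cache of page table data (e.g., for a CPU or GPU)."""

    def __init__(self, name):
        self.name = name
        self.entries = {}  # virtual page number -> cached translation

    def invalidate(self, vpage):
        """Drop the cached entry for vpage, if present, and acknowledge."""
        self.entries.pop(vpage, None)
        return True


class PageTableCoherencyUnit:
    """Hypothetical coherency unit managing several device caches."""

    def __init__(self, caches):
        self.caches = list(caches)

    def synchronize(self, vpage):
        """Issue an invalidate to every cache for vpage.

        Completion is reported only when every cache has acknowledged,
        modeling the "stall completion ... until all ... are invalidated"
        language of the cited paragraph.
        """
        acks = [cache.invalidate(vpage) for cache in self.caches]
        return all(acks)
```

A usage example under the same assumptions: after `PageTableCoherencyUnit([cpu, gpu]).synchronize(0x10)` returns, neither `cpu.entries` nor `gpu.entries` holds a stale entry for page `0x10`.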

Regarding claim 6, Dong discloses 
further comprising: setting a first write-duplicate bit in a page table entry for a page corresponding to the plurality of duplicate memory pages, the first write-duplicate bit indicating that the page has been duplicated to create the plurality of duplicate memory pages (paragraph [0205]: the page sharing manager 2213 compares PIs from different VMs (e.g., comparing PIs 2211 with PIs 2212), and/or compares PIs from the same VM. When the page sharing manager 2213 identifies identical memory pages from different guests (based on equivalent PIs) and/or even from the same guest, the page sharing manager 2213 may merge them into one page, illustrated in FIG. 22 as host shared pages 2213).

Regarding claim 7, Dong discloses 
further comprising: setting a second write-duplicate bit in a translation lookaside buffer (TLB) of each compute unit of the plurality of compute units, the second write-duplicate bit indicating that the page has been duplicated to create the plurality of duplicate memory pages (paragraph [0148]: Internal caches and Translation Lookaside Buffers (TLB) included in modern GPUs to accelerate data accesses and address translations; paragraph [0205]: the page sharing manager 2213 compares PIs from different VMs (e.g., comparing PIs 2211 with PIs 2212), and/or compares PIs from the same VM. When the page sharing manager 2213 identifies identical memory pages from different guests (based on equivalent PIs) and/or even from the same guest, the page sharing manager 2213 may merge them into one page, illustrated in FIG. 22 as host shared pages 2213).

Regarding claim 8, Dong discloses 
wherein the collapsing is performed responsive to determining that the second write-duplicate bit in the TLB of the particular compute unit is set to indicate that the respective duplicate memory page stored in the local memory of the particular compute unit is a duplicate of the page (paragraph [0148]: Internal caches and Translation Lookaside Buffers (TLB) included in modern GPUs to accelerate data accesses and address translations; paragraph [0205]: the page sharing manager 2213 compares PIs from different VMs (e.g., comparing PIs 2211 with PIs 2212), and/or compares PIs from the same VM. When the page sharing manager 2213 identifies identical memory pages from different guests (based on equivalent PIs) and/or even from the same guest, the page sharing manager 2213 may merge them into one page, illustrated in FIG. 22 as host shared pages 2213).

Regarding claim 9, Dong discloses 
wherein the collapsing is performed responsive to determining that the second write-duplicate bit in the TLB of the particular compute unit is set to indicate that the respective duplicate memory page stored in the local memory of the particular compute unit is a duplicate of the page (paragraph [0148]: Internal caches and Translation Lookaside Buffers (TLB) included in modern GPUs to accelerate data accesses and address translations; paragraph [0205]: the page sharing manager 2213 compares PIs from different VMs (e.g., comparing PIs 2211 with PIs 2212), and/or compares PIs from the same VM. When the page sharing manager 2213 identifies identical memory pages from different guests (based on equivalent PIs) and/or even from the same guest, the page sharing manager 2213 may merge them into one page, illustrated in FIG. 22 as host shared pages 2213).

Regarding claim 10, Dong discloses 
wherein the collapsing is performed when the particular compute unit that issued the memory instruction (paragraph [0205]: When the page sharing manager 2213 identifies identical memory pages from different guests (based on equivalent PIs) and/or even from the same guest, the page sharing manager 2213 may merge them into one page, illustrated in FIG. 22 as host shared pages 2213) is a GPU (paragraph [0219]: The example illustrated in FIG. 23 includes service VM 2300 with a GPU mediation module 2304 and GPU driver 2303 and two VMs 2301-2301, each with its own GPU driver 2305-2306 and set of guest memory pages 2307-2308).

Regarding claim 11, Dong discloses 
wherein the plurality of compute units are a proper subset of all compute units of a computing system (paragraph [0219]: The example illustrated in FIG. 23 includes service VM 2300 with a GPU mediation module 2304 and GPU driver 2303 and two VMs 2301-2301, each with its own GPU driver 2305-2306 and set of guest memory pages 2307-2308).

Regarding claim 12, Dong discloses 
further comprising: after creating the shared memory page (paragraph [0205]: When the page sharing manager 2213 identifies identical memory pages from different guests (based on equivalent PIs) and/or even from the same guest, the page sharing manager 2213 may merge them into one page, illustrated in FIG. 22 as host shared pages 2213), causing the plurality of compute units to issue memory instructions to the shared memory page (paragraph [0210]: the one embodiment of the page sharing manager 2213 applies write-protection to these mappings (e.g., through the EPT or shadow page table)).

Regarding claim 19, Dong discloses 
wherein the collapsing the plurality of duplicate memory pages to create the shared memory page and the causing the memory instruction to be performed on the shared memory page ensures coherence for the memory instruction (paragraph [0205]: the page sharing manager 2213 compares PIs from different VMs (e.g., comparing PIs 2211 with PIs 2212), and/or compares PIs from the same VM. When the page sharing manager 2213 identifies identical memory pages from different guests (based on equivalent PIs) and/or even from the same guest, the page sharing manager 2213 may merge them into one page, illustrated in FIG. 22 as host shared pages 2213).

Regarding claim 20, Dong discloses 
wherein the particular compute unit and the select compute unit are the same compute unit (paragraph [0219]: The example illustrated in FIG. 23 includes service VM 2300 with a GPU mediation module 2304 and GPU driver 2303 and two VMs 2301-2301, each with its own GPU driver 2305-2306 and set of guest memory pages 2307-2308).

Regarding claim 23, Dong discloses 
wherein the memory instruction is one of: a store instruction, an atomic instruction, or a reduction instruction (paragraph [0205]: When the page sharing manager 2213 identifies identical memory pages from different guests (based on equivalent PIs) and/or even from the same guest, the page sharing manager 2213 may merge them into one page, illustrated in FIG. 22 as host shared pages 2213). 

Regarding claim 24, Dong does not disclose wherein the memory instruction is annotated in advance by a programmer of an application causing the memory instruction to be issued. Turner discloses wherein the memory instruction is annotated in advance by a programmer of an application causing the memory instruction to be issued (paragraph [0045]: The page table coherency unit may be configured to keep the page tables coherent. Relying on the capabilities of the page table coherency unit, programmers may not have to program and software may not have to execute costly cache maintenance operations prior to the processing device issuing the page table cache invalidate signals). It would have been obvious to one of ordinary skill in the art at the time the claimed invention was effectively filed to modify the teaching of Dong by programming the page table coherency unit to keep the page tables coherent, such that programmers do not have to program, and software does not have to execute, costly cache maintenance operations prior to the processing device issuing the page table cache invalidate signals, as taught by Turner. The motivation would have been to maintain coherent copies of the page table data in the various caches for use by respective processing devices while improving on the performance lags of the current mechanisms for maintaining coherency between multiple cached copies of the page table data (Turner paragraph [0039]).

Regarding claim 25, Dong does not disclose wherein the memory instruction is annotated in accordance with a defined consistency model. Turner discloses wherein the memory instruction is annotated in accordance with a defined consistency model (paragraph [0045]: The page table coherency unit may be configured to keep the page tables coherent. Relying on the capabilities of the page table coherency unit, programmers may not have to program and software may not have to execute costly cache maintenance operations prior to the processing device issuing the page table cache invalidate signals). It would have been obvious to one of ordinary skill in the art at the time the claimed invention was effectively filed to modify the teaching of Dong by programming the page table coherency unit to keep the page tables coherent, such that programmers do not have to program, and software does not have to execute, costly cache maintenance operations prior to the processing device issuing the page table cache invalidate signals, as taught by Turner. The motivation would have been to maintain coherent copies of the page table data in the various caches for use by respective processing devices while improving on the performance lags of the current mechanisms for maintaining coherency between multiple cached copies of the page table data (Turner paragraph [0039]).

Regarding claim 26, Dong discloses 
wherein the plurality of duplicate memory pages are collapsed in a manner that results in a single instance of the shared memory page that incorporates all prior memory instructions (paragraph [0205]: When the page sharing manager 2213 identifies identical memory pages from different guests (based on equivalent PIs) and/or even from the same guest, the page sharing manager 2213 may merge them into one page, illustrated in FIG. 22 as host shared pages 2213) issued by the plurality of compute units (paragraph [0204]: page sharing manager 2213 is running inside the VMM 2210, which may run a separate thread to digest the guest memory pages 2203-2204, such as by using hash algorithm, and generate the list of page identifiers (PIs) 2211-2212 per VM (e.g., using the hash value)).

Regarding claim 27, Dong discloses 
wherein after creating the shared memory page (paragraph [0205]: When the page sharing manager 2213 identifies identical memory pages from different guests (based on equivalent PIs) and/or even from the same guest, the page sharing manager 2213 may merge them into one page, illustrated in FIG. 22 as host shared pages 2213), the plurality of compute units are configured to issue subsequent memory instructions to the shared memory page (paragraphs [0205]-[0210]: if GPN1 in VM1 and GPN2 in VM2 maps to host HPN1 and HPN2: GPN1 of VM1 → HPN1, GPN2 of VM2 → HPN2, and the contents of HPN1 and HPN2 are identical, then the page sharing manager 2213 may use one shared host page 2213, say HPN3 (may be HPN1, or HPN2 or a new page with copied contents from HPN1 or HPN2), and map: GPN1 of VM1 → HPN3, plus write-protection, GPN2 of VM2 → HPN3, plus write-protection. In the meantime, the one embodiment of the page sharing manager 2213 applies write-protection to these mappings (e.g., through the EPT or shadow page table). This way, the page sharing manager saves physical memory used for backing the guest memory pages).
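For illustration only, the remapping quoted from Dong paragraphs [0205]-[0210] (two guest pages with identical contents being mapped to one shared host page HPN3, with write-protection applied to both mappings) can be sketched as follows. This is a minimal sketch under stated assumptions: the `Mapping` type, the `share_pages` function, and the dictionary-based page table and host memory are hypothetical stand-ins and do not model an actual EPT or shadow page table.

```python
from dataclasses import dataclass


@dataclass
class Mapping:
    """Hypothetical page table entry: host page number plus a write-protect flag."""
    hpn: int
    write_protected: bool = False


def share_pages(page_table, host_memory, guest_a, guest_b, shared_hpn):
    """Map two guest pages to one shared host page if their contents match.

    page_table maps a guest page key to a Mapping; host_memory maps a host
    page number to its contents. shared_hpn may equal either existing host
    page number or name a new page. Returns True if the pages were merged.
    """
    hpn_a = page_table[guest_a].hpn
    hpn_b = page_table[guest_b].hpn
    if host_memory[hpn_a] != host_memory[hpn_b]:
        return False  # contents differ, so the pages cannot share backing

    # Populate the shared page (a no-op when shared_hpn is hpn_a or hpn_b).
    host_memory[shared_hpn] = host_memory[hpn_a]

    # Remap both guest pages to the shared page, plus write-protection.
    page_table[guest_a] = Mapping(shared_hpn, write_protected=True)
    page_table[guest_b] = Mapping(shared_hpn, write_protected=True)

    # Release host pages that no longer back any guest page in this sketch.
    for old in {hpn_a, hpn_b} - {shared_hpn}:
        del host_memory[old]
    return True
```

In this sketch, merging two one-page guests into a new page leaves a single backing page in `host_memory`, which corresponds to the physical memory saving described in the cited paragraph.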

Allowable Subject Matter
Claims 13-18 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL. See MPEP § 706.07(a). Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action. In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action. In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to SISLEY KIM, whose telephone number is (571)270-7832. The examiner can normally be reached from 9:30 A.M. to 6:30 P.M.
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Emerson Puente, can be reached on (571)272-3652. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300. Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system. Status information for published applications may be obtained from either Private PAIR or Public PAIR. Status information for unpublished applications is available through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/SISLEY N KIM/
Primary Examiner, Art Unit 2196
8/4/2022