DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This communication is in response to the application filed on 06/25/2021 and the response to the Election/Restriction filed on 08/24/2022.
Claims 1-10 are elected for examination.  Claims 11-20 are not elected.
Claims 1-10 are pending and are rejected.

Priority
Acknowledgment is made of applicant’s claim for foreign priority under 35 U.S.C. 119 (a)-(d). The certified copy has been filed in parent Application No. IN202141013580t, filed on 03/06/2021.

Information Disclosure Statement
The information disclosure statement (IDS) submitted on 7/4/2022 was filed.  The submission is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claim 1 is rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
As to claim 1, the phrase “optimizer state” is unclear.  The description does not clearly define what an optimizer state is.  For the purpose of examination, “optimizer state” is interpreted as a state of optimizing a process or its execution.


Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees.  A nonstatutory double patenting rejection is appropriate where the claims at issue are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); and In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on a nonstatutory double patenting ground provided the reference application or patent either is shown to be commonly owned with this application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The USPTO internet Web site contains terminal disclaimer forms which may be used.  Please visit http://www.uspto.gov/forms/.  The filing date of the application will determine what form should be used.  A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission.  For more information about eTerminal Disclaimers, refer to http://www.uspto.gov/patents/process/file/efs/guidance/eTD-info-I.jsp.

Claims 1-8 are provisionally rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1-8 and 19-20 of copending Application No. 17/359553.

Claim comparison: Present Application 17/359471 and Co-pending Application 17/359553

Claim 1 (Present Application 17/359471)
A method for checkpointing and migrating a deep learning training (DLT) job operating at a source node of a cloud computing environment and resuming the DLT job from a checkpointed state on a destination node that is different than the source node, the method comprising:
capturing a graphics processing unit (GPU) state of a GPU executing the DLT job, wherein the GPU state includes GPU data comprising model parameters and an optimizer state located in the GPU at a time of checkpointing;
capturing a central processing unit (CPU) state of a CPU executing the DLT job;
migrating the DLT job to the destination node at the checkpointed state using the GPU state and the CPU state; and
initiating resumption of processing of the DLT job from the checkpointed state on the destination node.
Claim 1 (Co-pending Application 17/359553)
A method for providing checkpointing of a deep learning training (DLT) job, at one node in a cloud computing environment and resuming the DLT job from a checkpointed state on a different node, the method comprising:
capturing a graphics processing unit (GPU) state of a GPU executing on the DLT job, wherein the GPU state includes GPU data comprising model parameters and an optimizer state located in the GPU at a time of checkpointing;
capturing a central processing unit (CPU) state of a CPU executing on the DLT job;
migrating the DLT job to the different node at the checkpointed state using the GPU state and the CPU state; and
initiating resumption of processing of the DLT job from the checkpointed state on the different node.

Claim 2 (Present Application 17/359471)
The method of claim 1, further comprising capturing a portion of GPU memory that is active during processing of the DLT job on the source node, the portion of the GPU memory containing the model parameters.
Claim 2 (Co-pending Application 17/359553)
The method of claim 1, further comprising capturing a portion of GPU memory that is active during processing of the DLT job on the original node, the portion of the GPU memory containing the model parameters.

Claim 3 (Present Application 17/359471)
The method of claim 1, further comprising:
resuming the DLT job on a second GPU and a second CPU of the destination node that are different than the GPU and the CPU, respectively, of the source node.
Claim 3 (Co-pending Application 17/359553)
The method of claim 1, further comprising:
resuming the DLT job on a second GPU and a second CPU that are different than the GPU and the CPU, respectively.

Claim 4 (Present Application 17/359471)
The method of claim 1, further comprising: saving a program state associated with the DLT job; and restoring the DLT job on another node through switching control flow to the program state.
Claim 4 (Co-pending Application 17/359553)
The method of claim 1, further comprising:
saving a program state associated with the DLT job; and
restoring the DLT job on another node through switching control flow to the program state.

Claim 5 (Present Application 17/359471)
The method of claim 1, further comprising: isolating any temporary GPU-related mappings to an address space of a proxy process on a proxy node; and computing the DLT job in a main process associated with the CPU, wherein the proxy process is stateless across checkpoints.
Claim 5 (Co-pending Application 17/359553)
The method of claim 1, further comprising:
isolating GPU-related activities into a separate proxy process that has a different address space than the GPU; and
computing the DLT job in a main process associated with the CPU, wherein the proxy process is stateless across checkpoints, isolating temporary GPU-related mappings to the address space of the proxy process.

Claim 6 (Present Application 17/359471)
The method of claim 5, wherein a main process address space remains without any GPU-related state.
Claim 6 (Co-pending Application 17/359553)
The method of claim 1, further comprising establishing a barrier wherein a main process address space remains without any GPU-related state.

Claim 7 (Present Application 17/359471)
The method of claim 5, further comprising:
directing a proxy server to read GPU function call parameters from shared memory; executing the GPU function calls in an address space of the proxy process; and sending return values to a client of the proxy node through shared memory.
Claims 7, 19 (Co-pending Application 17/359553)
The method of claim 5, further comprising:
directing a proxy server to read the GPU function call parameters from shared memory and execute corresponding GPU function calls in an address space of the proxy process; and
shipping back return values to a proxy client through shared memory.

Claim 8 (Present Application 17/359471)
The method of claim 1, further comprising:
moving GPU-related activity of the DLT job into a separate address space using dynamic library interposition on GPU-related calls, wherein the GPU-related calls are intercepted in the main process by a client of a proxy process.
Claims 8, 20 (Co-pending Application 17/359553)
The method of claim 1, further comprising:
moving GPU-related activity of the DLT job into a separate address space using dynamic library interposition on GPU-related calls, wherein the GPU-related calls are intercepted in the main process by a client of a proxy process, which serializes and writes the GPU function call parameters into shared memory.



Although the claims at issue are not identical, they are not patentably distinct from each other because the claimed subject matter of the present application and that of copending Application No. 17/359553 are substantially the same, and the claimed subject matter of the present application would have been obvious to one of ordinary skill in the art based on the claimed subject matter of copending Application No. 17/359553.
This is a provisional nonstatutory double patenting rejection because the patentably indistinct claims have not in fact been patented.

	
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-8 and 10 are rejected under 35 U.S.C. 103 as being unpatentable over Chaudhary (Balancing Efficiency and Fairness in Heterogeneous GPU Clusters for Deep Learning) in view of Ramadoss (US 10,043,232 B1).
As to claim 1, Chaudhary teaches a method for checkpointing and migrating a deep learning training (DLT) job operating at a source node of a cloud computing environment and resuming the DLT job from a checkpointed state on a destination node that is different than the source node, the method comprising:
capturing a graphics processing unit (GPU) state of a GPU executing the DLT job, wherein the GPU state includes GPU data comprising model parameters at a time of checkpointing (section 4, page 11, checkpointed via a migration tool such as CRIU; some DLT jobs are written with checkpoint capability so that they can resume from a previous checkpoint if it exists. Upon checkpoint, the proxy’s memory manager copies the GPU state to the parent process’ CPU memory and dies.  Section 2.2, page 4, Table 1 lists the time per mini-batch in milliseconds for different deep learning training jobs for the older K80 GPU model and the speedup ratios for newer GPUs like P40, P100, and V100);
capturing a central processing unit (CPU) state of a CPU executing the DLT job (section 4, page 11, upon checkpoint, the proxy’s memory manager copies the GPU state to the parent process’ CPU memory and dies.  The parent process can then be simply CRIU’ed);
migrating the DLT job to the destination node at the checkpointed state using the GPU state and the CPU state (section 4, page 11, The proxy process is responsible for 1) translating all CUDA handles such as stream, context, etc. 2) keeping a log of all state changing CUDA calls, so that they can be replayed upon a restore, and 3) memory management of GPU memory); and
initiating resumption of processing of the DLT job from the checkpointed state on the destination node (section 4, page 11, Upon restore the proxy process replays the log of state changing CUDA calls and copies the GPU memory back.  When the framework is notified to suspend, it completes it within about 100ms by copying the minimal data in the GPU (proxy process) at the end of a mini-batch of training to the CPU memory (parent process), thus, allowing the scheduler to run another job on the GPU).
Chaudhary does not explicitly teach
an optimizer state located in the GPU.
Ramadoss teaches
an optimizer state located in the GPU (col. lines  , graphics processor 2810 can execute different shader programs via separate logic, such that the vertex processor 2805 is optimized to execute operations for vertex shader programs).
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to include in the Chaudhary disclosure an optimization function of the GPU for executing a process, as taught by Ramadoss.  One would be motivated to do so to increase the compute unit utilization within the compute cluster, improving the efficiency and performance of the GPGPU.
	
As to claim 2, Chaudhary and Ramadoss teach the method of claim 1. Chaudhary further teaches
capturing a portion of GPU memory that is active during processing of the DLT job on the source node, the portion of the GPU memory containing the model parameters (section 3.3, page 8, out of the over 110K jobs submitted in that trace, about 86.6% are 1-GPU jobs (capturing a portion of GPU memory).  The dominance of 1-GPU jobs implies that migration can be an effective mechanism to ensure that sufficient number of 1-GPU jobs are "packed" in servers to avoid the above non-work conserving scenario).

As to claim 3, Chaudhary and Ramadoss teach the method of claim 1. Chaudhary further teaches
resuming the DLT job on a second GPU and a second CPU of the destination node that are different than the GPU and the CPU, respectively, of the source node (section 4, page 11, in order to migrate jobs, we need to be able to checkpoint jobs on-demand and then resume these jobs on a different node).

As to claim 4, Chaudhary and Ramadoss teach the method of claim 1. Chaudhary further teaches
saving a program state associated with the DLT job (section 4, page 11, keeping (saving) a log of all state changing CUDA calls); and 
restoring the DLT job on another node through switching control flow to the program state (section 4, page 11, so that they can be replayed upon a restore).

As to claim 5, Chaudhary and Ramadoss teach the method of claim 1. Chaudhary further teaches
isolating any temporary GPU-related mappings to an address space of a proxy process on a proxy node (section 4, pages 11-12, when the framework is notified to suspend, it completes it within about 100ms by copying the minimal data in the GPU (proxy process) at the end of a mini-batch of training to the CPU memory (isolating any temporary GPU-related mappings to an address space), thus, allowing the scheduler to run another job on the GPU); and 
computing the DLT job in a main process associated with the CPU, wherein the proxy process is stateless across checkpoints (section 1, page 2, Gandivafair relies on job migration as a key primitive for enforcing fairness without forfeiting job state (stateless)).

As to claim 6, Chaudhary and Ramadoss teach the method of claim 5. Chaudhary further teaches that
a main process address space remains without any GPU-related state (section 4, page 11, intercept all CUDA calls made by the process, and direct it via our proxy. This way the main process’ address space remains CPU only, and can be easily checkpointed via CRIU).

As to claim 7, Chaudhary and Ramadoss teach the method of claim 5. Chaudhary further teaches
directing a proxy server to read GPU function call parameters from shared memory (section 4, page 11, intercept all CUDA calls made by the process, and direct it via our proxy); 
executing the GPU function calls in an address space of the proxy process (section 4, page 11, translating all CUDA handles such as stream, context, etc. 2) keeping a log of all state changing CUDA calls); and 
sending return values to a client of the proxy node through shared memory (section 4, page 11, so that they can be replayed upon a restore, and 3) memory management of GPU memory).

As to claim 8, Chaudhary and Ramadoss teach the method of claim 1. Chaudhary further teaches
moving GPU-related activity of the DLT job into a separate address space using dynamic library interposition on GPU-related calls, wherein the GPU-related calls are intercepted in the main process by a client of a proxy process (section 3.3, page 8, Gandivafair leverages the mechanism to migrate a job on-demand at a low cost, that allows jobs to be moved across servers. Second, the specific workload of DLT jobs in large clusters is particularly well-suited to a migration policy that performs intelligent packing of jobs to avoid such pathological scenarios.  Section 4, page 11, intercept all CUDA calls made by the process, and direct it via our proxy). 

As to claim 10, Chaudhary and Ramadoss teach the method of claim 1. Ramadoss further teaches
generating a distributed snapshot of all workers associated with the DLT job for inter-worker consistency, and storing the distributed snapshot as part of the checkpointed state (col. 11, lines 7-12, each reset block may have a dedicated space in on-chip context save memory and a snapshot of context state for the set of compute units within a reset block is occasionally saved to the dedicated on-chip context save memory for the reset block. The snapshot state can be used as checkpoint context).
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to include in the Chaudhary disclosure a snapshot of context state, as taught by Ramadoss.  One would be motivated to do so to increase the compute unit utilization within the compute cluster, improving the efficiency and performance of the GPGPU.

Claim 9 is rejected under 35 U.S.C. 103 as being unpatentable over Chaudhary (Balancing Efficiency and Fairness in Heterogeneous GPU Clusters for Deep Learning) in view of Ramadoss (US 10,043,232 B1) and further in view of Travostino (US 20070180436 A1).
As to claim 9, Chaudhary and Ramadoss teach the method of claim 1. Chaudhary does not explicitly teach
prior to initiating resumption of processing of the DLT job from the checkpointed state on the destination node, copying, to the destination node, a delta of changes made to a file system on the source node, the changes being made after capturing the GPU state and the CPU state.
Travostino teaches
prior to initiating resumption of processing of the DLT job from the checkpointed state on the destination node, copying, to the destination node, a delta of changes made to a file system on the source node, the changes being made after capturing the GPU state and the CPU state ([0035], fig. 1, during the synchronization stage, execution of the VM 28 ceases at the source computing system 14 and begins at the destination computing system 18. Before the VM 28 starts executing at the destination computing system 18, a final iteration copy of delta data produces a consistent copy of the VM 28 at the source and destination computing systems).
It would have been obvious to a person of ordinary skill in the art before the effective filing date of the claimed invention to include in the Chaudhary disclosure a copy of delta data between a source and a destination for a migration of a virtual machine, as taught by Travostino.  One would be motivated to do so to balance downtime against the time spent migrating the VM state to the destination computing system.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: Roberts (US 20220188606 A1) and Zhu (US 20220164327 A1).
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ANH NGUYEN whose telephone number is (571)270-0657. The examiner can normally be reached M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Umar Cheema, can be reached at (571)270-3037. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/ANH NGUYEN/Primary Examiner, Art Unit 2456