DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This office correspondence is in response to “Amendment and Response under 37 C.F.R. 1.111” filed on August 8, 2022.
Claims 1 – 20 are pending.
Claims 1 – 20 are amended.
Claims 1 – 20 are rejected.
Response to Arguments
Applicant’s arguments filed on 8/8/2022 have been fully considered and are persuasive in regard to the rejection of claims 1 – 20 under 35 U.S.C. 103, and said rejections from the prior office action are withdrawn.  However, applicant’s amendments precipitated a new search and consideration of the amended claims, and new grounds of rejection were found for claims 1 – 20 under 35 U.S.C. 103.  The examiner now responds to each argument.  Underlined text represents amendments to the claims made subsequent to the prior office action.
In regard to claims 1, 8, and 15, the applicant argues that the prior art combination of Shmueli and Zhou does not teach, anticipate or suggest:
A) “in response to the first stage of the first training job being finished, registering, by the computer, a second stage of the first training job to train [[a]] the first deep learning model after the early stopping of the first stage of the first training job;”
The applicant states:
“ . . . Shmueli teaches the second stage that is an inference stage (instead of a training stage) to apply the trained machine learning model. However, the instant patent application discloses a first second-stage training job (or a second stage of the first training job to train a first deep learning model after the early stopping of a first stage of the first training job) is not an inference stage applying the trained model but a second training stage to train the same deep learning model that has been trained in a first training stage. Therefore, Shmuel does not teach the limitation, and the combination of Shmueli and Zhou does not teach the limitation either.

To further clearly distinguish the limitation from the reference, Applicant amends the limitation as follows: registering, by the computer, a second stage of the first training job to train the first deep learning model after the early stopping of the first stage of the first training job. . . . ” (Applicant’s Remarks page 20)

A) In response to the applicant’s argument:
The applicant amended the limitation under review to require a second stage training of a first training job after an early stopping.  The amended requirement is not explicitly taught by the prior art combination of Shmueli and Zhou.  Therefore, the applicant’s argument is persuasive and the rejections under 35 U.S.C. 103 over Shmueli and Zhou are withdrawn.  However, the applicant’s amendment required a new search and consideration to be performed, which resulted in a new ground of rejection under 35 U.S.C. 103, the amended claims being unpatentable over Shmueli et al. (U.S. 2021/0056108 A1; herein referred to as Shmueli) in view of Zhou et al. (U.S. 2021/01216854 A1; herein referred to as Zhou) in further view of Wesolowski et al. (U.S. 2019/0114537 A1; herein referred to as Wesolowski).  The new prior art reference Wesolowski is analogous art that is directed to establishing access to first and second different computing systems where a machine learning model is assigned for training to the first computing system, and the first computing system creates a check-point during training in response to a first predefined triggering event. The check-point may be a record of an execution state in the training of the machine learning model by the first computing system. In response to a second predefined triggering event, the training of the machine learning model on the first computing system is halted, and in response to a third predefined triggering event, the training of the machine learning model is transferred to the second computing system, which continues training the machine learning model starting from the execution state recorded by the check-point (see Wesolowski – abstract).  When combined with prior art Shmueli and Zhou, Wesolowski teaches the limitation as amended.  The applicant is referred to the rejections described below.
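By way of illustration only, the check-point and resume behavior summarized from the Wesolowski abstract can be sketched in Python.  The sketch is not taken from Wesolowski or from the applicant's disclosure; the toy loss function, the names train and demo, and the use of JSON to stand in for transferring the recorded execution state between computing systems are all assumptions made for this example.

```python
import json


def train(state, steps, lr=0.1):
    """Run `steps` gradient-descent iterations on the toy loss (w - 3)^2.

    `state` plays the role of a check-point: it records the iteration
    number and the current weight value (hypothetical field names).
    """
    w = state["w"]
    iteration = state["iteration"]
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)  # derivative of (w - 3)^2
        w -= lr * grad
        iteration += 1
    return {"iteration": iteration, "w": w}


def demo():
    initial = {"iteration": 0, "w": 0.0}

    # Reference run: ten iterations with no interruption.
    uninterrupted = train(initial, 10)

    # "First system": train four iterations, then halt and record a check-point.
    checkpoint = json.dumps(train(initial, 4))

    # "Second system": load the check-point and continue for six more iterations.
    resumed = train(json.loads(checkpoint), 6)
    return uninterrupted, resumed
```

In this toy setting the resumed run ends in the same state as the uninterrupted run, which is the property a check-point is meant to provide.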
B) “in response to receiving, while executing the second stage of the first training job, a registration of a first stage of a second training job to train a second deep learning model, finishing, by the computer, the small number of epochs in the second stage of the first training job and executing the first stage of the second training job;” (as recited in claim 1 and substantially replicated in claims 8 and 15)
The applicant states:
“ . . . (1) Zhou teaches first stage training but the instant patent application discloses second-stage training job. (2) Zhou teaches different stages of training on different network sets; however, the instant patent application discloses scheduling the different stages of training jobs. Therefore, Zhou does not teach the limitation, and the combination of Shmueli and Zhou does not teach the limitation either.

To further clearly distinguish the limitation from the reference, Applicant amends the limitation as follows: in response to receiving, while executing the second stage of the first training job, a registration of a first stage of a second training job to train a second deep learning model, finishing, by the computer, the small number of epochs in the second stage of the first training job and executing the first stage of the second training job. . . . ” (Applicant’s remarks page 22)

B) In response to the applicant’s argument:
The applicant amended the limitation under review to require a second training job to be scheduled while executing the second stage of the first training job.  The amended requirement is not explicitly taught by the prior art combination of Shmueli and Zhou.  Therefore, the applicant’s argument is persuasive and the rejections under 35 U.S.C. 103 over Shmueli and Zhou are withdrawn.  However, the applicant’s amendment required a new search and consideration to be performed, which resulted in a new ground of rejection under 35 U.S.C. 103, the amended claims being unpatentable over Shmueli et al. (U.S. 2021/0056108 A1; herein referred to as Shmueli) in view of Zhou et al. (U.S. 2021/01216854 A1; herein referred to as Zhou) in further view of Wesolowski et al. (U.S. 2019/0114537 A1; herein referred to as Wesolowski).  The new prior art reference Wesolowski is analogous art which was discussed previously and contains a paragraph directed to scheduling stages of training (see Wesolowski ¶ [0025]), and when combined with the teachings of Shmueli and Zhou, the resulting combination teaches the amended limitation.  The applicant is referred to the rejections described below.
C) “in response to receiving, while executing the second stage of the first training job, a registration of a second stage of the second training job that trains the second deep learning job after the early stopping of the first stage of the second training job and has a higher priority than the second stage of the first training job, finishing, by the computer, the small number of epochs in the second stage of the first training job and executing the second stage of the second training job.”
The applicant states:
“ . . . Zhou’s paragraph 0028 teaches sorting neural networks, searching neural networks, performing different stages (first and second) of training on different neural network sets.  However, the instant patent application discloses scheduling two second stages of different training jobs. Therefore, Zhou does not teach the limitation, and the combination of Shmueli and Zhou does not teach the limitation either.

To further clearly distinguish the limitation from the reference, Applicant amends the limitation as follows: in response to receiving, while executing the second stage of the first training job, a registration of a second stage of the second training job that trains the second deep learning job after the early stopping of the first stage of the second training job and has a higher priority than the second stage of the first training job, finishing, by the computer, the small number of epochs in the second stage of the first training job and executing the second stage of the second training job. . . . ” (Applicant’s remarks pages 23-24)

C) In response to the applicant’s argument:
The applicant amended the limitation under review to require the scheduling of two second stages of different training jobs.  The amended requirement is not explicitly taught by the prior art combination of Shmueli and Zhou.  Therefore, the applicant’s argument is persuasive and the rejections under 35 U.S.C. 103 over Shmueli and Zhou are withdrawn.  However, the applicant’s amendment required a new search and consideration to be performed, which resulted in a new ground of rejection under 35 U.S.C. 103, the amended claims being unpatentable over Shmueli et al. (U.S. 2021/0056108 A1; herein referred to as Shmueli) in view of Zhou et al. (U.S. 2021/01216854 A1; herein referred to as Zhou) in further view of Wesolowski et al. (U.S. 2019/0114537 A1; herein referred to as Wesolowski).  The new prior art reference Wesolowski is analogous art which was discussed previously and contains a paragraph directed to scheduling second stages of training (see Wesolowski ¶ [0026]), and when combined with the teachings of Shmueli and Zhou, the resulting combination teaches the amended limitation.  The applicant is referred to the rejections described below.
Therefore, as a result of the further search and consideration necessitated by the applicant’s amendments to claims 1 – 20, new grounds of rejection were found for:
Claims 1, 8 and 15  are rejected under 35 U.S.C. 103 as being unpatentable over Shmueli et al. (U.S. 2021/0056108 A1; herein referred to as Shmueli) in view of Zhou et al. (U.S. 2021/01216854 A1; herein referred to as Zhou) in further view of Wesolowski et al. (U.S. 2019/0114537 A1; herein referred to as Wesolowski);
Claims 2, 5 - 7, 9, 12 - 14, 16, and 19 - 20 are rejected under 35 U.S.C. 103 as being unpatentable over Shmueli et al. (U.S. 2021/0056108 A1; herein referred to as Shmueli) in view of Zhou et al. (U.S. 2021/01216854 A1; herein referred to as Zhou) in further view of Wesolowski et al. (U.S. 2019/0114537 A1; herein referred to as Wesolowski) in further view of Dirac et al. (U.S. 2020/0151606 A1; herein referred to as Dirac);
Claims 3, 10, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Shmueli et al. (U.S. 2021/0056108 A1; herein referred to as Shmueli) in view of Zhou et al. (U.S. 2021/01216854 A1; herein referred to as Zhou) in further view of Wesolowski et al. (U.S. 2019/0114537 A1; herein referred to as Wesolowski) in further view of Wang et al. (U.S. 2017/0228645 A1; herein referred to as Wang); and
Claims 4, 11, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Shmueli et al. (U.S. 2021/0056108 A1; herein referred to as Shmueli) in view of Zhou et al. (U.S. 2021/01216854 A1; herein referred to as Zhou) in further view of Wesolowski et al. (U.S. 2019/0114537 A1; herein referred to as Wesolowski) in further view of Wang et al. (U.S. 2017/0228645 A1; herein referred to as Wang) in further view of Zhang et al. (U.S. 2020/0175384 A1; herein referred to as Zhang).
The applicant is directed to the respective rejections described below.
The examiner recommends that the applicant review the specification for disclosure that if integrated into the independent claims would distinguish the amended claims from the cited prior art.  The applicant is invited to contact the examiner for an interview to discuss how to move the prosecution forward.
35 USC § 101 Analysis
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title. 

Claims 1 – 20 are directed to statutory subject matter.  The claims are directed to non-abstract improvements in computer related technology.  A claim is non-statutory when it is directed to a judicial exception (e.g. mathematical concepts, mental processes, or certain methods of organizing human activity) without significantly more.  The claimed invention is not directed to a judicial exception.  Instead, the claimed invention is directed to a technological improvement to deep learning model training through efficient use of computing resources in two stage training of a deep learning model.  A computer executes a first first-stage training job to train a deep learning model, which is finished by using early stopping, and registers a first second-stage training job to train the deep learning model that has been trained in the first first-stage training job.  The computer executes the first second-stage training job with a small number of epochs.  In response to receiving a registration of a second first-stage training job, the computer finishes the small number of epochs in the first second-stage training job and executes the second first-stage training job.  Further, in response to receiving a registration of a second second-stage training job that has a higher priority than the first second-stage training job, the computer finishes the small number of epochs in the first second-stage training job and executes the second second-stage training job.  The ordered limitations of the claimed invention provide an improvement for deep learning model training by implementing techniques for two-stage training and for efficient scheduling of the execution of the training jobs.  As such the claimed invention is statutory.
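For illustration only, the ordered scheduling behavior summarized in the preceding paragraph can be sketched in Python.  This is a minimal simulation under stated assumptions and is neither the applicant's implementation nor a teaching of any cited reference: the class names Stage and Scheduler, the chunk size standing in for the "small number of epochs," the rule that a lower number means a higher priority, and the epoch counts in demo are all hypothetical.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count

FIRST, SECOND = 1, 2  # a registered first stage is served before any second stage


@dataclass(order=True)
class Stage:
    stage: int       # FIRST or SECOND
    priority: int    # lower value = higher priority (assumption)
    seq: int         # registration order, used as a tie-breaker
    job: str = field(compare=False)
    epochs: int = field(compare=False)
    second_epochs: int = field(compare=False, default=0)


class Scheduler:
    def __init__(self, chunk=2):
        self.chunk = chunk  # the "small number of epochs" run per slice
        self.queue = []
        self.seq = count()
        self.log = []       # (job, stage, epochs run) per slice

    def register(self, job, stage, epochs, priority=0, second_epochs=0):
        heapq.heappush(
            self.queue,
            Stage(stage, priority, next(self.seq), job, epochs, second_epochs),
        )

    def step(self):
        """Run one scheduling slice; return False when nothing is queued."""
        if not self.queue:
            return False
        s = heapq.heappop(self.queue)
        if s.stage == FIRST:
            # First stage: `epochs` models the point where early stopping fires.
            self.log.append((s.job, FIRST, s.epochs))
            # Finishing the first stage registers the job's second stage.
            self.register(s.job, SECOND, s.second_epochs, s.priority)
        else:
            # Second stage: finish a small chunk, then yield to the queue.
            ran = min(self.chunk, s.epochs)
            self.log.append((s.job, SECOND, ran))
            s.epochs -= ran
            if s.epochs > 0:
                heapq.heappush(self.queue, s)
        return True


def demo():
    sched = Scheduler(chunk=2)
    sched.register("A", FIRST, epochs=5, second_epochs=6, priority=1)
    sched.step()  # A, first stage
    sched.step()  # A, second stage, one chunk
    # Job B is registered while A's second stage is in progress.
    sched.register("B", FIRST, epochs=3, second_epochs=4, priority=0)
    while sched.step():
        pass
    return sched.log
```

In demo, job A's second stage finishes its current chunk and then yields, first to job B's newly registered first stage and then to job B's higher-priority second stage, before resuming its remaining epochs.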
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains.  Patentability shall not be negated by the manner in which the invention was made.


Claims 1, 8 and 15  are rejected under 35 U.S.C. 103 as being unpatentable over Shmueli et al. (U.S. 2021/0056108 A1; herein referred to as Shmueli) in view of Zhou et al. (U.S. 2021/01216854 A1; herein referred to as Zhou) in further view of Wesolowski et al. (U.S. 2019/0114537 A1; herein referred to as Wesolowski).
In regard to claim 1, Shmueli teaches a computer-implemented method for efficient use of computing resources in two stage training of multiple training jobs (see ¶ [0006] “ . . . a method comprising: receiving a plurality of pairs of queries associated with a database, wherein the queries in each pair in the plurality of pairs of queries have an identical FROM clause; at a training stage, training a machine learning model on a training set comprising: (i) the plurality of pairs of queries, and (ii) labels associated with containment rates between each of the pairs of queries over the database; and at an inference stage, applying the trained machine learning model to a pair of target queries, to estimate containment rates between the target pair of queries over the database. . . ”) to train deep learning models, the method comprising (see ¶ [0051] “. . . a specialized deep learning scheme may be used, which is configured to represent pairs of SQL queries. Experiments conducted by the present inventors on a real-world database, have shown that the present disclosure for estimating cardinalities, using containment rates between queries, realizes significant improvements over known cardinality estimation methods . . . ”):
executing, by a computer, a first stage of a first training job  (e.g. training stage) to train a first deep learning model (see ¶ [0006] “ . . . receiving a plurality of pairs of queries associated with a database, wherein the queries in each pair in the plurality of pairs of queries have an identical FROM clause; at a training stage, training a machine learning model on a training set comprising: (i) the plurality of pairs of queries, and (ii) labels associated with containment rates between each of the pairs of queries over the database . . . “);  
finishing, by the computer, the first stage of the first training job, by using early stopping (see ¶ ¶ [0086-0089] “ . . . Building CRN involves two main steps:  (1) Generating a random training set using the schema and data information as described above; and (2) Repeatedly using this training data to train the present CRN model until the mean q-error of the validation test starts to converge to its best absolute value.   In some embodiments, the early stopping technique can be used and stop the training before convergence to avoid overfitting. Both steps are performed on an immutable snapshot of the database. After the training phase, to predict the containment rate of an input query pair, the queries first need to be transformed into their feature representation, and then they are presented as input to the model, and the model outputs the estimated containment rate. . . “);
in response to the first stage of the first training job being finished (see ¶ [0073] “ . . .  an initial training set can be obtained for the model, which consists of, e.g., 100,000 pairs of queries with zero to two joins. The training samples can be split into 80% training samples and 20% validation samples. . . .” see ¶ [0093] “ . . . FIG. 4 shows how the mean q-error of the validation set decreases with additional epochs, until convergence to a mean q-error of around 4.5. The present CRN model requires almost 120 passes on the training set to converge. On average, measured across six runs, a training run with 120 epochs takes almost 200 minutes . . .”);
executing, by the computer, the second stage of the first training job with a small number of epochs (see ¶¶ [0290-0293] “ . . . Following the training phase, and in order to predict the uniqueness rate of an input query, the (encoded) query is simply presented to the present PUNQ model, and the model outputs the estimated uniqueness rate, as described hereinunder. The model was trained and tested using the Tensor-Flow framework, using the Adam training optimize . . . Averaged over 5 runs over the validation set, the best configuration has a 128 batch size, a 512 hidden layer size, and a 0.001 learning rate. These settings are thus used throughout the present model evaluation.  Under these settings, the present PUNQ model converges to a mean q-error of approximately 3.5, after running about 200 passes on the training set. Averaged over 5 runs, the 200 epochs training phase takes nearly 30 minutes.  With these settings, the disk-serialized model size is about 0.5 MB. This includes all the learned weights and biases . . . ”);
Shmueli fails to explicitly teach registering, by the computer, a second stage of the first training job to train the first deep learning model after the early stopping of the first stage of the first training job;
in response to receiving, while executing the second stage of the first training job, a registration of a first stage of a second training job to train a second deep learning model, finishing, by the computer, the small number of epochs in the second stage of the first training job and executing the first stage of the second training job; and in response to receiving, while executing the second stage of the first training job, a registration of a second stage of the second training job that trains the second deep learning job after the early stopping of the first stage of the second training job and has a higher priority than the second stage of the first training job, finishing, by the computer, the small number of epochs in the second stage of the first training job and executing the second stage of the second training job.
However, Zhou teaches in response to receiving, a registration of a first stage of a second training job to train a second deep learning model (e.g. first stage training on first neural network), finishing, by the computer, the small number of epochs (e.g. iteration) in the second stage of the first training job and executing the first stage of the second training job (e.g. first stage training on fourth neural network) (see ¶ [0011] “ . . . after performing first-stage training on the first neural network set to be trained by using the training data set, the method further comprises executing X iterations, the iterations including: adding S evolved neural networks to the neural network library to be searched, wherein the evolved neural networks are obtained by evolving the neural networks in the neural network library to be searched, and S is equal to R; sorting neural networks with a number of trained cycles of the third preset value in the neural network library to be searched and the S evolved neural networks according to the descending order of recognition accuracy on the training data set to obtain a fourth neural network sequence set, and taking the first N neural networks in the fourth neural network sequence set as a third neural network set to be trained; sorting neural networks with a number of trained cycles of the first preset value in the neural network library to be searched according to the descending order of recognition accuracy on the training data set to obtain a fifth neural network sequence set, and taking first M neural networks in the fifth neural network sequence set as a fourth neural network set to be trained; performing the second-stage training on the third neural network set to be trained by using the training data set, and performing the first-stage training on the fourth neural network set to be trained by using the training data set; and the method further comprises removing neural networks which have not been trained in T iterations from the neural network library to be searched, where T is less than X . . . ”); and
in response to receiving, a registration of a second stage of the second training job that trains the second deep learning job after the early stopping of the first stage of the second training job (see ¶ [0026] “ . . . the sorting unit is further configured to: before sorting neural networks with a number of trained cycles of a first preset value in the neural network library to be searched according to a descending order of recognition accuracy on the training data set to obtain a first neural network sequence set and taking first M neural networks in the first neural network sequence set as a first neural network set to be trained, sort neural networks with a number of trained cycles of a third preset value in the neural network library to be searched according to the descending order of recognition accuracy on the training data set to obtain a second neural network sequence set, and taking first N neural networks in the second neural network sequence set as a second neural network set to be trained; and the training unit is further configured to perform second-stage training on the second neural network set to be trained by using the training data set, wherein the sum of the number of training cycles of the second-stage training and the third preset value is equal to the first preset value . . . ”) and has a higher priority (e.g. recognition accuracy) than the second stage of the first training job (see ¶ [0027] “ . . . the neural network searching device further comprises a neural network evolution unit configured to: before sorting neural networks with a number of trained cycles of a third preset value in the neural network library to be searched according to the descending order of recognition accuracy on the training data set to obtain a second neural network sequence set and taking first N neural networks in the second neural network sequence set as a second neural network set to be trained, add R evolved neural networks to the neural network library to be searched, wherein the evolved neural networks are obtained by evolving the neural networks in the neural network library to be searched; and the sorting unit is in particular configured to sort neural networks with a number of trained cycles of the third preset value in the neural network library to be searched and the R evolved neural networks according to the descending order of recognition accuracy on the training data set to obtain a third neural network sequence set, and taking the first N neural networks in the third neural network sequence set as the second neural network set to be trained . . . ”), finishing, by the computer, the small number of epochs in the second stage of the first training job and executing the second stage of the second training job (see ¶ [0028] “ . . . the neural network searching device further comprises an execution unit configured to: after performing first-stage training on the first neural network set to be trained by using the training data set, execute X iterations, the iterations including: adding S evolved neural networks to the neural network library to be searched, wherein the evolved neural networks are obtained by evolving the neural networks in the neural network library to be searched, and S is equal to R; sorting neural networks with a number of trained cycles of the third preset value in the neural network library to be searched and the S evolved neural networks according to the descending order of recognition accuracy on the training data set to obtain a fourth neural network sequence set, and taking the first N neural networks in the fourth neural network sequence set as a third neural network set to be trained; sorting neural networks with a number of trained cycles of the first preset value in the neural network library to be searched according to the descending order of recognition accuracy on the training data set to obtain a fifth neural network sequence set, and taking first M neural networks in the fifth neural network sequence set as a fourth neural network set to be trained; performing the second-stage training on the third neural network set to be trained by using the training data set, and performing the first-stage training on the fourth neural network set to be trained by using the training data set; and the neural network searching device further comprises a removing unit for removing neural networks which have not been trained in T iterations from the neural network library to be searched, where T is less than X . . . ”).
It would have been obvious to one with ordinary skill in the art before the effective filing date of the applicant’s application to incorporate a system and method for acquiring a neural network library to be searched and a training data set; sorting neural networks with a number of trained cycles of a first preset value in the neural network library to be searched according to a descending order of recognition accuracy on the training data set to obtain a first neural network sequence set, and taking first M neural networks in the first neural network sequence set as a first neural network set to be trained; performing first-stage training on the first neural network set to be trained by using the training data set, wherein the number of training cycles of the first-stage training is a second preset value; and taking a neural network with a number of trained cycles of the sum of the first preset value and the second preset value in the neural network library to be searched as a target neural network, as taught by Zhou, into a system and method for receiving a plurality of pairs of queries associated with a database, and at a training stage, training a machine learning model on a training set comprising: (i) the plurality of pairs of queries, and (ii) labels associated with containment rates between each of the pairs of queries over the database; and at an inference stage, applying the trained machine learning model to a pair of target queries, to estimate containment rates between the target pair of queries over the database, as taught by Shmueli.  Such incorporation provides that the two stage training implemented by the teaching of Shmueli is further enhanced by the analysis of the accuracy of the training models to facilitate the most efficient model to use for the workload.
The combination of Shmueli and Zhou fails to explicitly teach registering, by the computer, a second stage of the first training job to train the first deep learning model after the early stopping of the first stage of the first training job; and the condition “while executing the second stage of the first training job” recited in each of the two “in response to receiving” limitations.  However, Wesolowski teaches registering, by the computer, a second stage of the first training job (e.g. transferred to a second system) to train the first deep learning model after the early stopping (e.g. at a checkpoint) of the first stage of the first training job (see ¶ [0022] “ . . . Check-points may be determined during the training of a machine learning model and irrespective of the type of machine(s) on which the machine learning model is being trained. A check-point (e.g., on operation state, which may include, for example, an iteration number, weight parameter values, intermediate calculation values, etc.) may record sufficient information to resume the training process at a later time (or on a different machine) continuing from the check-point, as if training were (temporarily) halted at a point in time following the recording of the check-point. Additionally, the check-point may include information regarding the architectural characteristics of the computing system (which may include one or more computing machines) on which the machine learning model is being trained. For example, if training of the ML model were halted on a first computing system having a first computer architecture (or a first number of computing machines), then training of the same machine learning model may be transferred to a second computing system having a second computer architecture (or second number of computing machines) different than the first computer architecture (or first number of computing machines), and resumed at a specified check-point (e.g., at the most current check-point).”);
while executing the second stage of the first training job (see ¶ [0025] “ . . . a scheduler machine (or master ML control system) may be in charge of transferring the training of ML models across multiple different computing machines. For example, the scheduler machine may routinely audit available computing resources and selectively transfer the training of a particular machine learning model from one machine to a second, faster machine or to a machine whose computer architecture is more closely aligned with the computing requirements of the particular ML model. For example, if the scheduler detects that a peak-usage period for a machine, or system, on which an ML model is being trained is approaching or has occurred, the scheduler may instruct the machine to save the current state of the ML training at a check-point and assign the training task to another machine that is available. . . ”), and while executing the second stage of the first training job (see ¶ [0026] “ . . . the scheduler machine may distribute execution of a single machine learning model across multiple different computing machines, so that each computing machine trains a different portion (e.g., graph-segment) of the ML model and the different computing machines exchange processing data, as needed. In this case, the scheduler machine may monitor the performance of each computing machine, and if necessary, transfer execution of a portion of the machine learning model from one machine to a faster or slower machine, as necessary, to maintain optimal timing between the transferring of processing data between the machines (e.g., to minimize wait time by one machine waiting for another machine to reach a point where a check-point may be created or to complete transferring of processing data) . . . ”).
It would have been obvious to one with ordinary skill in the art before the effective filing date of the applicant’s application to incorporate a system and method for training machine learning models and, when a triggering event occurs, halting a stage of training and transferring to a different computer system to provide a second stage of training, as taught by Wesolowski, into a system and method for receiving a plurality of pairs of queries associated with a database, and at a training stage, training a machine learning model on a training set comprising: (i) the plurality of pairs of queries, and (ii) labels associated with containment rates between each of the pairs of queries over the database; and at an inference stage, applying the trained machine learning model to a pair of target queries, and wherein a neural network library to be searched according to a descending order of recognition accuracy on the training data set to obtain a first neural network sequence set, and taking first M neural networks in the first neural network sequence set as a first neural network set to be trained; performing first-stage training on the first neural network set to be trained by using the training data set, wherein the number of training cycles of the first-stage training is a second preset value; and taking a neural network with a number of trained cycles of the sum of the first preset value and the second preset value in the neural network library to be searched as a target neural network, and to estimate containment rates between the target pair of queries over the database, as taught by the combination of Shmueli and Zhou.  Such incorporation enables scheduling of machine learning model training stages across different systems.
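For illustration only, the check-point and resume behavior relied upon from Wesolowski ¶ [0022] may be sketched as follows. This is a minimal sketch and is not code from any cited reference; the names Checkpoint and train, the machine labels, and the toy objective are all hypothetical and are assumed solely to show that training halted at a check-point may be resumed elsewhere as if it had not been interrupted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Checkpoint:
    """Hypothetical check-point record: iteration number, weights, and host machine."""
    iteration: int
    weights: List[float]
    machine: str


def train(weights: List[float], start: int, stop: int, machine: str,
          lr: float = 0.1) -> Checkpoint:
    """Toy gradient descent on f(w) = sum(w_i ** 2) for iterations [start, stop).

    Returns a check-point holding enough state to resume later.
    """
    w = list(weights)
    for _ in range(start, stop):
        # The gradient of w_i ** 2 is 2 * w_i.
        w = [wi - lr * 2.0 * wi for wi in w]
    return Checkpoint(iteration=stop, weights=w, machine=machine)


initial = [1.0, -2.0]

# First stage on a first machine, halted at iteration 5.
ckpt = train(initial, 0, 5, "machine-A")

# Training is transferred to a second machine and resumed from the check-point.
resumed = train(ckpt.weights, ckpt.iteration, 10, "machine-B")

# Reference run that was never interrupted.
uninterrupted = train(initial, 0, 10, "machine-A")
```

Under these assumptions, the resumed run ends at the same iteration and with the same weights as the uninterrupted run, which is the property the quoted passage describes.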
In regard to claim 8, Shmueli teaches a computer program product for efficient use of computing resources in two stage training of multiple training jobs to train deep learning models (see ¶ [0051] “. . . a specialized deep learning scheme may be used, which is configured to represent pairs of SQL queries. Experiments conducted by the present inventors on a real-world database, have shown that the present disclosure for estimating cardinalities, using containment rates between queries, realizes significant improvements over known cardinality estimation methods . . . “), the computer program product comprising one or more computer-readable tangible storage devices and program instructions stored on at least one of the one or more computer-readable tangible storage devices, the program instructions executable to (see ¶ [0359] “ . . . The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention. . . .”):
execute, by a computer, a first stage of a first training job (e.g. training stage) to train a first deep learning model (see ¶ [0006] as described for the rejection of claim 1 and is incorporated herein);
finish, by the computer, the first stage of the first training job, by using early stopping (see ¶ ¶ [0086-0089] as described for the rejection of claim 1);
in response to the first stage of the first training job being finished (see ¶ [0073], ¶ [0093] as described for the rejection of claim 1), 
execute, by the computer, the second stage of the first training job with a small number of epochs (see ¶¶ [0290-0293] as described for the rejection of claim 1);
Shmueli  fails to explicitly teach register, by the computer, a  second stage of the first training job to train the first deep learning model after the early stopping of the first stage of the first training job; 
in response to receiving, while executing the second stage of the first training job,  a registration of a first stage of a second training job to train a second deep learning model, finish, by the computer, the small number of epochs in the second stage of the first training job and executing the first stage of the second training job; and in response to receiving, while executing the second stage of the first training job  a registration of a second stage of the second training job that trains the second deep learning job after the early stopping of the first stage of the second training job and has a higher priority than the second stage of the first training job, finish, by the computer, the small number of epochs in the second stage of the first training job and executing the second stage of the second training job.
However, Zhou teaches in response to receiving, a registration of a first stage of a second training job to train a second deep learning model (e.g. first stage training on first neural network), finish, by the computer, the small number of epochs (e.g. iteration) in the second stage of the first training job and executing the first stage of the second training job (e.g. first stage training on fourth neural network) (see ¶ [0011] as described for the rejection of claim 1 and is incorporated herein); and in response to receiving, a registration of a second stage of the second training job that trains the second deep learning job after the early stopping of the first stage of the second training job (see ¶ [0026] as described for the rejection of claim 1 and is incorporated herein) and has a higher priority (e.g. recognition accuracy) than the second stage of the first training job (see ¶ [0027] as described for the rejection of claim 1 and is incorporated herein), finish, by the computer, the small number of epochs in the second stage of the first training job and executing the second stage of the second training job (see ¶ [0028] as described for the rejection of claim 1 and is incorporated herein).
The motivation to combine Zhou with Shmueli is described for the rejection of claim 1 and is incorporated herein.
The combination of Shmueli and Zhou fails to explicitly teach register, by the computer, a second stage of the first training job to train the first deep learning model after the early stopping of the first stage of the first training job; while executing the second stage of the first training job; while executing the second stage of the first training job. However, Wesolowski teaches register, by the computer, a second stage of the first training job (e.g. transferred to a second system) to train the first deep learning model after the early stopping (e.g. at a checkpoint) of the first stage of the first training job (see ¶ [0022] as described for the rejection of claim 1 and is incorporated herein), while executing the second stage of the first training job (see ¶ [0025] as described for the rejection of claim 1 and is incorporated herein), while executing the second stage of the first training job (see ¶ [0026] as described for the rejection of claim 1 and is incorporated herein).
The motivation to combine Wesolowski with the combination of Shmueli and Zhou is described for the rejection of claim 1 and is incorporated herein.
In regard to claim 15, Shmueli teaches a computer system for efficient use of computing resources in two stage training of multiple training jobs to train deep learning models, the computer system comprising (see ¶ [0051] as described for the rejection of claim 1 and is incorporated herein):
one or more processors, one or more computer readable tangible storage devices, and program instructions stored on at least one of the one or more computer readable tangible storage devices for execution by at least one of the one or more processors (see ¶ [0359] as described for the rejection of claim 1 and is incorporated herein), the program instructions executable to (see ¶ [0365] “ . . . The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks . . .”):
execute, by a computer, a first stage of the first training job (e.g. training stage)  to train a first deep learning model (see ¶ [0006] as described for the rejection of claim 1 and is incorporated herein);
finish, by the computer, the first stage of the first training job, by using early stopping (see ¶ ¶ [0086-0089] as described for the rejection of claim 1 and is incorporated herein);
in response to the first stage of the first training job being finished (see ¶ [0073], ¶ [0093] as described for the rejection of claim 1 and is incorporated herein), 
execute, by a computer, the second stage training job with a small number of epochs (see ¶¶ [0290-0293] as described for the rejection of claim 1 and is incorporated herein);
Shmueli  fails to explicitly teach register, by the computer, a  second stage of the first training job to train the first deep learning model after the early stopping of the first stage of the first training job; 
in response to receiving, while executing the second stage of the first training job,  a registration of a first stage of a second training job to train a second deep learning model, finish, by the computer, the small number of epochs in the second stage of the first training job and executing the first stage of the second training job; and in response to receiving, while executing the second stage of the first training job  a registration of a second stage of the second training job that trains the second deep learning job after the early stopping of the first stage of the second training job and has a higher priority than the second stage of the first training job, finish, by the computer, the small number of epochs in the second stage of the first training job and executing the second stage of the second training job.
However, Zhou teaches in response to receiving, a registration of a first stage of a second training job to train a second deep learning model (e.g. first stage training on first neural network), finish, by the computer, the small number of epochs (e.g. iteration) in the second stage of the first training job and executing the first stage of the second training job (e.g. first stage training on fourth neural network) (see ¶ [0011] as described for the rejection of claim 1 and is incorporated herein); and in response to receiving, a registration of a second stage of the second training job that trains the second deep learning job after the early stopping of the first stage of the second training job (see ¶ [0026] as described for the rejection of claim 1 and is incorporated herein) and has a higher priority (e.g. recognition accuracy) than the second stage of the first training job (see ¶ [0027] as described for the rejection of claim 1 and is incorporated herein), finish, by the computer, the small number of epochs in the second stage of the first training job and executing the second stage of the second training job (see ¶ [0028] as described for the rejection of claim 1 and is incorporated herein).
The motivation to combine Zhou with Shmueli is described for the rejection of claim 1 and is incorporated herein.
The combination of Shmueli and Zhou fails to explicitly teach register, by the computer, a second stage of the first training job to train the first deep learning model after the early stopping of the first stage of the first training job; while executing the second stage of the first training job; while executing the second stage of the first training job. However, Wesolowski teaches register, by the computer, a second stage of the first training job (e.g. transferred to a second system) to train the first deep learning model after the early stopping (e.g. at a checkpoint) of the first stage of the first training job (see ¶ [0022] as described for the rejection of claim 1 and is incorporated herein), while executing the second stage of the first training job (see ¶ [0025] as described for the rejection of claim 1 and is incorporated herein), while executing the second stage of the first training job (see ¶ [0026] as described for the rejection of claim 1 and is incorporated herein).
The motivation to combine Wesolowski with the combination of Shmueli and Zhou is described for the rejection of claim 1 and is incorporated herein.
Claims 2, 5 - 7, 9, 12 - 14, 16, and 19 - 20 are rejected under 35 U.S.C. 103 as being unpatentable over Shmueli et al. (U.S. 2021/0056108 A1; herein referred to as Shmueli) in view of Zhou et al. (U.S. 2021/01216854 A1; herein referred to as Zhou) in further view of Wesolowski et al. (U.S. 2019/0114537 A1; herein referred to as Wesolowski) as applied to claims 1, 8, and 15 in further view of Dirac et al. (U.S. 2020/0151606 A1; herein referred to as Dirac)
In regard to claim 2, the combination of Shmueli, Zhou and Wesolowski fails to explicitly teach determining, by the computer, whether a predetermined condition of the early stopping is met for the first stage of the first training job; in response to determining the predetermined condition of the early stopping being met, taking, by the computer, a snapshot of the first stage of the first training job; and recording, by the computer, training epochs and one or more metrics of the first stage of the first training job.  However, Dirac teaches determining, by the computer, whether a predetermined condition (e.g. condition to trigger) of the early stopping is met for the first stage of the first training job (see ¶ [0020] “ . . . For an initial stage or sub-phase of the training phase of the model, a training coordinator may assign a first subset of the execution platforms available. In various embodiments, the training coordinator may also identify one or more conditions which are to trigger, prior to the completion of the training phase, a deployment of a different subset of the plurality of execution platform . . .”);
in response to determining the predetermined condition of the early stopping being met (see ¶ [0022] “ . . . The training of the model may be ended based on any of various factors in different embodiments: e.g., if the cost function being optimized has met the optimization goal, if the entire training data set has been analyzed as many times as intended, if the maximum time set aside for training has elapsed, if the cumulative resource consumption of the training process has reached a threshold, or if a client's budget for training the model has been exhausted . . .”), taking, by the computer, a snapshot (e.g. repartition a portion of the training data) of the first stage of the first training job (see ¶ [0021] “ . . . at least a portion of the training data set may be repartitioned, e.g., so that respective portions of the training data are assigned to each of the execution platforms of the second subset . . .”); and
recording, by the computer, training epochs (see ¶ [0029] “ . . . the training phase may be deemed complete if a certain number of passes through the training data set have been completed, or a targeted time for training has elapsed . . .”; see also ¶ [0037] “ . . . Each complete pass through the training data set may be termed an “epoch” . . .”) and one or more metrics of the first stage of the first training job (see ¶ [0021] “ . . . If the first subset includes multiple platforms, the training data set may be partitioned among the members of the first subset in at least some embodiments. The training coordinator may collect various metrics as the first stage or first set of operations of the training phase progresses—e.g., metrics regarding the amount of parameter synchronization data being transferred, the fraction of the training data set that is yet to be examined during the current iteration or pass through the training data set, the extent of convergence that has been achieved towards the optimization goal being pursued, resource utilization levels at the execution platform pool members and/or the interconnect(s) being used for the synchronization data, and so on. . . .”).
It would have been obvious to one with ordinary skill in the art before the effective filing date of the applicant’s application to incorporate a system and method for recognizing, when implementing machine learning models, triggering conditions that cause deployment of the models to different sets of execution platforms, as taught by Dirac, into a system and method for receiving a plurality of pairs of queries associated with a database, and at a training stage, training a machine learning model on a training set comprising: (i) the plurality of pairs of queries, and (ii) labels associated with containment rates between each of the pairs of queries over the database; and at an inference stage, applying the trained machine learning model to a pair of target queries, and wherein a neural network library to be searched according to a descending order of recognition accuracy on the training data set to obtain a first neural network sequence set, and taking first M neural networks in the first neural network sequence set as a first neural network set to be trained; performing first-stage training on the first neural network set to be trained by using the training data set, wherein the number of training cycles of the first-stage training is a second preset value; and taking a neural network with a number of trained cycles of the sum of the first preset value and the second preset value in the neural network library to be searched as a target neural network, and to estimate containment rates between the target pair of queries over the database, and further detecting a triggering event and halting a stage of training and transferring to a different computer system to provide a second stage of training, as taught by the combination of Shmueli, Zhou and Wesolowski.  Such incorporation enables the system to recognize when training of the learning model needs to be stopped or moved to a new execution platform.
In regard to claim 5, the combination of Shmueli, Zhou, Wesolowski and Dirac teaches determining, by the computer, whether a computing resource is available (see Dirac  ¶ [0018] “ . . . at least for some training algorithms, the relative amounts of network bandwidth resources required, versus the computation resources required at the execution platforms, may change over the course of a given training phase of a given model. Instead of using the same number of execution platforms throughout the training phase, in at least some scenarios it may be useful to change the number of execution platforms (and/or the types of execution platforms) deployed at various stages of the training phase. Using such a dynamic scaling technique, the total amount of time (and/or the total resource usage) for training the model may be reduced or minimized in various embodiments . . .”);
in response to determining the computing resource being available, proceeding, by the computer, with execution of the first stage of the first  training job (see Dirac  ¶ [0021] “ . . . The first subset of the execution platforms may then be activated to initiate the training phase in various embodiments. If the first subset includes multiple platforms, the training data set may be partitioned among the members of the first subset in at least some embodiments. The training coordinator may collect various metrics as the first stage or first set of operations of the training phase progresses—e.g., metrics regarding the amount of parameter synchronization data being transferred, the fraction of the training data set that is yet to be examined during the current iteration or pass through the training data set, the extent of convergence that has been achieved towards the optimization goal being pursued, resource utilization levels at the execution platform pool members and/or the interconnect(s) being used for the synchronization data, and so on. In some embodiments the training coordinator may detect, e.g., using some of the metrics collected, that one or more of the triggering conditions for a deployment change has been met. In such a scenario, a second subset of the plurality of execution platforms may be identified, to be used for at least a second stage or a second set of operations of the training phase. The second subset may include, for example, a different number of execution platforms, or at least some platforms which differ in performance or functional capabilities from one or more platforms of the first subset. . . “);
in response to determining the computing resource being not available (see Dirac  ¶ [0017] “ . . . either the optimization goal has been achieved to within a desired level of proximity, or the resources available for training the model have been exhausted . . .”) , determining, by the computer, whether all computing resources are used by one or more other first stage training jobs (see Dirac  ¶ [0019] “ . . . an indication of a request to train a machine learning model using a specified training data set may be received at one or more computing devices responsible for coordinating model training and/or for making parallelism-related resource deployment decisions. Such computing devices may be referred to herein as training coordinators or parallelism decision nodes. . . .”);
in response to determining the all computing resources being used by the one or more other first stage training jobs (see Dirac  ¶ [0019] above), delaying, by the computer, the first stage of the first training job until one of the one or more other first stage training jobs is finished (see Dirac  ¶ [0041] “ . . . The training coordinator may monitor the progress of the different EPs in the depicted embodiment (represented by shaded portions 515), and detect at approximately time T1 that EPs 150A and 150C have made much less progress through their partitions than EPs 150B and 150D. If the difference between the relative amounts of progress made by different EPs exceeds a threshold, this may trigger a deployment change 555 in the depicted embodiment. For example, an addition EP 150E may be assigned to share the processing of the as-yet-unexamined portion of original partition A with EP 150A, and an additional EP 150F may be assigned to share the processing of the as-yet-unexamined original partition C with EP 150C. The unexamined portion of partition A may thus in effect be divided into two new partitions A2.1 and A2.2 assigned to EPs 150A and 150E respectively, and similarly the unexamined portion of partition C may be divided into two new partitions C2.1 and C2.2 assigned to EPs 150C and 150F respectively. In at least some embodiments, the training coordinator may not necessarily implement a deployment change such as DC 555 on the basis of the lagging progress of one or more EPs—instead, additional factors such as a reduction in bandwidth demand may also be taken into account. Thus, in some embodiments, new EPs may not be assigned in the middle of a given epoch unless at least some threshold reduction in data transfer bandwidth demand has occurred, even if the current set of EPs differ greatly from one another in their relative progress through their partitions . . . “);
in response to determining not all computing resources being used by the one or more other first stage training jobs (see Dirac  ¶ [0019] above), delaying, by the computer, the first stage of the training job until a final epoch of a currently executed second stage training job is finished (see Dirac  ¶ [0040] “ . . . instead of waiting until an epoch is completed to make a deployment change, in some embodiments a training coordinator may alter the EP set of a training phase during the course of an epoch under certain conditions. FIG. 5 illustrates an example of a dynamic scaling technique in which deployment changes may be implemented within one or more epochs of a training phase, according to at least some embodiments. As in the example scenario shown in FIG. 4, a training data set 510 is divided at time T0 into four partitions for an initial deployment 552 of four EPs, 150A-150D. The four EPs proceed to update the model using the respective observation records of partitions A, B, C and D at different rates. . . “).
The motivation to combine Dirac with the combination of Shmueli, Zhou and Wesolowski is described for the rejection of claim 2 and is incorporated herein.  Additionally, Dirac enables the training jobs to be scheduled in accordance with the availability of the resources.
In regard to claim 6, the combination of Shmueli, Zhou, Wesolowski and Dirac teaches determining, by the computer, whether a computing resource is available (see Dirac ¶ [0018] as described for the rejection of claim 5 and is incorporated herein);
in response to determining the computing resource being available, proceeding, by the computer, with execution of the second stage of the first training job (see Dirac  ¶ [0022] “ . . . The second subset of the execution platforms may then be activated (and, depending on the overlap between the first subset and the second subset, one or more execution platforms of the first subset may be de-activated or released for other uses). The training coordinator may resume monitoring metrics pertaining to the progress of the training phase. As needed, if the triggering conditions for deployment changes are met, additional changes to the set of execution platforms deployed for the training phase may be made over time. Eventually, the goals of the training phase may be reached, and the training phase may be terminated. . . .”); 
in response to determining the computing resource being not available (see Dirac ¶ [0017] as described for the rejection of claim 5 and is incorporated herein), determining, by the computer, whether all computing resources are used by one or more first stage training jobs (see Dirac ¶ [0019] as described for the rejection of claim 5 and is incorporated herein);
in response to determining the all computing resources being used by the one or more first stage training jobs (see Dirac ¶ [0019] above), delaying, by the computer, the second stage of the first training job until one of the one or more first stage training jobs is finished (see Dirac ¶ [0042] “ . . . aggregated or average resource utilization levels within the EP pool may be taken into account when making at least some types of deployment changes. Considerations of resource utilization (or possible exhaustion of resource capacity) may be especially important when a model is being trained using multi-tenant or shared resources, as may occur at a machine learning service such as that illustrated in FIG. 9 and discussed below. FIG. 6 illustrates an example of a dynamic scaling technique in which deployment changes may be implemented during a training phase based on resource utilization levels of an execution platform pool, according to at least some embodiments. . . .”);
in response to determining not all computing resources being used by the one or more first stage training jobs (see Dirac  ¶ [0019] above), determining, by the computer, whether a currently executed second stage training job has a lower priority than the second stage of the first training job (see Dirac  ¶ [0046]” . . .  a training coordinator may make deployment change decisions based on any desired combination of factors of the kinds discussed with respect to FIG. 3-FIG. 6, and may not necessarily be restricted to considerations of a fixed number of factors. Furthermore, in at least some embodiments, other factors than those illustrated herein may be taken into account (e.g., in addition to or instead of the factors discussed with respect to FIG. 3-FIG. 6). For example, in some embodiments clients may indicate relative priorities for different model training tasks, or may indicate respective budgets for training different models, and the priorities and/or budgets may influence the deployment decisions made during the course of a given training phase . . . .”);
in response to determining the currently executed second stage training job having the lower priority (see Dirac  ¶ [0046] above), delaying, by the computer, the second stage of the first training job until a final epoch of the currently executed second stage training job is finished (see Dirac  ¶ [0050] “ . . . potential deployment change proposals may be voted on when an EP has completed specified amounts of processing: e.g., when each EP has completed processing 25%, 50% or 75% of the training data assigned to it. In one embodiment, the EPs may vote at least at one or more epoch boundaries. The voting decisions of the respective EPs may be based, for example, on the EP's local metrics regarding bandwidth demands and/or other resource demands. . . .”) ; and
in response to determining the currently executed second stage training job not having the lower priority (see Dirac  ¶ [0046] above), delaying, by the computer, the second stage of the first training job until one of following conditions is met: one of the one or more other first stage training jobs is finished and an upper limit of training epochs of the currently executed second stage training job is reached (see Dirac  ¶ [0060]  “ . . . One or more triggering conditions for deployment changes (e.g., changes to the number and/or type of platforms to be used for at least some of the remaining operations of the training phase at the time the change decision is made) may be identified (element 1010), along with the particular changes to be made if and when the conditions are met (e.g., whether the number of platforms is to be decreased or increased, whether different classes of platforms are to be deployed after the change, and so on). Sources of the data that are to be used to make the deployment change decisions may also be identified, such as training progress or epoch completion monitors, resource utilization monitors, network monitors and the like. In some embodiments, respective triggering conditions for different deployment changes may be identified, and some of the conditions may have other conditions as prerequisites. For example, a first condition C1 which is to lead to a deployment change DC1 may be identified, and a second condition which is to lead to a different deployment change DC2 only if DC1 has already been implemented may be identified. In some embodiments, only the triggering conditions may be identified, and the specific changes to be made to the execution platform set of the conditions are met may be determined after the conditions are met . . .”).
The motivation to combine Dirac with the combination of Shmueli,  Zhou and Wesolowski is described for the rejection of claim 2 and is incorporated herein.  Additionally, Dirac enables the training jobs to be scheduled in accordance with the availability of the resources and in accordance with the priority of the training job.
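For illustration only, the order of determinations recited in claim 6 may be sketched as the following decision function. This is a minimal sketch and is not code from the application or from any cited reference. The names Action, RunningJob and schedule_second_stage are hypothetical, and the sketch assumes that each running job occupies exactly one computing resource, that a larger number denotes a higher priority, and that the new job is compared against the lowest-priority second stage job currently executing.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List


class Action(Enum):
    RUN_NOW = auto()
    WAIT_FOR_FIRST_STAGE_JOB = auto()
    WAIT_FOR_FINAL_EPOCH = auto()
    WAIT_FOR_FIRST_STAGE_JOB_OR_EPOCH_LIMIT = auto()


@dataclass
class RunningJob:
    stage: int      # 1 = first stage, 2 = second stage
    priority: int   # assumption: larger value means higher priority


def schedule_second_stage(new_priority: int, running: List[RunningJob],
                          total_resources: int) -> Action:
    """Decide what to do with a newly registered second stage training job."""
    # A computing resource is available: proceed with execution.
    if len(running) < total_resources:
        return Action.RUN_NOW
    # All computing resources are used by first stage jobs: wait for one to finish.
    if all(job.stage == 1 for job in running):
        return Action.WAIT_FOR_FIRST_STAGE_JOB
    # Otherwise at least one second stage job is currently executing.
    lowest = min(job.priority for job in running if job.stage == 2)
    if lowest < new_priority:
        # The executing second stage job has lower priority: wait for its final epoch.
        return Action.WAIT_FOR_FINAL_EPOCH
    # Otherwise wait for a first stage job to finish or for the epoch upper limit.
    return Action.WAIT_FOR_FIRST_STAGE_JOB_OR_EPOCH_LIMIT
```

For example, under these assumptions a new second stage job of priority 5 arriving while two resources are occupied by a first stage job and a second stage job of priority 3 would wait for the final epoch of the executing second stage job.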
In regard to claim 7, the combination of Shmueli, Zhou, Wesolowski and Dirac teaches in response to determining the second stage of the first training job having a highest priority among second stage training jobs (see Dirac  ¶ [0046]  “ . . . a training coordinator may make deployment change decisions based on any desired combination of factors of the kinds discussed with respect to FIG. 3-FIG. 6, and may not necessarily be restricted to considerations of a fixed number of factors. Furthermore, in at least some embodiments, other factors than those illustrated herein may be taken into account (e.g., in addition to or instead of the factors discussed with respect to FIG. 3-FIG. 6). For example, in some embodiments clients may indicate relative priorities for different model training tasks, or may indicate respective budgets for training different models, and the priorities and/or budgets may influence the deployment decisions made during the course of a given training phase . . .”), a plurality of computing resources being available, and no first stage training job being executed on the computing resources (see Dirac  ¶ [0062] “ . . . Various types of metrics that may influence deployment changes may be collected (e.g., once every T seconds), and the progress of the training towards one or more training goals may be tracked (element 1016). If the collected data indicates that one or more of the triggering conditions has been met (as detected in element 1019), and sufficient resources are available to make a corresponding deployment change, a different set of execution platforms may be selected from the pool of platforms (element 1022) and the training data set may be repartitioned for the next stage of training if necessary . . . “) , 
executing, by the computer, the first second-stage training job in parallel on more than one computing resource (see Dirac  ¶ [0058] “ . . . a parallelizable or parallel training technique may be selected for the model based on various factors. In some cases the request may indicate the particular technique to be used, for example. In some embodiments the technique may be selected from a library of available techniques, e.g., based on the kind of model to be generated, one or more knowledge base entries, the size of the training data set, and/or the number of execution platforms that are currently available for use. In some embodiments, a selected training technique may have the property that the amount of synchronization data (e.g., gradient data used to coordinate model parameter updates) that has to be transferred among the participating parallel platforms generally tends to decrease as an optimization goal of the technique is approached—e.g., as the optimization converges on a solution such as a minimization of an error function, fewer adjustments typically have to be made to the model parameters. . . .”).
The motivation to combine Dirac with the combination of Shmueli,  Zhou and Wesolowski is described for the rejection of claim 2 and is incorporated herein.  Additionally, Dirac enables parallel processing of training workloads.
In regard to claim 9, the combination of Shmueli, Zhou and Wesolowski fails to explicitly teach determine, by the computer, whether a predetermined condition of the early stopping is met for the first stage of the first training job; in response to determining the predetermined condition of the early stopping being met, take, by the computer, a snapshot of the first stage of the first training job; and record, by the computer, training epochs and one or more metrics of the first stage of the first training job.  However Dirac teaches determine, by the computer, whether a predetermined condition (e.g. condition to trigger) of the early stopping is met for the first stage of the first training job (see ¶ [0020] as described for the rejection of claim 2 and is incorporated herein);
in response to determining the predetermined condition of the early stopping being met (see ¶ [0022] as described for the rejection of claim 2 and is incorporated herein), take, by the computer, a snapshot (e.g. repartition a portion of the training data)  of the first stage of the first training job (see ¶ [0021] as described for the rejection of claim 2 and is incorporated herein); and
record, by the computer, training epochs (see ¶ [0029], ¶ [0037] as described for the rejection of claim 2 and is incorporated herein) and one or more metrics of the first stage of the first training job (see ¶ [0021] as described for the rejection of claim 2 and is incorporated herein).
The motivation to combine Dirac with the combination of Shmueli, Zhou and Wesolowski is described for the rejection of claim 2 and is incorporated herein.
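For illustration only, the sequence recited in claim 9 (check an early stopping condition, take a snapshot, record epochs and metrics) can be sketched as follows. The sketch is not drawn from Dirac or from the application: the class name `FirstStageJob`, the patience-based stopping condition, and the snapshot layout are all assumptions chosen to make the steps concrete.

```python
import copy
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FirstStageJob:
    # Hypothetical container; every field name here is illustrative only.
    weights: dict
    patience: int = 2                      # assumed "predetermined condition"
    epochs_done: int = 0
    best_loss: float = float("inf")
    stale_epochs: int = 0
    history: list = field(default_factory=list)
    snapshot: Optional[dict] = None

    def end_of_epoch(self, val_loss: float) -> bool:
        """Record one finished epoch; return True when early stopping is met."""
        self.epochs_done += 1
        # Record the training epoch and its metric.
        self.history.append({"epoch": self.epochs_done, "val_loss": val_loss})
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
        # Assumed condition: no improvement for `patience` consecutive epochs.
        if self.stale_epochs >= self.patience:
            # Snapshot the model state so a later stage could resume from it.
            self.snapshot = {
                "weights": copy.deepcopy(self.weights),
                "epochs": self.epochs_done,
                "metrics": list(self.history),
            }
            return True
        return False


job = FirstStageJob(weights={"w": 1.0})
stopped = [job.end_of_epoch(loss) for loss in (0.9, 0.8, 0.85, 0.81)]
```

In this sketch the snapshot is taken only once the stopping condition is met, which mirrors the order of the claimed steps.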
In regard to claim 12, the combination of Shmueli, Zhou, Wesolowski and Dirac teaches determine, by the computer, whether a computing resource is available (see Dirac ¶ [0018] as described for the rejection of claim 5 and is incorporated herein);
in response to determining the computing resource being available, proceed, by the computer, with execution of the first stage of the first training job (see Dirac ¶ [0021] as described for the rejection of claim 5 and is incorporated herein);
in response to determining the computing resource being not available (see Dirac ¶ [0017] as described for the rejection of claim 5 and is incorporated herein), determine, by the computer, whether all computing resources are used by one or more other first stage training jobs (see Dirac ¶ [0019] as described for the rejection of claim 5 and is incorporated herein);
in response to determining the all computing resources being used by the one or more other first stage training jobs (see Dirac ¶ [0019] above), delay, by the computer, the first stage of the first training job until one of the one or more other first stage training jobs is finished (see Dirac ¶ [0041] as described for the rejection of claim 5 and is incorporated herein);
in response to determining not all computing resources being used by the one or more other first stage training jobs (see Dirac  ¶ [0019] above), delay, by the computer, the first stage of the first training job until a final epoch of a currently executed second stage training job is finished  (see Dirac  ¶ [0040] as described for the rejection of claim 5 and is incorporated herein).
The motivation to combine Dirac with the combination of Shmueli, Zhou and Wesolowski is described for the rejection of claim 5 and is incorporated herein.
In regard to claim 13, the combination of Shmueli, Zhou, Wesolowski and Dirac teaches determine, by the computer, whether a computing resource is available (see Dirac  ¶ [0018] as described for the rejection of claim 5 and is incorporated herein);
in response to determining the computing resource being available, proceed, by the computer, with execution of the second stage of the first training job  (see Dirac  ¶ [0022] as described for the rejection of claim 6 and is incorporated herein);
in response to determining the computing resource being not available (see Dirac  ¶ [0017] as described for the rejection of claim 5 and is incorporated herein), determine, by the computer, whether all computing resources are used by one or more first stage training jobs (see Dirac  ¶ [0019] as described for the rejection of claim 5 and is incorporated herein);
in response to determining the all computing resources being used by the one or more first stage training jobs (see Dirac ¶ [0019] above), delay, by the computer, the second stage of the first training job until one of the one or more first stage training jobs is finished (see Dirac ¶ [0042] as described for the rejection of claim 6 and is incorporated herein);
in response to determining not all computing resources being used by the one or more first stage training jobs (see Dirac ¶ [0019] above), determine, by the computer, whether a currently executed second stage training job has a lower priority than the second stage of the first training job (see Dirac ¶ [0046] as described for the rejection of claim 6 and is incorporated herein);
in response to determining the currently executed second stage training job having the lower priority (see Dirac ¶ [0046] above), delay, by the computer, the second stage of the first training job until a final epoch of the currently executed second stage training job is finished (see Dirac ¶ [0050] as described for the rejection of claim 6 and is incorporated herein); and
in response to determining the currently executed second stage training job not having the lower priority (see Dirac ¶ [0046] above), delay, by the computer, the second stage of the first training job until one of following conditions is met: one of the one or more other first stage training jobs is finished and an upper limit of training epochs of the currently executed second stage training job is reached (see Dirac ¶ [0060] as described for the rejection of claim 6 and is incorporated herein).
The motivation to combine Dirac with the combination of Shmueli, Zhou and Wesolowski is described for the rejection of claim 6 and is incorporated herein.
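For illustration only, the branching recited in claim 13 for the second stage of the first training job can be read as the decision procedure sketched below. The function name, the string labels, and the convention that a larger number means a higher priority are assumptions made for the sketch; none of them is taken from Dirac or from the application.

```python
def schedule_second_stage(free, total, busy_first_stage,
                          new_priority, running_priority):
    """Return the action for a newly registered second-stage job.

    free             -- number of idle computing resources (assumed counter)
    total            -- total number of computing resources
    busy_first_stage -- resources currently running first-stage jobs
    new_priority     -- priority of the job being scheduled
    running_priority -- priority of the second-stage job currently executing
    Larger numbers mean higher priority (an assumed convention).
    """
    # A resource is available: proceed with execution.
    if free > 0:
        return "run_now"
    # Every resource is held by first-stage jobs: wait for one to finish.
    if busy_first_stage == total:
        return "wait_for_first_stage_job_to_finish"
    # Otherwise a second-stage job occupies a resource; compare priorities.
    if running_priority < new_priority:
        # The running job has the lower priority: it yields after its final epoch.
        return "wait_for_final_epoch_of_running_second_stage_job"
    # The running job does not have the lower priority.
    return "wait_for_first_stage_finish_or_epoch_upper_limit"
```

For example, with two resources, one held by a first-stage job and one by a second-stage job of priority 1, a new second-stage job of priority 5 waits only for the running job's final epoch.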
In regard to claim 14, the combination of Shmueli, Zhou, Wesolowski and Dirac teaches in response to determining the second stage of the first training job having a highest priority among second stage training jobs (see Dirac  ¶ [0046] as described for the rejection of claim 7 and is incorporated herein), a plurality of computing resources being available, and no first stage training job being executed on the computing resources (see Dirac  ¶ [0062] as described for the rejection of claim 7 and is incorporated herein), execute, by the computer, the first second-stage training job in parallel on more than one computing resource (see Dirac  ¶ [0058] as described for the rejection of claim 7 and is incorporated herein).
The motivation to combine Dirac with the combination of Shmueli, Zhou and Wesolowski is described for the rejection of claim 7 and is incorporated herein.
In regard to claim 16, the combination of Shmueli, Zhou and Wesolowski fails to explicitly teach determine, by the computer, whether a predetermined condition of the early stopping is met for the first stage of the first training job; in response to determining the predetermined condition of the early stopping being met, take, by the computer, a snapshot of the first stage of the first training job; and record, by the computer, training epochs and one or more metrics of the first stage of the first training job.  However Dirac teaches determine, by the computer, whether a predetermined condition (e.g. condition to trigger) of the early stopping is met for the first stage of the first training job (see ¶ [0020] as described for the rejection of claim 2 and is incorporated herein);
in response to determining the predetermined condition of the early stopping being met (see ¶ [0022] as described for the rejection of claim 2 and is incorporated herein), take, by the computer, a snapshot (e.g. repartition a portion of the training data)  of the first stage of the first training job (see ¶ [0021] as described for the rejection of claim 2 and is incorporated herein); and
record, by the computer, training epochs (see ¶ [0029], ¶ [0037] as described for the rejection of claim 2 and is incorporated herein) and one or more metrics of the first stage of the first training job (see ¶ [0021] as described for the rejection of claim 2 and is incorporated herein).
The motivation to combine Dirac with the combination of Shmueli, Zhou and Wesolowski is described for the rejection of claim 2 and is incorporated herein.
In regard to claim 19, the combination of Shmueli, Zhou, Wesolowski and Dirac teaches determine, by the computer, whether a computing resource is available (see Dirac ¶ [0018] as described for the rejection of claim 5 and is incorporated herein);
in response to determining the computing resource being available, proceed, by the computer, with execution of the first stage of the first training job (see Dirac ¶ [0021] as described for the rejection of claim 5 and is incorporated herein);
in response to determining the computing resource being not available (see Dirac ¶ [0017] as described for the rejection of claim 5 and is incorporated herein), determine, by the computer, whether all computing resources are used by one or more other first stage training jobs (see Dirac ¶ [0019] as described for the rejection of claim 5 and is incorporated herein);
in response to determining the all computing resources being used by the one or more other first stage training jobs (see Dirac  ¶ [0019] above), delay, by the computer, the first stage of the first training job until one of the one or more other first stage training jobs is finished (see Dirac  ¶ [0041] as described for the rejection of claim 5 and is incorporated herein);
in response to determining not all computing resources being used by the one or more other first stage training jobs (see Dirac  ¶ [0019] above), delay, by the computer, the first stage of the first training job until a final epoch of a currently executed second stage training job is finished (see Dirac  ¶ [0040] as described for the rejection of claim 5 and is incorporated herein).
The motivation to combine Dirac with the combination of Shmueli, Zhou and Wesolowski is described for the rejection of claim 5 and is incorporated herein.
In regard to claim 20, the combination of Shmueli, Zhou, Wesolowski and Dirac teaches determine, by the computer, whether a computing resource is available (see Dirac  ¶ [0018] as described for the rejection of claim 5 and is incorporated herein);
in response to determining the computing resource being available, proceed, by the computer, with execution of the second stage of the first training job (see Dirac  ¶ [0022] as described for the rejection of claim 6 and is incorporated herein);
in response to determining the computing resource being not available (see Dirac  ¶ [0017] as described for the rejection of claim 5 and is incorporated herein), determine, by the computer, whether all computing resources are used by one or more first stage training jobs (see Dirac  ¶ [0019] as described for the rejection of claim 5 and is incorporated herein);
in response to determining the all computing resources being used by the one or more first stage training jobs (see Dirac  ¶ [0019] above), delay, by the computer, the second stage of the first  training job until one of the one or more first stage training jobs is finished (see Dirac  ¶ [0042] as described for the rejection of claim 6 and is incorporated herein);
in response to determining not all computing resources being used by the one or more first stage training jobs (see Dirac  ¶ [0019] above), determine, by the computer, whether a currently executed second stage training job has a lower priority than the second stage of the first  training job (see Dirac  ¶ [0046] as described for the rejection of claim 6 and is incorporated herein);
in response to determining the currently executed second stage training job having the lower priority (see Dirac  ¶ [0046] above), delay, by the computer, the second stage of the first training job until a final epoch of the currently executed second stage training job is finished (see Dirac  ¶ [0050] as described for the rejection of claim 6 and is incorporated herein); and
in response to determining the currently executed second stage training job not having the lower priority (see Dirac  ¶ [0046] above), delay, by the computer, the second stage of the first training job until one of following conditions is met: one of the one or more other first stage training jobs is finished and an upper limit of training epochs of the currently executed second stage training job is reached (see Dirac  ¶ [0060] as described for the rejection of claim 6 and is incorporated herein).
The motivation to combine Dirac with the combination of Shmueli, Zhou and Wesolowski is described for the rejection of claim 6 and is incorporated herein.
Claims 3, 10, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Shmueli et al. (U.S. 2021/0056108 A1; herein referred to as Shmueli) in view of Zhou et al. (U.S. 2021/01216854 A1; herein referred to as Zhou) in further view of Wesolowski et al. (U.S. 2019/0114537 A1; herein referred to as Wesolowski) as applied to claims 1, 8, and 15 in further view of Wang et al. (U.S. 2017/0228645 A1; herein referred to as Wang).
In regard to claim 3, the combination of Shmueli, Zhou, and Wesolowski teaches finishing the small number of epochs in the second stage of the first training job (see Zhou ¶ [0028] as described for the rejection of claim 1 and is incorporated herein).
The combination of Shmueli, Zhou and Wesolowski fails to explicitly teach  further comprising: determining, by the computer, whether a final epoch of the second stage of the first training job is finished; in response to determining the final epoch of the second stage of the first training job being finished, taking, by the computer, a snapshot of the second stage of the first training job; recording, by the computer, training epochs and one or more metrics of the second stage of the first  training job; and registering, by the computer, a further second stage of the first training job that follows the  second stage of the first training job, wherein the further second stage of the first  training job that follows the first second-stage training job trains the first deep learning model that has been trained in the second stage of the first training job.  However Wang teaches further comprising: determining, by the computer, whether a final epoch of the second stage of the first training job is finished (see ¶ [0044] “ . . . Returning to FIG. 1, it may be observed that at Block 101, image data and a network to be trained are input. A maximum number of training epochs are specified and then in blocks 102 and 103, this maximum number of training epochs are completed . . .”);
in response to determining the final epoch of the second stage of the first training job being finished, taking, by the computer, a snapshot (e.g. batch) of the second stage of the first training job (see ¶ [0045] “ . . . an epoch is one forward pass and one backward pass of all the training examples. A batch size is the number of training examples in one forward/backward pass. The larger the batch size, the more memory is required. Finally, an iteration is a pass, and a number of iterations is the number of passes. Each pass using [batch size] number of examples. To be clear, one pass is one forward pass and one backward pass. The forward and backward passes are not counted independently . . .”);
recording, by the computer, training epochs and one or more metrics (e.g. parameters) of the  second stage of the first training job (see ¶ [0046] “ . . . As the training progresses, the training is applied on a batch and appropriate learning parameters are updated . . . “); and
registering, by the computer, a further second stage of the first training job that follows the second stage of the first training job (see ¶ [0046] “ . . . Upon completion, another batch is fetched and training is applied on that batch as well . . .”), wherein the further second stage of the first training job trains the deep learning model that has been trained in the second stage of the first training job (see ¶ ¶ [0046-0047] “ . . . This process continues until there are no more batches to process.  At block 103, the learned parameters are tested using test data and this overall process is repeated until the number of epochs reaches the maximum. At block 104, the output parameters are output after training . . .”).
It would have been obvious to one with ordinary skill in the art before the effective filing date of the applicant’s application to incorporate a system and method for training a convolutional neural network using an inconsistent stochastic gradient descent (ISGD) algorithm, where the training batches used by the ISGD algorithm are dynamically adjusted according to a determined loss for a given training batch, the batches being classified into two sub-states (well-trained or under-trained), and the ISGD algorithm provides more iterations for under-trained batches while reducing iterations for well-trained ones, as taught by Wang, into a system and method for receiving a plurality of pairs of queries associated with a database, and at a training stage, training a machine learning model on a training set comprising: (i) the plurality of pairs of queries, and (ii) labels associated with containment rates between each of the pairs of queries over the database; and at an inference stage, applying the trained machine learning model to a pair of target queries, and wherein a neural network library to be searched according to a descending order of recognition accuracy on the training data set to obtain a first neural network sequence set, and taking first M neural networks in the first neural network sequence set as a first neural network set to be trained; performing first-stage training on the first neural network set to be trained by using the training data set, wherein the number of training cycles of the first-stage training is a second preset value; and taking a neural network with a number of trained cycles of the sum of the first preset value and the second preset value in the neural network library to be searched as a target neural network, and to estimate containment rates between the target pair of queries over the database, and further detecting a triggering event and halting a stage of training and transferring to a different computer system to provide a second stage of training as taught by the combination of Shmueli, Zhou and Wesolowski.  Such incorporation enables adjustments to be made to the number of epochs or iterations between the training stages.
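For illustration only, the epoch and batch loop described in the passages quoted from Wang ¶ ¶ [0044-0047] (train on each batch, test the learned parameters, repeat until the maximum number of epochs) can be sketched as below. The one-parameter linear model, the squared-error loss, and the learning rate are assumptions introduced for the sketch; they do not reproduce Wang's ISGD algorithm.

```python
def train(batches, test_set, max_epochs, lr=0.05):
    """Plain mini-batch gradient descent on y = w * x (illustrative model)."""
    w = 0.0
    log = []  # one (epoch, test_loss) record per finished epoch
    for epoch in range(1, max_epochs + 1):
        # One epoch: a forward/backward pass over every batch.
        for batch in batches:
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad  # update the learned parameter after each batch
        # After the last batch, test the learned parameter on held-out data.
        test_loss = sum((w * x - y) ** 2 for x, y in test_set) / len(test_set)
        log.append((epoch, test_loss))
    # The loop ends once the specified maximum number of epochs is completed.
    return w, log


batches = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)]]
test_set = [(4.0, 8.0)]
w, log = train(batches, test_set, max_epochs=20)
```

The per-epoch log is the kind of record from which a later stage could determine that the final epoch has finished.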
In regard to claim 10, the combination of Shmueli,  Zhou, Wesolowski and Wang teaches determine, by the computer, whether a final epoch of the second stage of the first training job is finished (see Wang ¶ [0044] as described for the rejection of claim 3 and is incorporated herein);
in response to determining the final epoch of the second stage of the first training job being finished, take, by the computer, a snapshot (e.g. batch) of the second stage of the first training job (see Wang ¶ [0045] as described for the rejection of claim 3 and is incorporated herein);
record, by the computer, training epochs and one or more metrics of the second stage of the first training job (see Wang ¶ [0046] as described for the rejection of claim 3 and is incorporated herein); and
register, by the computer, a further second stage of the first training job that follows the second stage of the first training job (see Wang ¶ [0046] as described for the rejection of claim 3 and is incorporated herein), wherein the further second stage of the first training job trains the first deep learning model that has been trained in the second stage of the first training job (see Wang ¶ ¶ [0046-0047] as described for the rejection of claim 3 and is incorporated herein).
The motivation to combine Wang with the combination of Shmueli, Zhou and Wesolowski is described for the rejection of claim 3 and is incorporated herein.
In regard to claim 17, the combination of Shmueli,  Zhou, Wesolowski and Wang teaches determine, by the computer, whether a final epoch of the second stage of the first training job is finished  (see Wang ¶ [0044] as described for the rejection of claim 3 and is incorporated herein);
in response to determining the final epoch of the second stage of the first training job being finished, take, by the computer, a snapshot  (e.g. batch) of the second stage of the first training job  (see Wang ¶ [0045] as described for the rejection of claim 3 and is incorporated herein);
record, by the computer, training epochs and one or more metrics of the second stage of the first training job  (see Wang ¶ [0046] as described for the rejection of claim 3 and is incorporated herein); and 
register, by the computer, a further second stage of the first training job that follows the second stage of the first training job (see Wang ¶ [0046] as described for the rejection of claim 3 and is incorporated herein), wherein the further second stage of the first training job trains the first deep learning model that has been trained in the second stage of the first training job (see Wang ¶ ¶ [0046-0047] as described for the rejection of claim 3 and is incorporated herein).
The motivation to combine Wang with the combination of Shmueli, Zhou and Wesolowski is described for the rejection of claim 3 and is incorporated herein.
Claims 4, 11, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Shmueli et al. (U.S. 2021/0056108 A1; herein referred to as Shmueli) in view of Zhou et al. (U.S. 2021/01216854 A1; herein referred to as Zhou) in further view of Wesolowski et al. (U.S. 2019/0114537 A1; herein referred to as Wesolowski) in further view of Wang et al. (U.S. 2017/0228645 A1; herein referred to as Wang) as applied to claims 3, 10, and 17 in further view of Zhang et al. (U.S. 2020/0175384 A1; herein referred to as Zhang).
In regard to claim 4, the combination of Shmueli , Zhou , Wesolowski and Wang fails to explicitly teach determining, by the computer, whether an upper limit of the training epochs is reached; in response to determining the upper limit of the training epochs being reached, stopping, by the computer, the second stage of the first training job; and in response to determining the upper limit of the training epochs being not reached, executing, by the computer, a next epoch of the second stage of the first training job.  However Zhang teaches determining, by the computer, whether an upper limit of the training epochs is reached (see ¶ [0103] “ . . . the system interactively selects the best model using the following method. The system receives a training time upper limit and other information, for example, from a user, and uses the received information to determine best number of training . . .”) ; 
in response to determining the upper limit of the training epochs being reached (see ¶ [0103] “ . . . the system is given the training time upper limit “t” and other information, e.g., the GPU model, training data availability, model size, and network bandwidth etc. The system then determines the number of epochs to use to train for the model using the formula:
$$\text{num\_epochs} = \frac{t - \text{size}_{\text{model}} \times \text{speed}_{\text{download}} - \text{time}_{\text{construct training set}}}{\text{GPU\_time\_per\_epoch}}$$ . . .”), stopping, by the computer, the second stage of the first training job (see ¶ [0104] “ . . . The system may measure the training loss and stop the training iterations when the loss is smaller than a threshold. For example, the system may use a very small validation set and select the number of training iterations by using a threshold on validation loss/accuracy to avoid overfitting . . .”); and in response to determining the upper limit of the training epochs being not reached, executing, by the computer, a next epoch of the second stage of the first training job (see ¶ [0104] “ . . . Considering the trade-off between accuracy and time, the system considers two hyper-parameters on which the performance the system mainly depends (a) the number of web-crawled training images and (b) the number of iterations for the algorithm. For (a) the number of images, in one embodiment, the system collects about 100 images for the new class. It is observed that this number is a good sweet spot in accuracy/time tradeoff but it is possible to use less examples (e.g. 50) to improve speed. For (b) number of training iterations, in one embodiment, number of iterations is fixed (between 5-10) . . .”).
It would have been obvious to one with ordinary skill in the art before the effective filing date of the applicant’s application to incorporate a system and method for incremental learning  without forgetting in image classification and/or object detection, as taught by Zhang into a system and method for receiving a plurality of pairs of queries associated with a database, and at a training stage, training a machine learning model on a training set comprising: (i) the plurality of pairs of queries, and (ii) labels associated with containment rates between each of the pairs of queries over the database; and at an inference stage, applying the trained machine learning model to a pair of target queries, and wherein a neural network library to be searched according to a descending order of recognition accuracy on the training data set to obtain a first neural network sequence set, and taking first M neural networks in the first neural network sequence set as a first neural network set to be trained; performing first-stage training on the first neural network set to be trained by using the training data set, wherein the number of training cycles of the first-stage training is a second preset value; and taking a neural network with a number of trained cycles of the sum of the first preset value and the second preset value in the neural network library to be searched as a target neural network, and to estimate containment rates between the target pair of queries over the database, and further detecting a triggering event and halting a stage of  training and transferring to a different computer system to provide a second stage of training while also dynamically adjusting  training batches as taught by the combination of Shmueli, Zhou, Wesolowski and Wang.  Such incorporation puts limits on a number of training iterations. 
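For illustration only, the epoch budget in the formula quoted from Zhang ¶ [0103] and the upper-limit check recited in claim 4 can be sketched together as below. The function names, the rounding down to a whole number of epochs, and the reading of the download term as a time cost per unit of model size are assumptions made for the sketch; the quoted passage does not state them.

```python
import math


def epoch_budget(t, size_model, speed_download,
                 time_construct_training_set, gpu_time_per_epoch):
    """Number of epochs that fit in the time limit t, per the quoted formula.

    Assumption: size_model * speed_download is a time, i.e. speed_download
    is expressed as seconds per unit of model size.
    """
    remaining = t - size_model * speed_download - time_construct_training_set
    # Rounding down and clamping at zero are assumptions of this sketch.
    return max(0, math.floor(remaining / gpu_time_per_epoch))


def run_second_stage(upper_limit, train_one_epoch):
    """Execute epochs until the upper limit of training epochs is reached."""
    epochs = 0
    while epochs < upper_limit:       # limit not reached: run the next epoch
        train_one_epoch(epochs)
        epochs += 1
    return epochs                     # limit reached: the stage stops


seen = []
limit = epoch_budget(t=600, size_model=100, speed_download=0.5,
                     time_construct_training_set=50, gpu_time_per_epoch=60)
done = run_second_stage(limit, seen.append)
```

With these assumed inputs, 500 seconds remain after download and training-set construction, so eight 60-second epochs fit within the limit.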
In regard to claim 11, the combination of Shmueli, Zhou, Wang, Wesolowski and Zhang teaches  determine, by the computer, whether an upper limit of the training epochs is reached (see Zhang ¶ [0103] as described for the rejection of claim 4 and is incorporated herein);
in response to determining the upper limit of the training epochs being reached (see Zhang ¶ [0103] as described for the rejection of claim 4 and is incorporated herein), stop, by the computer, the  second stage of the first training job (see Zhang ¶ [0104] as described for the rejection of claim 4 and is incorporated herein); and
in response to determining the upper limit of the training epochs being not reached, execute, by the computer, a next epoch of the second stage of the first training job (see Zhang ¶ [0104] as described for the rejection of claim 4 and is incorporated herein).
The motivation to combine Zhang with the combination of Shmueli, Zhou, Wesolowski and Wang is described for the rejection of claim 4 and is incorporated herein.
In regard to claim 18, the combination of Shmueli, Zhou, Wesolowski, Wang, and Zhang teaches determine, by the computer, whether an upper limit of the training epochs is reached (see Zhang ¶ [0103] as described for the rejection of claim 4 and is incorporated herein);
in response to determining the upper limit of the training epochs being reached (see Zhang ¶ [0103] as described for the rejection of claim 4 and is incorporated herein), stop, by the computer, the second stage of the first training job (see Zhang ¶ [0104] as described for the rejection of claim 4 and is incorporated herein); and
in response to determining the upper limit of the training epochs being not reached, execute, by the computer, a next epoch of the second stage of the first training job (see Zhang ¶ [0104] as described for the rejection of claim 4 and is incorporated herein).
The motivation to combine Zhang with the combination of Shmueli, Zhou, Wesolowski and Wang is described for the rejection of claim 4 and is incorporated herein.
  Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure.  It is listed on the PTO-892 accompanying this action.
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to JAMES N FIORILLO whose telephone number is (571)272-9909.  The examiner can normally be reached on 7:30 - 5 PM Mon - Fri.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, John A. Follansbee, can be reached on 571-272-3964.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/JAMES N FIORILLO/Examiner, Art Unit 2444