DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statements (IDS) submitted on 01/17/2018, 10/25/2018, and 05/16/2019 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.
Drawings
The drawings are objected to as failing to comply with 37 CFR 1.84(p)(4) because reference characters "166" in Fig. 2 and "170" in specification paragraphs [0039]-[0040] have both been used to designate driving assistance; reference characters "170" in Fig. 2 and "172" in specification paragraphs [0039]-[0040] have both been used to designate autonomous driving.  Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.
The drawings are objected to as failing to comply with 37 CFR 1.84(p)(4) because reference character “166” has been used to designate both parking assistance and driving assistance in Fig. 2; reference character “170” has been used to designate both autonomous driving in Fig. 2 and driving assistance in specification paragraphs [0039]-[0040]; reference character “526” has been used to designate both the increment counter operational step in specification paragraph [0070] and the determine if counter is less than 0 operational step in specification paragraph [0071].  Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.
The drawings are objected to as failing to comply with 37 CFR 1.84(p)(5) because they do not include the following reference sign(s) mentioned in the description: 172 used in specification paragraphs [0039]-[0040]; 158 used in specification paragraph [0076].  Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.
The drawings are objected to as failing to comply with 37 CFR 1.84(p)(5) because they include the following reference character(s) not mentioned in the description: 528 used in Fig. 5A.  Corrected drawing sheets in compliance with 37 CFR 1.121(d), or amendment to the specification to add the reference character(s) in the description in compliance with 37 CFR 1.121(b) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.
Specification
The abstract of the disclosure is objected to because of the following minor informalities:
In lines 5-6, “are generated based a sample data set” should read “are generated based on a sample data set”
Correction is required.  See MPEP § 608.01(b).

The disclosure is objected to because of the following informalities: 
In specification paragraph [0049], “are shown connected to the output layer 350 indirectly by via the merger layer 340” should read “are shown connected to the output layer 350 indirectly via the merger layer 340”
Appropriate correction is required.
Double Patenting
A rejection based on double patenting of the “same invention” type finds its support in the language of 35 U.S.C. 101 which states that “whoever invents or discovers any new and useful process... may obtain a patent therefor...” (Emphasis added). Thus, the term “same invention,” in this context, means an invention drawn to identical subject matter. See Miller v. Eagle Mfg. Co., 151 U.S. 186 (1894); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Ockert, 245 F.2d 467, 114 USPQ 330 (CCPA 1957).
A statutory type (35 U.S.C. 101) double patenting rejection can be overcome by canceling or amending the claims that are directed to the same invention so they are no longer coextensive in scope. The filing of a terminal disclaimer cannot overcome a double patenting rejection based upon 35 U.S.C. 101.
Claims 1, 2, 4, 6, 7, 9-13, 17, 18, and 20-23 are provisionally rejected under 35 U.S.C. 101 as claiming the same invention as that of claims 1, 2, 4, 6, 7, 9-13, 17, 18, and 20-23 of copending Application No. 16/248,543 (reference application). This is a provisional statutory double patenting rejection since the claims directed to the same invention have not in fact been patented.
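Claim comparison (for each claim, the text of the instant application is listed first, followed by the corresponding claim of US Application No. 16/248,543, the reference application):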
Claim 1:

A system, comprising: 
a processor; 
a memory coupled to the processor, the memory storing executable instructions that, when executed by the processor, cause the processor to:
receive a sample data set D {(si, ai, si+1,ri)}, wherein si; is a current state of the object in the environment, ai is the action chosen for the current state, si+1 is a subsequent state of the object and the environment and ri is a reward value for taking an action, ai, in a state, si, the value of which is determined in accordance with a reward function;
apply, to the sample data set, a multi-layer neural network, each layer in the multi-layer neural network comprising a plurality of nodes, each node in each layer having a corresponding weight, wherein the neural network is configured to:
(i) generate a first set of policy values Q(si, ai) for each state-action pair si, ai in the sample data set D using an action-value function denoted the Q function;
(ii) generate a second set of policy values Q (si+1,a) for each subsequent state si+1, for all tuples in the sample data set D for each action in the set of all possible actions using the Q function;
(iii) generate an approximate action-value function, denoted the Q* function, from the first set of policy values Q(si,ai) for the current state si and the action ai selected for the current state si and the second set of policy values Q (si+1,a) for the subsequent state si+1 after the selected action ai;
(iv) generate a training target for the neural network using the Q* function;
(v) calculate a training error as the difference between the training target and the policy value Q (si,ai) for the corresponding state-action pair in the sample data set D; and
(vi) update at least some of the parameters of the neural network to minimize the training error.
Claim 1:

A system, comprising: 
a processor; 
a memory coupled to the processor, the memory storing executable instructions that, when executed by the processor, cause the processor to:
receive a sample data set D {(si, ai, si+1,ri)}, wherein si; is a current state of the object in the environment, ai is the action chosen for the current state, si+1 is a subsequent state of the object and the environment and ri is a reward value for taking an action, ai, in a state, si, the value of which is determined in accordance with a reward function;
apply, to the sample data set, a multi-layer neural network, each layer in the multi-layer neural network comprising a plurality of nodes, each node in each layer having a corresponding weight, wherein the neural network is configured to:
(i) generate a first set of policy values Q(si, ai) for each state-action pair si, ai in the sample data set D using an action-value function denoted the Q function;
(ii) generate a second set of policy values Q (si+1,a) for each subsequent state si+1, for all tuples (si, ai, si+1,ri) in the sample data set D for each action in the set of all possible actions using the Q function;
(iii) generate an approximate action-value function, denoted the Q* function, from the first set of policy values Q(si,ai) for the current state si and the action ai selected for the current state si and the second set of policy values Q (si+1,a) for the subsequent state si+1 after the selected action ai;
(iv) generate a training target for the neural network using the Q* function;
(v) calculate a training error as the difference between the training target and the policy value Q (si,ai) for the corresponding state-action pair in the sample data set D; and
(vi) update at least some of the parameters of the neural network to minimize the training error.
Claim 2:

The system of claim 1, wherein the operations (iii) to (vi) are repeated for each tuple in the sample data set D,
Claim 2:

The system of claim 1, wherein the operations (iii) to (vi) are repeated for each tuple (si, ai, si+1,ri) in the sample data set D,
Claim 4:

The system of claim 2, wherein the weight vector ω represents the weights of the nodes of the output layer of the neural network.
Claim 4:

The system of claim 2, wherein the weight vector ω represents the weights of the nodes of the output layer of the neural network.
Claim 6:

The system of claim 1, wherein the at least some of the parameters of the neural network are updated using a gradient descent that minimizes a mean square error (MSE) between the training target and the policy value Q(si,ai) for the corresponding state-action pair in the sample data set D.
Claim 6:

The system of claim 1, wherein the at least some of the parameters of the neural network are updated using a gradient descent that minimizes a mean square error (MSE) between the training target and the policy value Q(si,ai) for the corresponding state-action pair in the sample data set D.
Claim 7:

The system of claim 6, wherein the MSE is minimized using a least mean square (LMS) algorithm.
Claim 7:

The system of claim 6, wherein the MSE is minimized using a least mean square (LMS) algorithm.
Claim 9:

The system of claim 1, wherein the state of the object in the environment is sensed using one or more of cameras, LIDAR and RADAR, wherein the current state of the object in the environment is described by one or more of images, LIDAR measurements and RADAR measurements.
Claim 9:

The system of claim 1, wherein the state of the object in the environment is sensed using one or more of cameras, LIDAR and RADAR, wherein the current state of the object in the environment is described by one or more of images, LIDAR measurements and RADAR measurements.
Claim 10:

The system of claim 1, wherein the action comprises any one or a combination of a steering angle for a steering unit, a throttle value for a throttle unit and braking value for a braking unit.
Claim 10:

The system of claim 1, wherein the action comprises any one or a combination of a steering angle for a steering unit, a throttle value for a throttle unit and braking value for a braking unit.
Claim 11:

The system of claim 1, wherein the object is a vehicle, robot or drone.
Claim 11:

The system of claim 1, wherein the object is a vehicle, robot or drone.
Claim 12:

A method of training a neural network, comprising: 
(i) generating a first set of policy values Q(si,ai) for each state-action pair si, ai in a sample data set D {(si, ai, si+1,ri)} using an action-value function denoted the Q function, wherein si is a current state of the object in the environment, ai is the action chosen for the current state, si+1 is a subsequent state of the object and the environment and ri is a reward value for taking an action, ai, in a state, si, the value of which is determined in accordance with a reward function;
(ii) generating a second set of policy values Q (si+1,a) for each subsequent state si+1 for all tuples in the sample data set D for each action in the set of all possible actions using the Q function;

(iii) generating an approximate action-value function, denoted the Q* function, from the first set of policy values Q(si,ai) for the current state si and the action ai selected for the current state si and the second set of policy values Q (si+1, a) for the subsequent state si+1 after the selected action ai;  
(iv) generating a training target for the neural network using the Q* function;
(v) calculating a training error as the difference between the training target and the policy value Q (si, ai) for the corresponding state-action pair in the sample data set D; and
(vi) updating at least some of the parameters of the neural network to minimize the training error.
Claim 12:

A method of training a neural network, comprising: 
(i) generating a first set of policy values Q(si,ai) for each state-action pair si, ai in a sample data set D {(si, ai, si+1,ri)} using an action-value function denoted the Q function, wherein si is a current state of the object in the environment, ai is the action chosen for the current state, si+1 is a subsequent state of the object and the environment and ri is a reward value for taking an action, ai, in a state, si, the value of which is determined in accordance with a reward function;
(ii) generating a second set of policy values Q (si+1,a) for each subsequent state si+1 for all tuples (si, ai, si+1,ri) in the sample data set D for each action in the set of all possible actions using the Q function;
(iii) generating an approximate action-value function, denoted the Q* function, from the first set of policy values Q(si,ai) for the current state si and the action ai selected for the current state si and the second set of policy values Q (si+1, a) for the subsequent state si+1 after the selected action ai;
(iv) generating a training target for the neural network using the Q* function;
(v) calculating a training error as the difference between the training target and the policy value Q (si,ai) for the corresponding state-action pair in the sample data set D; and
(vi) updating at least some of the parameters of the neural network to minimize the training error.
Claim 13:

The method of claim 12, wherein the operations (iii) to (vi) are repeated for each tuple in the sample data set D,
Claim 13:

The method of claim 12, wherein the operations (iii) to (vi) are repeated for each tuple (si, ai, si+1,ri) in the sample data set D.
Claim 17:

The method of claim 12, wherein the at least some of the parameters of the neural network are updated using a gradient descent that minimizes a mean square error (MSE) between the training target and the policy value Q(si,ai) for the corresponding state-action pair in the sample data set D.
Claim 17:

The method of claim 12, wherein the at least some of the parameters of the neural network are updated using a gradient descent that minimizes a mean square error (MSE) between the training target and the policy value Q(si,ai) for the corresponding state-action pair in the sample data set D.
Claim 18:

The method of claim 17, wherein the MSE is minimized using a least mean square (LMS) algorithm.
Claim 18:

The method of claim 17, wherein the MSE is minimized using a least mean square (LMS) algorithm.
Claim 20:

The method of claim 12, wherein the state of the object in the environment is sensed using one or more of cameras, LIDAR and RADAR, wherein the current state of the object in the environment is described by one or more of images, LIDAR measurements and RADAR measurements.
Claim 20:

The method of claim 12, wherein the state of the object in the environment is sensed using one or more of cameras, LIDAR and RADAR, wherein the current state of the object in the environment is described by one or more of images, LIDAR measurements and RADAR measurements.
	
Claim 21:

The method of claim 12, wherein the action comprises any one or a combination of a steering angle for a steering unit, a throttle value for a throttle unit and braking value for a braking unit.
Claim 21:

The method of claim 12, wherein the action comprises any one or a combination of a steering angle for a steering unit, a throttle value for a throttle unit and braking value for a braking unit.
Claim 22:

The method of claim 12, wherein the object is a vehicle, robot or drone.
Claim 22:

The method of claim 12, wherein the object is a vehicle, robot or drone.
Claim 23:

A non-transitory machine readable medium having tangibly stored thereon executable instructions for execution by a processor of a computing device, wherein the executable instructions, when executed by the processor of the computing device, cause the computing device to:
(i) generate a first set of policy values Q(si, ai) for each state-action pair si, ai in a sample data set D {(si, ai, si+1,ri)} using an action-value function denoted the Q function, wherein si is a current state of the object in the environment, ai is the action chosen for the current state, si+1 is a subsequent state of the object and the environment and ri is a reward value for taking an action, ai, in a state, si, the value of which is determined in accordance with a reward function;
(ii) generate a second set of policy values Q (si+1,a) for each subsequent state si,1 for all tuples in the sample data set D for each action in the set of all possible actions using the Q function;

(iii) generate an approximate action-value function, denoted the Q* function, from the first set of policy values Q(si,ai) for the current state si and the action ai selected for the current state si and the second set of policy values Q (si+1, a) for the subsequent state si+1 after the selected action a1;
(iv) generate a training target for the neural network using the Q* function;
(v) calculate a training error as the difference between the training target and the policy value Q (si,ai) for the corresponding state-action pair in the sample data set D; and
(vi) update at least some of the parameters of the neural network to minimize the training error.
Claim 23:

A non-transitory machine readable medium having tangibly stored thereon executable instructions for execution by a processor of a computing device, wherein the executable instructions, when executed by the processor of the computing device, cause the computing device to:  
(i) generate a first set of policy values Q(si, ai) for each state-action pair si, ai in a sample data set D {(si, ai, si+1,ri)} using an action-value function denoted the Q function, wherein si is a current state of the object in the environment, ai is the action chosen for the current state, si+1 is a subsequent state of the object and the environment and ri is a reward value for taking an action, ai, in a state, si, the value of which is determined in accordance with a reward function;
(ii) generate a second set of policy values Q (si+1,a) for each subsequent state si+1 for all tuples (si, ai, si+1,ri) in the sample data set D for each action in the set of all possible actions using the Q function;
(iii) generate an approximate action-value function, denoted the Q* function, from the first set of policy values Q(si,ai) for the current state si and the action ai selected for the current state si and the second set of policy values Q (si+1,a) for the subsequent state si+1 after the selected action ai;
(iv) generate a training target for the neural network using the Q* function;
(v) calculate a training error as the difference between the training target and the policy value Q (si,ai) for the corresponding state-action pair in the sample data set D; and
(vi) update at least some of the parameters of the neural network to minimize the training error.


As shown in the table above, instant claims 1, 2, 4, 6, 7, 9-13, 17, 18, and 20-23 are drawn to identical subject matter as reference claims 1, 2, 4, 6, 7, 9-13, 17, 18, and 20-23 (i.e. the scope of the claims is the same).
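For orientation only, operations (i) to (vi) recited in claims 1, 12, and 23 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: it assumes a linear Q function over a hypothetical feature map `phi`, and it assumes the training target takes the conventional form r + γ·max Q(si+1, a), which the claims themselves do not state. The names `phi`, `q_value`, `training_pass`, `gamma`, and `lr` are illustrative and do not appear in the application.

```python
import numpy as np

def phi(state, action, n_actions):
    """Hypothetical feature map: copy the state vector into the block for `action`."""
    out = np.zeros(n_actions * state.size)
    out[action * state.size:(action + 1) * state.size] = state
    return out

def q_value(w, state, action, n_actions):
    """Assumed linear Q function: Q(s, a) = phi(s, a) . w"""
    return phi(state, action, n_actions) @ w

def training_pass(w, samples, n_actions, gamma=0.9, lr=0.1):
    """One pass over a sample data set D of (s_i, a_i, s_next, r_i) tuples."""
    total_sq_err = 0.0
    for s, a, s_next, r in samples:
        # (i) policy value for the sampled state-action pair
        q_sa = q_value(w, s, a, n_actions)
        # (ii) policy values for the subsequent state over all possible actions
        q_next = [q_value(w, s_next, b, n_actions) for b in range(n_actions)]
        # (iii)-(iv) assumed target: reward plus discounted best next value
        target = r + gamma * max(q_next)
        # (v) training error: target minus current policy value
        error = target - q_sa
        # (vi) gradient step on the squared error (an LMS-style update)
        w = w + lr * error * phi(s, a, n_actions)
        total_sq_err += error ** 2
    return w, total_sq_err / len(samples)
```

Calling `training_pass` repeatedly on the same sample data set applies the update once per tuple on each pass, in the manner that claims 2 and 13 describe for operations (iii) to (vi).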

The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP § 2146 et seq. for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/process/file/efs/guidance/eTD-info-I.jsp.
Claims 3, 5, 8, 14-16, and 19 are provisionally rejected on the ground of nonstatutory double patenting as being unpatentable over claims 3, 5, 8, 14-16, and 19 of copending Application No. 16/248,543 (reference application). Although the claims at issue are not identical, they are not patentably distinct from each other because instant claims 3, 5, 8, 14-16, and 19 (the claims being examined) are “generic to a species or sub-genus claimed in a conflicting patent or application, i.e., the entire scope of the reference claim falls within the scope of the examined claim”. See MPEP 804(II)(B)(1).
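Claim comparison (for each claim, the text of the instant application is listed first, followed by the corresponding claim of US Application No. 16/248,543, the reference application):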
Claim 3:

The system of claim 1, wherein the neural network is configured to generate the Q* function by: 
initializing a matrix A and a vector b;
for each tuple in the sample data set D: 

selecting an action, a*, that results in maximum value of Q (si+1,a) from the set of all possible actions (a* = argmaxaQ (si+1,a)); and 



updating the value of the matrix A and the vector b using the following equations

[Equation image: media_image1.png]

wherein γ is a discount factor between 0 and 1; and 
calculating a weight vector ω according to the following equation:

[Equation image: media_image2.png]

Claim 3:

The system of claim 1, wherein the neural network is configured to generate the Q* function by: 

initializing a matrix A and a vector b;
for each tuple (si, ai, si+1,ri) in the sample data set D: 
selecting an action, a*, that results in maximum value of Q (si+1, a) from the set of all possible actions (a* = argmaxaQ (si+1, a)); 
generating a vector from an output layer of the neural network φ(si, ai), φ*(si+1, a*) using φ(s) and tabular action;
updating the value of the matrix A and the vector b using the following equations

[Equation image: media_image3.png]

wherein γ is a discount factor between 0 and 1; and 
calculating a weight vector ω according to the following equation:

[Equation image: media_image4.png]

Claim 5:

The system of claim 1, wherein the neural network is configured to generate a training target by: 
selecting an action, a*, that results in maximum value of Q (si,a)Tω from the set of all possible actions (a* = argmaxaQ(si,a)Tω); and
setting the training target for the neural network as Q (si,a*)Tω.
Claim 5:

The system of claim 1, wherein the neural network is configured to generate a training target by: 

selecting an action, a*, that results in maximum value of Q (si,a)Tω from the set of all possible actions (a* = argmaxaQ(si, a) = φ(si, a)Tω); and
setting the training target for the neural network as Q (si,a*) = φ(si,a*)Tω.
Claim 8:

The system of claim 6, wherein the MSE is defined in accordance with the following equation:

[Equation image: media_image5.png]

wherein n is the number of tuples in the sample data set D, Q*(si,a*)Tω) is the training target and Q (si,ai) is the policy value for the corresponding state-action pair in the sample data set D, and wherein the sum is first over the states in the sample data set and then over all the actions.
Claim 8:

The system of claim 6, wherein the MSE is defined in accordance with the following equation:

[Equation image: media_image6.png]


wherein n is the number of tuples in the sample data set D, φ(si,a*)Tω is the training target and Q (si,ai) is the policy value for the corresponding state-action pair in the sample data set D, and wherein the sum is first over the states in the sample data set and then over all the actions.
Claim 14:

The method of claim 12, wherein generating the Q* function comprises: 
initializing a matrix A and a vector b; 
for each tuple in the sample data set D:

selecting an action, a*, that results in maximum value of Q (si+1,a) from the set of all possible actions (a* = argmaxaQ (si+1,a)); and



updating the value of the matrix A and the vector b using the following equations

[Equation image: media_image7.png]

wherein γ is a discount factor between 0 and 1; and 
calculating a weight vector ω according to the following equation:

[Equation image: media_image8.png]

Claim 14:

The method of claim 12, wherein generating the Q* function comprises: 
initializing a matrix A and a vector b; 
for each tuple (si, ai, si+1,ri) in the sample data set D:
selecting an action, a*, that results in maximum value of Q (si+1, a) from the set of all possible actions (a* = argmaxaQ (si+1, a)); 
generating a vector from an output layer of the neural network φ(si, ai), φ*(si+1, a*) using φ(s) and tabular action; and
updating the value of the matrix A and the vector b using the following equations

[Equation image: media_image9.png]

wherein γ is a discount factor between 0 and 1; and 
calculating a weight vector ω according to the following equation:

[Equation image: media_image10.png]

Claim 15:

The method of claim 14, wherein the weight vector ω represents the weights of the nodes of the output layer of the neural network.
Claim 15:

The method of claim 14, wherein the weight vector ω represents the weights of the nodes of the output layer of the neural network.
Claim 16:

The method of claim 12, wherein generating the training target comprises: 
selecting an action, a*, that results in maximum value of Q (si,a)Tω from the set of all possible actions (a* = argmaxaQ(si,a)Tω); and 
setting the training target for the neural network as Q (si,a*)Tω.
Claim 16:

The method of claim 12, wherein generating the training target comprises: 
selecting an action, a*, that results in maximum value of φ(si,a)Tω from the set of all possible actions (a* = argmaxaQ(si,a) = φ(si,a)Tω); and
setting the training target for the neural network as Q (si,a*) = φ(si,a*)TW.
Claim 19:

The method of claim 17, wherein the MSE is defined in accordance with the following equation: 

[Equation image: media_image11.png]

wherein n is the number of tuples in the sample data set D, Q(si,a*)Tω is the training target and Q (si,ai) is the policy value for the corresponding state-action pair in the sample data set D, and wherein the sum is first over the states in the sample data set and then over all the actions.
Claim 19:

The method of claim 17, wherein the MSE is defined in accordance with the following equation:

[Equation image: media_image12.png]


wherein n is the number of tuples in the sample data set D, φ(si,a*)Tω is the training target and Q(si,ai) is the policy value for the corresponding state-action pair in the sample data set D, and wherein the sum is first over the states in the sample data set and then over all the actions.


As shown in the table above, all claimed features in instant claims 3, 5, 8, 14-16, and 19 are disclosed in reference claims 3, 5, 8, 14-16, and 19 (underlined elements). While the two sets of claims are not identical, instant claims 3, 5, 8, 14-16, and 19 are anticipated by reference claims 3, 5, 8, 14-16, and 19. It is evident from the table that all limitations in instant claims 3, 5, 8, 14-16, and 19 are linguistically comparable to the limitations in reference claims 3, 5, 8, 14-16, and 19 except for the use of φ values instead of Q values in some limitations, for which an explanation is provided below:
Reference claims 3, 5, 8, 14-16, and 19 recite the use of φ(s, a) and φ*(s, a) for generating the Q* function and for calculating the MSE. This feature anticipates the limitations of instant claims 3, 5, 8, 14-16, and 19, which use Q(s, a) and Q*(s, a) for the same purposes, because reference claims 3 and 14 disclose that φ(s, a) and φ*(s, a) are generated from the output of a neural network for a corresponding state-action pair, which is the same way Q(s, a) and Q*(s, a) are generated (i.e., φ(s, a) and φ*(s, a) correspond to Q(s, a) and Q*(s, a)).
This is a provisional nonstatutory double patenting rejection because the patentably indistinct claims have not in fact been patented.
Claim Objections
Claims 2, 4, and 13 are objected to because of the following informalities:
In claim 2, line 2, “in the sample data set D,” should read “in the sample data set D.”
In claim 13, line 2, “in the sample data set D,” should read “in the sample data set D.”
Each dependent claim of claim 2 is objected to based on the same rationale as the claim from which it depends.
Appropriate correction is required.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 1-23 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
	Claim 1 recites the limitation “the object” in line 6. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the object” has been interpreted as “an object”.
Claim 1 recites the limitation “the environment” in line 6. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the environment” has been interpreted as “an environment”.
Claim 1 recites the limitation “the action” in line 6. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the action” has been interpreted as “an action”.
Claim 1 recites the limitation “the value” in line 8. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the value” has been interpreted as “a value”.
Claim 1 recites the limitation “the Q function” in line 16. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the Q function” has been interpreted as “a Q function”.
Claim 1 recites the limitation “the set of all possible actions” in line 19. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the set of all possible actions” has been interpreted as “a set of all possible actions”.
Claim 1 recites the limitation “the Q* function” in lines 20-21. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the Q* function” has been interpreted as “a Q* function”.
Claim 1 recites the limitation “the difference” in line 27. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the difference” has been interpreted as “a difference”.
Claim 1 recites the limitation “the policy value Q(si, ai)” in line 28. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the policy value Q(si, ai)” has been interpreted as “a policy value Q(si, ai)”.
Claim 1 recites the limitation “the corresponding state-action pair” in lines 28-29. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the corresponding state-action pair” has been interpreted as “a corresponding state-action pair”.
Claim 1 recites the limitation “the parameters” in line 30. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the parameters” has been interpreted as “parameters”.
Claim 2 recites the limitation “the operations” in line 1. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the operations” has been interpreted as “operations”.
Claim 3 recites the limitation “the value” in line 7. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the value” has been interpreted as “value”.
Claim 4 recites the limitation “the weight factor ω” in line 1. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the weight factor ω” has been interpreted as “a weight factor ω”.
Claim 4 recites the limitation “the output layer” in line 2. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the output layer” has been interpreted as “an output layer”.
Claim 12 recites the limitation “the Q function” in lines 3-4. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the Q function” has been interpreted as “a Q function”.
Claim 12 recites the limitation “the object” in line 4. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the object” has been interpreted as “an object”.
Claim 12 recites the limitation “the environment” in line 4. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the environment” has been interpreted as “an environment”.
Claim 12 recites the limitation “the action” in line 4. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the action” has been interpreted as “an action”.
Claim 12 recites the limitation “the value” in line 6. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the value” has been interpreted as “a value”.
Claim 12 recites the limitation “the set of all possible actions” in lines 9-10. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the set of all possible actions” has been interpreted as “a set of all possible actions”.
Claim 12 recites the limitation “the Q* function” in line 11. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the Q* function” has been interpreted as “a Q* function”.
Claim 12 recites the limitation “the difference” in line 16. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the difference” has been interpreted as “a difference”.
Claim 12 recites the limitation “the policy value Q(si, ai)” in line 17. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the policy value Q(si, ai)” has been interpreted as “a policy value Q(si, ai)”.
Claim 12 recites the limitation “the corresponding state-action pair” in line 17. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the corresponding state-action pair” has been interpreted as “a corresponding state-action pair”.
Claim 12 recites the limitation “the parameters” in line 19. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the parameters” has been interpreted as “parameters”.
Claim 13 recites the limitation “the operations” in line 1. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the operations” has been interpreted as “operations”.
Claim 14 recites the limitation “the value” in line 6. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the value” has been interpreted as “a value”.
Claim 15 recites the limitation “the output layer” in line 2. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the output layer” has been interpreted as “an output layer”.
Claim 23 recites the limitation “the Q function” in lines 6-7. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the Q function” has been interpreted as “a Q function”.
Claim 23 recites the limitation “the object” in line 7. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the object” has been interpreted as “an object”.
Claim 23 recites the limitation “the environment” in line 7. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the environment” has been interpreted as “an environment”.
Claim 23 recites the limitation “the action” in line 7. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the action” has been interpreted as “an action”.
Claim 23 recites the limitation “the value” in line 9. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the value” has been interpreted as “a value”.
Claim 23 recites the limitation “the set of all possible actions” in lines 12-13. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the set of all possible actions” has been interpreted as “a set of all possible actions”.
Claim 23 recites the limitation “the Q* function” in line 14. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the Q* function” has been interpreted as “a Q* function”.
Claim 23 recites the limitation “the difference” in line 19. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the difference” has been interpreted as “a difference”.
Claim 23 recites the limitation “the policy value Q(si, ai)” in line 20. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the policy value Q(si, ai)” has been interpreted as “a policy value Q(si, ai)”.
Claim 23 recites the limitation “the corresponding state-action pair” in line 20. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the corresponding state-action pair” has been interpreted as “a corresponding state-action pair”.
Claim 23 recites the limitation “the parameters” in line 22. There is insufficient antecedent basis for this limitation in the claim. For examination purposes, “the parameters” has been interpreted as “parameters”.
	Each dependent claim is rejected based on the same rationale as the claim from which it depends.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1-23 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Regarding Claim 1,
Claim 1 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 1 is directed to a system, which is directed to a machine, one of the statutory categories.
Step 2A Prong One Analysis: The limitations:
“(i) generate a first set of policy values Q(si, ai) for each state-action pair si, ai in the sample data set D using an action-value function denoted the Q function”
“(ii) generate a second set of policy values Q (si+1,a) for each subsequent state si+1, for all tuples in the sample data set D for each action in the set of all possible actions using the Q function”
“(iii) generate an approximate action-value function, denoted the Q* function, from the first set of policy values Q(si, ai) for the current state si and the action ai selected for the current state si and the second set of policy values Q (si+1, a) for the subsequent state si+1 after the selected action ai”
“(iv) generate a training target for the neural network using the Q* function”
“(v) calculate a training error as the difference between the training target and the policy value Q (si, ai) for the corresponding state-action pair in the sample data set D”
“(vi) update at least some of the parameters of the neural network to minimize the training error”
As drafted, under their broadest reasonable interpretations, cover mental processes (concepts performed in the human mind (including an observation, evaluation, judgement, opinion)) and mathematical concepts (mathematical relationships, mathematical formulas or equations, mathematical calculations) but for the recitation of mere instructions to apply language (See MPEP 2106.05(f)) and insignificant extra-solution activity (See MPEP 2106.05(g)). The above limitations in the context of this claim encompass generating a first set of policy values (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can use an action value function (Q function) to generate a first set of policy values for each state-action pair in the sample data set), generating a second set of policy values (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can use an action value function (Q function) to generate a second set of policy values for each subsequent state in the sample data set), generating an approximate action-value function (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can use the first and second set of policy values for a selected action to generate an approximate action-value function (Q* function)), generating a training target (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can use the Q* function to generate a training target), calculating a training error (corresponds to mathematical calculation), and updating parameters of the neural network (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can update the neural network parameters so as to minimize the training error).
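For illustration only, the sequence of operations (i)-(vi) characterized above can be walked through with simple arithmetic. The following is a minimal sketch under stated assumptions, not the applicant's implementation: the neural network is replaced by a hypothetical lookup table, the names `q_value`, `training_errors`, and `update` are invented for this example, the discount factor of 0.9 is assumed, and a standard one-step Q-learning target (reward plus discounted maximum next value) stands in for the claimed Q* function and training target of operations (iii)-(iv).

```python
# Hypothetical toy setup: a lookup table stands in for the neural network
# so that every arithmetic step is visible.
GAMMA = 0.9        # assumed discount factor (not recited in claim 1)
ACTIONS = [0, 1]   # assumed set of all possible actions

def q_value(q_table, state, action):
    """Q function: policy value for a state-action pair (0.0 if unseen)."""
    return q_table.get((state, action), 0.0)

def training_errors(q_table, sample_data):
    """Compute one training error per tuple (s, a, s_next, r) in D."""
    errors = []
    for s, a, s_next, r in sample_data:
        q_sa = q_value(q_table, s, a)                             # operation (i)
        q_next = [q_value(q_table, s_next, b) for b in ACTIONS]   # operation (ii)
        # Stand-in for operations (iii)-(iv): one-step Q-learning target.
        target = r + GAMMA * max(q_next)
        errors.append(target - q_sa)                              # operation (v)
    return errors

def update(q_table, sample_data, learning_rate=0.5):
    """Operation (vi): move each stored value toward its target."""
    errs = training_errors(q_table, sample_data)
    for (s, a, _, _), err in zip(sample_data, errs):
        q_table[(s, a)] = q_value(q_table, s, a) + learning_rate * err
    return q_table

# Sample data set D of tuples (s_i, a_i, s_{i+1}, r_i).
D = [(0, 1, 1, 1.0), (1, 0, 0, 0.0)]
print(training_errors({}, D))  # prints [1.0, 0.0]
q = update({}, D)
```

Each step in the sketch is an evaluation or a calculation on a small set of numbers.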
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that are mere instructions to apply (See MPEP 2106.05(f)). The limitations:
“a processor”
“a memory coupled to the processor”
“apply, to the sample data set, a multi-layer neural network, each layer in the multi-layer neural network comprising a plurality of nodes, each node in each layer having a corresponding weight” 
As drafted, are additional elements that amount to no more than mere instructions to apply the exception for the abstract ideas. See MPEP 2106.05(f). The limitation:
“receive a sample data set D {(si, ai, si+1, ri)}, wherein si is a current state of the object in the environment, ai is the action chosen for the current state, si+1 is a subsequent state of the object and the environment and ri is a reward value for taking an action, ai, in a state, si, the value of which is determined in accordance with a reward function”
As drafted, is an additional element that corresponds to insignificant extra-solution activity. In particular, the additional element is merely directed towards receiving data. See MPEP 2106.05(g). Therefore, the additional elements do not integrate the abstract ideas into a practical application.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, all of the additional elements are “mere instructions to apply an exception” (i.e., the additional elements describe a processor, memory, and multi-layer neural network for applying the abstract ideas) or insignificant extra-solution activity (i.e., receiving data). Furthermore, the “receive …” limitation is insignificant extra-solution activity that is well-understood, routine, and conventional according to MPEP 2106.05(d) (“The courts have recognized the following computer functions as well-understood, routine, and conventional functions when they are claimed in a merely generic manner (e.g., at a high level of generality) or as insignificant extra-solution activity… i. Receiving or transmitting data over a network”). Mere instructions to apply an exception cannot provide an inventive concept. The claim is not patent eligible.

Regarding Claim 2,
Claim 2 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 2 is directed to a system, which is directed to a machine, one of the statutory categories.
Step 2A Prong One Analysis: The limitation:
“wherein the operations (iii) to (vi) are repeated for each tuple in the sample data set D”
As drafted, under its broadest reasonable interpretation, covers mental processes (concepts performed in the human mind (including an observation, evaluation, judgement, opinion)) and mathematical concepts (mathematical relationships, mathematical formulas or equations, mathematical calculations) but for the recitation of mere instructions to apply language (See MPEP 2106.05(f)) and insignificant extra-solution activity (See MPEP 2106.05(g)). The above limitation in the context of this claim encompasses repeating operations (iii)-(vi) for each tuple in the sample data set (corresponds to evaluation and judgement (operations (iii), (iv), and (vi) as discussed above in claim 1) and mathematical calculation (operation (v) as discussed above in claim 1); in particular, a human, with the assistance of pen and paper, can repeatedly perform the operations (iii)-(vi) for each tuple in the sample data set).
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that are mere instructions to apply (See MPEP 2106.05(f)) or insignificant extra-solution activity (See MPEP 2106.05(g)). The recitation of additional elements in claim 1 of a processor, memory, and a multi-layer neural network, as drafted, are reciting mere instructions to apply language such that it amounts to no more than mere instructions to apply the exceptions. In addition, the additional element of “receiving…” amounts to no more than insignificant extra-solution activity for receiving data. The additional elements do not integrate the abstract ideas into a practical application. 
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, all of the additional elements are “mere instructions to apply an exception” (i.e., the additional elements describe a processor, memory, and multi-layer neural network for applying the abstract ideas) or insignificant extra-solution activity (i.e., receiving data). Furthermore, the “receive …” limitation is insignificant extra-solution activity that is well-understood, routine, and conventional according to MPEP 2106.05(d) (“The courts have recognized the following computer functions as well-understood, routine, and conventional functions when they are claimed in a merely generic manner (e.g., at a high level of generality) or as insignificant extra-solution activity… i. Receiving or transmitting data over a network”). Mere instructions to apply an exception cannot provide an inventive concept. The claim is not patent eligible.

Regarding Claim 3,
Claim 3 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 3 is directed to a system, which is directed to a machine, one of the statutory categories.
Step 2A Prong One Analysis: The limitations:
“initializing a matrix A and a vector b”
“selecting an action, a*, that results in maximum value of Q (si+1, a) from the set of all possible actions (a* = argmaxaQ (si+1, a))”
“updating the value of the matrix A and the vector b using the following equations
[Equation image: media_image13.png]
wherein γ is a discount factor between 0 and 1”
“calculating a weight vector ω according to the following equation:
[Equation image: media_image14.png]”
As drafted, under their broadest reasonable interpretations, cover mental processes (concepts performed in the human mind (including an observation, evaluation, judgement, opinion)) and mathematical concepts (mathematical relationships, mathematical formulas or equations, mathematical calculations) but for the recitation of mere instructions to apply language (See MPEP 2106.05(f)) and insignificant extra-solution activity (See MPEP 2106.05(g)). The above limitations in the context of this claim encompass generating the Q* function by initializing a matrix A and a vector b (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can set initial values (initialize) for a matrix A and a vector b), selecting an action that results in the maximum value of Q (si+1, a) (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can determine the action that results in maximum value of Q (si+1, a) and then select that action), updating the value of the matrix A and the vector b using the given equations (corresponds to mathematical equations and calculation), and calculating a weight vector using the given equation (corresponds to mathematical equation and calculation).
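For illustration only, the operations of claim 3 can be sketched as ordinary linear algebra. The claimed update equations appear only as images in this record, so the sketch below assumes the standard least-squares temporal-difference form (A accumulates x(x − γx*)ᵀ, b accumulates x·r, and ω solves Aω = b); this form, the small diagonal initialization of A, and the names `solve` and `fit_weights` are assumptions of this example rather than the claimed equations.

```python
GAMMA = 0.9  # discount factor between 0 and 1 (value assumed)

def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (M[i][n] - s) / M[i][i]
    return w

def fit_weights(samples, k, ridge=1e-6):
    """samples: (x, next_options, reward), where next_options lists a
    (policy_value, vector) pair for every action in the next state."""
    # Initialize matrix A (small diagonal, assumed) and vector b (zeros).
    A = [[ridge if i == j else 0.0 for j in range(k)] for i in range(k)]
    b = [0.0] * k
    for x, next_options, reward in samples:
        # a* = argmax_a Q(s_{i+1}, a): keep the vector of the best action.
        _, x_star = max(next_options, key=lambda opt: opt[0])
        diff = [p - GAMMA * q for p, q in zip(x, x_star)]
        for i in range(k):
            for j in range(k):
                A[i][j] += x[i] * diff[j]   # assumed update of A
            b[i] += x[i] * reward           # assumed update of b
    return solve(A, b)                      # weight vector omega

samples = [
    ([1.0, 0.0], [(0.2, [0.0, 1.0]), (0.1, [1.0, 0.0])], 0.0),
    ([0.0, 1.0], [(0.0, [0.0, 0.0])], 1.0),
]
omega = fit_weights(samples, k=2)
```

With these two hypothetical samples, ω comes out close to [0.9, 1.0]: the second entry reflects the reward of 1, and the first is that reward discounted once.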
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that are mere instructions to apply (See MPEP 2106.05(f)) or insignificant extra-solution activity (See MPEP 2106.05(g)). The recitation of additional elements in claim 1 of a processor, memory, and a multi-layer neural network, as drafted, are reciting mere instructions to apply language such that it amounts to no more than mere instructions to apply the exceptions. In addition, the additional element of “receiving…” amounts to no more than insignificant extra-solution activity for receiving data. The additional elements do not integrate the abstract ideas into a practical application. 
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, all of the additional elements are “mere instructions to apply an exception” (i.e., the additional elements describe a processor, memory, and multi-layer neural network for applying the abstract ideas) or insignificant extra-solution activity (i.e., receiving data). Furthermore, the “receive …” limitation is insignificant extra-solution activity that is well-understood, routine, and conventional according to MPEP 2106.05(d) (“The courts have recognized the following computer functions as well-understood, routine, and conventional functions when they are claimed in a merely generic manner (e.g., at a high level of generality) or as insignificant extra-solution activity… i. Receiving or transmitting data over a network”). Mere instructions to apply an exception cannot provide an inventive concept. The claim is not patent eligible.
Regarding Claim 4,
Claim 4 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 4 is directed to a system, which is directed to a machine, one of the statutory categories.
Step 2A Prong One Analysis: Please see the analysis of claim 2. The limitation of claim 4 is only an additional element to the abstract ideas of claim 2.
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites an additional element that is mere instructions to apply (See MPEP 2106.05(f)). The limitation:
“wherein the weight vector ω represents the weights of the nodes of the output layer of the neural network”
As drafted, is an additional element defining a weight vector of the multi-layer neural network that amounts to no more than mere instructions to apply the exception for the abstract ideas. See MPEP 2106.05(f). Furthermore, the recitation of additional elements in claim 1 of a processor, memory, and a multi-layer neural network, as drafted, are reciting mere instructions to apply language such that it amounts to no more than mere instructions to apply the exceptions. In addition, the additional element of “receiving…” amounts to no more than insignificant extra-solution activity for receiving data. Therefore, the additional elements do not integrate the abstract ideas into a practical application.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, all of the additional elements are “mere instructions to apply an exception” (i.e., the additional elements describe a processor, memory, and multi-layer neural network (including the weight vector of the neural network) for applying the abstract ideas) or insignificant extra-solution activity (i.e., receiving data). Furthermore, the “receive …” limitation is insignificant extra-solution activity that is well-understood, routine, and conventional according to MPEP 2106.05(d) (“The courts have recognized the following computer functions as well-understood, routine, and conventional functions when they are claimed in a merely generic manner (e.g., at a high level of generality) or as insignificant extra-solution activity… i. Receiving or transmitting data over a network”). Mere instructions to apply an exception cannot provide an inventive concept. The claim is not patent eligible.

Regarding Claim 5,
Claim 5 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 5 is directed to a system, which is directed to a machine, one of the statutory categories.
Step 2A Prong One Analysis: The limitations:
“selecting an action, a*, that results in maximum value of Q (si, a)T ω from the set of all possible actions (a* = argmaxaQ(si, a)T ω)”
“setting the training target for the neural network as Q (si, a*)T ω”
As drafted, under their broadest reasonable interpretations, cover mental processes (concepts performed in the human mind (including an observation, evaluation, judgement, opinion)) but for the recitation of mere instructions to apply language (See MPEP 2106.05(f)) and insignificant extra-solution activity (See MPEP 2106.05(g)). The above limitations in the context of this claim encompass generating a training target by selecting an action that results in the maximum value of Q (si, a*)T ω (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can determine the action that results in maximum value of Q (si, a*)T ω and then select that action), and setting the training target as Q (si, a*)T ω (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can set the training target for the neural network as Q (si, a*)T ω).
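For illustration only, the selection and target-setting steps of claim 5 reduce to comparing a handful of dot products. The following minimal sketch assumes that Q(si, a) is available as a short vector per action and that ω is already known; the function names `dot` and `training_target`, the action labels, and the numeric values are hypothetical.

```python
def dot(u, v):
    """Inner product u^T v of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

def training_target(vectors_by_action, omega):
    """Return (a*, Q(s_i, a*)^T omega) for one state s_i."""
    scores = {a: dot(vec, omega) for a, vec in vectors_by_action.items()}
    best_action = max(scores, key=scores.get)   # a* = argmax_a Q(s_i, a)^T omega
    return best_action, scores[best_action]     # training target

# Hypothetical values: two actions, a two-element weight vector.
omega = [0.9, 1.0]
vectors = {"left": [1.0, 0.0], "right": [0.0, 1.0]}
a_star, target = training_target(vectors, omega)
print(a_star, target)  # prints: right 1.0
```

The example evaluates two products, 0.9 and 1.0, and keeps the larger.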
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that are mere instructions to apply (See MPEP 2106.05(f)) or insignificant extra-solution activity (See MPEP 2106.05(g)). The recitation of additional elements in claim 1 of a processor, memory, and a multi-layer neural network, as drafted, are reciting mere instructions to apply language such that it amounts to no more than mere instructions to apply the exceptions. In addition, the additional element of “receiving…” amounts to no more than insignificant extra-solution activity for receiving data. The additional elements do not integrate the abstract ideas into a practical application. 
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, all of the additional elements are “mere instructions to apply an exception” (i.e., the additional elements describe a processor, memory, and multi-layer neural network for applying the abstract ideas) or insignificant extra-solution activity (i.e., receiving data). Furthermore, the “receive …” limitation is insignificant extra-solution activity that is well-understood, routine, and conventional according to MPEP 2106.05(d) (“The courts have recognized the following computer functions as well-understood, routine, and conventional functions when they are claimed in a merely generic manner (e.g., at a high level of generality) or as insignificant extra-solution activity… i. Receiving or transmitting data over a network”). Mere instructions to apply an exception cannot provide an inventive concept. The claim is not patent eligible.

Regarding Claim 6,
Claim 6 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 6 is directed to a system, which is directed to a machine, one of the statutory categories.
Step 2A Prong One Analysis: The limitation:
“wherein the at least some of the parameters of the neural network are updated using a gradient descent that minimizes a mean square error (MSE) between the training target and the policy value Q(si, ai) for the corresponding state-action pair in the sample data set D”
As drafted, under its broadest reasonable interpretation, covers mathematical concepts (mathematical relationships, mathematical formulas or equations, mathematical calculations) but for the recitation of mere instructions to apply language (See MPEP 2106.05(f)) and insignificant extra-solution activity (See MPEP 2106.05(g)). The above limitation in the context of this claim encompasses updating at least some of the parameters of the neural network by using gradient descent that minimizes a mean square error between the training target and the policy value for the corresponding state-action pair (corresponds to mathematical equation and calculation; in particular, specification paragraph [0069] shows that minimizing the MSE using gradient descent involves calculating the given equation).
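For illustration only, gradient descent on a mean square error can be written out as repeated arithmetic. The equation of specification paragraph [0069] is not reproduced in this record, so the sketch below assumes the textbook form MSE = (1/n) Σ (target_i − w·x_i)² with a linear model standing in for the neural network; the names `mse` and `gradient_step`, the learning rate, and the data are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

def mse(weights, inputs, targets):
    """Mean square error between targets and linear predictions."""
    n = len(targets)
    return sum((t - dot(weights, x)) ** 2 for x, t in zip(inputs, targets)) / n

def gradient_step(weights, inputs, targets, lr=0.1):
    """One gradient-descent update: w <- w - lr * d(MSE)/dw."""
    n = len(targets)
    grad = [0.0] * len(weights)
    for x, t in zip(inputs, targets):
        err = t - dot(weights, x)
        for j, xj in enumerate(x):
            grad[j] += -2.0 * err * xj / n
    return [w - lr * g for w, g in zip(weights, grad)]

inputs = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical inputs
targets = [1.0, 2.0]                # hypothetical training targets
w = [0.0, 0.0]
print(mse(w, inputs, targets))      # prints 2.5
for _ in range(200):
    w = gradient_step(w, inputs, targets)
```

After 200 steps the weights in this example approach [1.0, 2.0] and the error approaches zero.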
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that are mere instructions to apply (See MPEP 2106.05(f)) or insignificant extra-solution activity (See MPEP 2106.05(g)). The recitation of additional elements in claim 1 of a processor, memory, and a multi-layer neural network, as drafted, are reciting mere instructions to apply language such that it amounts to no more than mere instructions to apply the exceptions. In addition, the additional element of “receiving…” amounts to no more than insignificant extra-solution activity for receiving data. The additional elements do not integrate the abstract ideas into a practical application. 
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, all of the additional elements are “mere instructions to apply an exception” (i.e., the additional elements describe a processor, memory, and multi-layer neural network for applying the abstract ideas) or insignificant extra-solution activity (i.e., receiving data). Furthermore, the “receive …” limitation is insignificant extra-solution activity that is well-understood, routine, and conventional according to MPEP 2106.05(d) (“The courts have recognized the following computer functions as well-understood, routine, and conventional functions when they are claimed in a merely generic manner (e.g., at a high level of generality) or as insignificant extra-solution activity… i. Receiving or transmitting data over a network”). Mere instructions to apply an exception cannot provide an inventive concept. The claim is not patent eligible.

Regarding Claim 7,
Claim 7 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 7 is directed to a system, which is directed to a machine, one of the statutory categories.
Step 2A Prong One Analysis: The limitation:
“wherein the MSE is minimized using a least mean square (LMS) algorithm”
As drafted, under its broadest reasonable interpretation, covers mathematical concepts (mathematical relationships, mathematical formulas or equations, mathematical calculations) but for the recitation of mere instructions to apply language (See MPEP 2106.05(f)) and insignificant extra-solution activity (See MPEP 2106.05(g)). The above limitation in the context of this claim encompasses minimizing the mean square error (MSE) by using a least mean square (LMS) algorithm (corresponds to mathematical calculations).
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that are mere instructions to apply (See MPEP 2106.05(f)) or insignificant extra-solution activity (See MPEP 2106.05(g)). The recitation of additional elements in claim 1 of a processor, memory, and a multi-layer neural network, as drafted, are reciting mere instructions to apply language such that it amounts to no more than mere instructions to apply the exceptions. In addition, the additional element of “receiving…” amounts to no more than insignificant extra-solution activity for receiving data. The additional elements do not integrate the abstract ideas into a practical application. 
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, all of the additional elements are “mere instructions to apply an exception” (i.e., the additional elements describe a processor, memory, and multi-layer neural network for applying the abstract ideas) or insignificant extra-solution activity (i.e., receiving data). Furthermore, the “receive …” limitation is insignificant extra-solution activity that is well-understood, routine, and conventional according to MPEP 2106.05(d) (“The courts have recognized the following computer functions as well-understood, routine, and conventional functions when they are claimed in a merely generic manner (e.g., at a high level of generality) or as insignificant extra-solution activity… i. Receiving or transmitting data over a network”). Mere instructions to apply an exception cannot provide an inventive concept. The claim is not patent eligible.
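As an illustrative sketch of a least mean square (LMS) calculation, the following uses the textbook Widrow-Hoff update, which adjusts each weight by a step proportional to the per-sample error. It assumes a two-weight linear model; the function name lms_update, the step size mu, and the sample values are hypothetical and are not taken from the claims or the specification.

```python
def lms_update(weights, features, target, mu):
    """One least-mean-square (Widrow-Hoff) update on a single sample."""
    prediction = sum(w * x for w, x in zip(weights, features))
    error = target - prediction
    # Move each weight along its input, scaled by the error and step size.
    return [w + mu * error * x for w, x in zip(weights, features)]


# Hypothetical samples generated from target = 1.0 * x0 + 3.0 * x1
samples = [([1.0, 0.0], 1.0), ([0.0, 1.0], 3.0), ([1.0, 1.0], 4.0)]
weights = [0.0, 0.0]
for _ in range(500):
    for features, target in samples:
        weights = lms_update(weights, features, target, mu=0.1)
```

The update uses only multiplication, subtraction, and addition per sample, which is why the limitation is characterized above as a mathematical calculation.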

Regarding Claim 8,
Claim 8 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 8 is directed to a system, which is directed to a machine, one of the statutory categories.
Step 2A Prong One Analysis: The limitations:
“wherein the MSE is defined in accordance with the following equation:
[Equation image: media_image15.png]
wherein n is the number of tuples in the sample data set D, Q*(si, a*)T ω is the training target and Q (si, ai) is the policy value for the corresponding state-action pair in the sample data set D”
“wherein the sum is first over the states in the sample data set and then over all the actions” 
As drafted, under their broadest reasonable interpretations, cover mathematical concepts (mathematical relationships, mathematical formulas or equations, mathematical calculations) but for the recitation of mere instructions to apply language (See MPEP 2106.05(f)) and insignificant extra-solution activity (See MPEP 2106.05(g)). The above limitations in the context of this claim encompass using the given mathematical equation to define the mean square error (MSE) (corresponds to mathematical equation and calculation), and how the summation within the MSE equation is performed (corresponds to mathematical calculation).
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that are mere instructions to apply (See MPEP 2106.05(f)) or insignificant extra-solution activity (See MPEP 2106.05(g)). The recitation of additional elements in claim 1 of a processor, memory, and a multi-layer neural network, as drafted, are reciting mere instructions to apply language such that it amounts to no more than mere instructions to apply the exceptions. In addition, the additional element of “receiving…” amounts to no more than insignificant extra-solution activity for receiving data. The additional elements do not integrate the abstract ideas into a practical application. 
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, all of the additional elements are “mere instructions to apply an exception” (i.e., the additional elements describe a processor, memory, and multi-layer neural network for applying the abstract ideas) or insignificant extra-solution activity (i.e., receiving data). Furthermore, the “receive …” limitation is insignificant extra-solution activity that is well-understood, routine, and conventional according to MPEP 2106.05(d) (“The courts have recognized the following computer functions as well-understood, routine, and conventional functions when they are claimed in a merely generic manner (e.g., at a high level of generality) or as insignificant extra-solution activity… i. Receiving or transmitting data over a network”). Mere instructions to apply an exception cannot provide an inventive concept. The claim is not patent eligible.

Regarding Claim 9,
Claim 9 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 9 is directed to a system, which is directed to a machine, one of the statutory categories.
Step 2A Prong One Analysis: Please see the analysis of claim 1. The limitations of claim 9 are only additional elements to the abstract ideas of claim 1.
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that are mere instructions to apply (See MPEP 2106.05(f)) or insignificant extra-solution activity (See MPEP 2106.05(g)). The limitations:
“wherein the state of the object in the environment is sensed using one or more of cameras, LIDAR and RADAR”
“wherein the current state of the object in the environment is described by one or more of images, LIDAR measurements and RADAR measurements”
As drafted, are additional elements that are part of the insignificant extra-solution activity of claim 1. The limitations of claim 9 further limit the limitation of claim 1 by further defining what data the states in the “receive a sample data set…” limitation comprise and how the states are obtained. Furthermore, the recitation of additional elements in claim 1 of a processor, memory, and a multi-layer neural network, as drafted, are reciting mere instructions to apply language such that it amounts to no more than mere instructions to apply the exceptions. In addition, the additional element of “receiving…” amounts to no more than insignificant extra-solution activity for receiving data. The additional elements do not integrate the abstract ideas into a practical application.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, all of the additional elements are “mere instructions to apply an exception” (i.e., the additional elements describe a processor, memory, and multi-layer neural network for applying the abstract ideas) or insignificant extra-solution activity (i.e., receiving data). Furthermore, the “receive …” limitation is insignificant extra-solution activity that is well-understood, routine, and conventional according to MPEP 2106.05(d) (“The courts have recognized the following computer functions as well-understood, routine, and conventional functions when they are claimed in a merely generic manner (e.g., at a high level of generality) or as insignificant extra-solution activity… i. Receiving or transmitting data over a network”). Mere instructions to apply an exception cannot provide an inventive concept. The claim is not patent eligible.

Regarding Claim 10,
Claim 10 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 10 is directed to a system, which is directed to a machine, one of the statutory categories.
Step 2A Prong One Analysis: Please see the analysis of claim 1. The limitation of claim 10 is only an additional element to the abstract ideas of claim 1.
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that are mere instructions to apply (See MPEP 2106.05(f)) or insignificant extra-solution activity (See MPEP 2106.05(g)). The limitation:
“wherein the action comprises any one or a combination of a steering angle for a steering unit, a throttle value for a throttle unit and braking value for a braking unit”
As drafted, is an additional element that is part of the insignificant extra-solution activity of claim 1. The limitation of claim 10 further limits the limitation of claim 1 by further defining what the action in the “receive a sample data set…” limitation comprises. Furthermore, the recitation of additional elements in claim 1 of a processor, memory, and a multi-layer neural network, as drafted, are reciting mere instructions to apply language such that it amounts to no more than mere instructions to apply the exceptions. In addition, the additional element of “receiving…” amounts to no more than insignificant extra-solution activity for receiving data. The additional elements do not integrate the abstract ideas into a practical application.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, all of the additional elements are “mere instructions to apply an exception” (i.e., the additional elements describe a processor, memory, and multi-layer neural network for applying the abstract ideas) or insignificant extra-solution activity (i.e., receiving data). Furthermore, the “receive …” limitation is insignificant extra-solution activity that is well-understood, routine, and conventional according to MPEP 2106.05(d) (“The courts have recognized the following computer functions as well-understood, routine, and conventional functions when they are claimed in a merely generic manner (e.g., at a high level of generality) or as insignificant extra-solution activity… i. Receiving or transmitting data over a network”). Mere instructions to apply an exception cannot provide an inventive concept. The claim is not patent eligible.

Regarding Claim 11,
Claim 11 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 11 is directed to a system, which is directed to a machine, one of the statutory categories.
Step 2A Prong One Analysis: Please see the analysis of claim 1. The limitation of claim 11 is only an additional element to the abstract ideas of claim 1.
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that are mere instructions to apply (See MPEP 2106.05(f)) or insignificant extra-solution activity (See MPEP 2106.05(g)). The limitation:
“wherein the object is a vehicle, robot or drone”
As drafted, is an additional element that is part of the insignificant extra-solution activity of claim 1. The limitation of claim 11 further limits the limitation of claim 1 by further defining what the object for the state data in the “receive a sample data set…” limitation comprises. Furthermore, the recitation of additional elements in claim 1 of a processor, memory, and a multi-layer neural network, as drafted, are reciting mere instructions to apply language such that it amounts to no more than mere instructions to apply the exceptions. In addition, the additional element of “receiving…” amounts to no more than insignificant extra-solution activity for receiving data. The additional elements do not integrate the abstract ideas into a practical application.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, all of the additional elements are “mere instructions to apply an exception” (i.e., the additional elements describe a processor, memory, and multi-layer neural network for applying the abstract ideas) or insignificant extra-solution activity (i.e., receiving data). Furthermore, the “receive …” limitation is insignificant extra-solution activity that is well-understood, routine, and conventional according to MPEP 2106.05(d) (“The courts have recognized the following computer functions as well-understood, routine, and conventional functions when they are claimed in a merely generic manner (e.g., at a high level of generality) or as insignificant extra-solution activity… i. Receiving or transmitting data over a network”). Mere instructions to apply an exception cannot provide an inventive concept. The claim is not patent eligible.

Regarding Claim 12,
Claim 12 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 12 is directed to a method of training a neural network, which is directed to a process, one of the statutory categories.
Step 2A Prong One Analysis: The limitations:
“(i) generating a first set of policy values Q(si, ai) for each state-action pair si, ai in a sample data set D {(si, ai, si+1, ri)} using an action-value function denoted the Q function, wherein si is a current state of the object in the environment, ai is the action chosen for the current state, si+1 is a subsequent state of the object and the environment and ri is a reward value for taking an action, ai, in a state, si, the value of which is determined in accordance with a reward function”
“(ii) generating a second set of policy values Q (si+1, a) for each subsequent state si+1 for all tuples in the sample data set D for each action in the set of all possible actions using the Q function”
“(iii) generating an approximate action-value function, denoted the Q* function, from the first set of policy values Q(si, ai) for the current state si and the action ai selected for the current state si and the second set of policy values Q (si+1, a) for the subsequent state si+1 after the selected action ai”
“(iv) generating a training target for the neural network using the Q* function”
“(v) calculating a training error as the difference between the training target and the policy value Q (si, ai) for the corresponding state-action pair in the sample data set D”
“(vi) updating at least some of the parameters of the neural network to minimize the training error”
As drafted, under their broadest reasonable interpretations, cover mental processes (concepts performed in the human mind (including an observation, evaluation, judgement, opinion)) and mathematical concepts (mathematical relationships, mathematical formulas or equations, mathematical calculations) but for the recitation of mere instructions to apply language (See MPEP 2106.05(f)) and insignificant extra-solution activity (See MPEP 2106.05(g)). The above limitations in the context of this claim encompass generating a first set of policy values (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can use an action value function (Q function) to generate a first set of policy values for each state-action pair in the sample data set), generating a second set of policy values (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can use an action value function (Q function) to generate a second set of policy values for each subsequent state in the sample data set), generating an approximate action-value function (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can use the first and second set of policy values for a selected action to generate an approximate action-value function (Q* function)), generating a training target (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can use the Q* function to generate a training target), calculating a training error (corresponds to mathematical calculation), and updating parameters of the neural network (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can update the neural network parameters so as to minimize the training error).
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim does not recite any additional elements to the abstract ideas discussed above. Since the claim does not include additional elements, the abstract ideas are not integrated into a practical application.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with regard to integration of the abstract ideas into a practical application, the claim does not recite any additional elements to the abstract idea. Therefore, the claim is not patent eligible.
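As an illustrative sketch of how operations (i) through (vi) of claim 12 relate to one another, the following walks a two-tuple sample data set through one training pass. It makes several simplifying assumptions that are not drawn from the claims or the specification: a lookup table stands in for the neural network, a one-step Bellman backup is swapped in for the claimed Q* function and training target, operation (ii) is evaluated per tuple rather than for all tuples up front, and all names and numeric values are hypothetical.

```python
GAMMA = 0.9          # hypothetical discount factor
ACTIONS = [0, 1]     # hypothetical set of all possible actions

# Sample data set D of tuples (s_i, a_i, s_{i+1}, r_i); values are made up.
D = [(0, 1, 1, 0.0), (1, 0, 2, 1.0)]

# A lookup table stands in for the neural network's Q function.
q_table = {(s, a): 0.0 for s in range(3) for a in ACTIONS}


def q(state, action):
    return q_table[(state, action)]


def training_pass(lr=0.5):
    """Run operations (i)-(vi) once over D and return the training errors."""
    errors = []
    for s, a, s_next, r in D:
        q_sa = q(s, a)                                # (i) policy value
        q_next = [q(s_next, b) for b in ACTIONS]      # (ii) values for s_{i+1}
        # (iii)-(iv): stand-in for the Q* function and the training target
        target = r + GAMMA * max(q_next)
        error = target - q_sa                         # (v) training error
        q_table[(s, a)] = q_sa + lr * error           # (vi) parameter update
        errors.append(error)
    return errors


first_errors = training_pass()
```

In this sketch every operation reduces to looking up numbers, taking a maximum, and performing a subtraction and an addition per tuple.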

Regarding Claim 13,
Claim 13 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 13 is directed to a method of training a neural network, which is directed to a process, one of the statutory categories.
Step 2A Prong One Analysis: The limitations:
“wherein the operations (iii) to (vi) are repeated for each tuple in the sample data set D”
As drafted, under their broadest reasonable interpretations, cover mental processes (concepts performed in the human mind (including an observation, evaluation, judgement, opinion)) and mathematical concepts (mathematical relationships, mathematical formulas or equations, mathematical calculations) but for the recitation of mere instructions to apply language (See MPEP 2106.05(f)) and insignificant extra-solution activity (See MPEP 2106.05(g)). The above limitations in the context of this claim encompass repeating operations (iii)-(vi) for each tuple in the sample data set (corresponds to evaluation and judgement (operations (iii), (iv), and (vi) as discussed above in claim 12) and mathematical calculation (operation (v) as discussed above in claim 12); in particular, a human, with the assistance of pen and paper, can repeatedly perform the operations (iii)-(vi) for each tuple in the sample data set).
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim does not recite any additional elements to the abstract ideas discussed above. Since the claim does not include additional elements, the abstract ideas are not integrated into a practical application.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with regard to integration of the abstract ideas into a practical application, the claim does not recite any additional elements to the abstract idea. Therefore, the claim is not patent eligible.

Regarding Claim 14,
Claim 14 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 14 is directed to a method of training a neural network, which is directed to a process, one of the statutory categories.
Step 2A Prong One Analysis: The limitations:
“initializing a matrix A and a vector b”
“selecting an action, a*, that results in maximum value of Q (si+1, a) from the set of all possible actions (a* = argmax_a Q (si+1, a))”
“updating the value of the matrix A and the vector b using the following equations
[Equation image: media_image16.png]
wherein γ is a discount factor between 0 and 1”
“calculating a weight vector ω according to the following equation:
[Equation image: media_image17.png]”
As drafted, under their broadest reasonable interpretations, cover mental processes (concepts performed in the human mind (including an observation, evaluation, judgement, opinion)) and mathematical concepts (mathematical relationships, mathematical formulas or equations, mathematical calculations) but for the recitation of mere instructions to apply language (See MPEP 2106.05(f)) and insignificant extra-solution activity (See MPEP 2106.05(g)). The above limitations in the context of this claim encompass generating the Q* function by initializing a matrix A and a vector b (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can set initial values (initialize) for a matrix A and a vector b), selecting an action that results in the maximum value of Q (si+1, a) (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can determine the action that results in maximum value of Q (si+1, a) and then select that action), updating the value of the matrix A and the vector b using the given equations (corresponds to mathematical equations and calculation), and calculating a weight vector using the given equation (corresponds to mathematical equation and calculation).
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim does not recite any additional elements to the abstract ideas discussed above. Since the claim does not include additional elements, the abstract ideas are not integrated into a practical application.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with regard to integration of the abstract ideas into a practical application, the claim does not recite any additional elements to the abstract idea. Therefore, the claim is not patent eligible.
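As an illustrative sketch of the type of calculation recited in claim 14, the following accumulates a matrix A and a vector b over a sample set and then solves for a weight vector. The claimed equations are reproduced in this record only as images, so the sketch assumes the standard least-squares temporal-difference (LSTDQ) form, A += φ (φ − discount · φ′)ᵀ, b += φ r, ω = A⁻¹ b, which may differ from the claimed equations. The feature vectors, reward values, zero initialization, and function names are hypothetical.

```python
GAMMA = 0.9  # hypothetical discount factor between 0 and 1


def outer(u, v):
    """Outer product of two vectors as a nested list."""
    return [[ui * vj for vj in v] for ui in u]


def solve_2x2(A, b):
    """Solve A w = b for a 2x2 matrix A by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]


def lstdq(samples):
    """samples: list of (phi, phi_next_best, reward), 2-element features."""
    A = [[0.0, 0.0], [0.0, 0.0]]    # initialize matrix A
    b = [0.0, 0.0]                  # initialize vector b
    for phi, phi_next, r in samples:
        diff = [p - GAMMA * pn for p, pn in zip(phi, phi_next)]
        update = outer(phi, diff)
        A = [[A[i][j] + update[i][j] for j in range(2)] for i in range(2)]
        b = [b[i] + phi[i] * r for i in range(2)]
    return solve_2x2(A, b)          # omega = A^{-1} b


# Hypothetical one-hot features: the first pair leads to the second with
# reward 0, and the second pair ends the episode with reward 1.
samples = [([1.0, 0.0], [0.0, 1.0], 0.0),
           ([0.0, 1.0], [0.0, 0.0], 1.0)]
omega = lstdq(samples)
```

With these made-up samples the solved weights equal the discounted returns of the two state-action pairs, which is the arithmetic a person could carry out by hand for a small matrix.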

Regarding Claim 15,
Claim 15 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 15 is directed to a method of training a neural network, which is directed to a process, one of the statutory categories.
Step 2A Prong One Analysis: The limitations:
“wherein the weight vector ω represents the weights of the nodes of the output layer of the neural network”
As drafted, is part of the abstract idea of claim 14 of calculating a weight vector ω. The limitation of claim 15 further limits the limitation of claim 14 by defining what the weight vector ω represents. The above limitation in the context of this claim encompasses using the given equation to calculate a weight vector ω that represents the weights of the output layer of the neural network (corresponds to mathematical equation and calculation).
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim does not recite any additional elements to the abstract ideas discussed above. Since the claim does not include additional elements, the abstract ideas are not integrated into a practical application.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with regard to integration of the abstract ideas into a practical application, the claim does not recite any additional elements to the abstract idea. Therefore, the claim is not patent eligible.

Regarding Claim 16,
Claim 16 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 16 is directed to a method of training a neural network, which is directed to a process, one of the statutory categories.
Step 2A Prong One Analysis: The limitations:
“selecting an action, a*, that results in maximum value of Q (si, a)T ω from the set of all possible actions (a* = argmax_a Q(si, a)T ω)”
“setting the training target for the neural network as Q (si, a*)T ω”
As drafted, under their broadest reasonable interpretations, cover mental processes (concepts performed in the human mind (including an observation, evaluation, judgement, opinion)) but for the recitation of mere instructions to apply language (See MPEP 2106.05(f)) and insignificant extra-solution activity (See MPEP 2106.05(g)). The above limitations in the context of this claim encompass generating a training target by selecting an action that results in the maximum value of Q (si, a*)T ω (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can determine the action that results in maximum value of Q (si, a*)T ω and then select that action), and setting the training target as Q (si, a*)T ω (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can set the training target for the neural network as Q (si, a*)T ω).
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim does not recite any additional elements to the abstract ideas discussed above. Since the claim does not include additional elements, the abstract ideas are not integrated into a practical application.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with regard to integration of the abstract ideas into a practical application, the claim does not recite any additional elements to the abstract idea. Therefore, the claim is not patent eligible.
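As an illustrative sketch of the selection and target-setting steps recited in claim 16, the following picks the action whose vector Q(si, a) has the largest inner product with ω and uses that inner product as the training target. It assumes Q(si, a) is available as a short numeric vector for each action; the action labels, vector values, and function names are hypothetical and are not taken from the claims or the specification.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def training_target(features_by_action, omega):
    """Return (a*, Q(s_i, a*)^T omega) for one state.

    features_by_action maps each action to its vector Q(s_i, a).
    """
    best_action = max(features_by_action,
                      key=lambda a: dot(features_by_action[a], omega))
    return best_action, dot(features_by_action[best_action], omega)


# Hypothetical values for a single state with three candidate actions.
omega = [0.5, 2.0]
features = {"left": [1.0, 0.0], "straight": [0.0, 1.0], "right": [1.0, 0.5]}
a_star, target = training_target(features, omega)
```

Here the three candidate values are 0.5, 2.0, and 1.5, so the selection amounts to comparing three numbers and keeping the largest.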

Regarding Claim 17,
Claim 17 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 17 is directed to a method of training a neural network, which is directed to a process, one of the statutory categories.
Step 2A Prong One Analysis: The limitations:
“wherein the at least some of the parameters of the neural network are updated using a gradient descent that minimizes a mean square error (MSE) between the training target and the policy value Q(si, ai) for the corresponding state-action pair in the sample data set D”
As drafted, under its broadest reasonable interpretation, covers mathematical concepts (mathematical relationships, mathematical formulas or equations, mathematical calculations) but for the recitation of mere instructions to apply language (See MPEP 2106.05(f)) and insignificant extra-solution activity (See MPEP 2106.05(g)). The above limitation in the context of this claim encompasses updating at least some of the parameters of the neural network by using gradient descent that minimizes a mean square error between the training target and the policy value for the corresponding state-action pair (corresponds to mathematical equation and calculation; in particular, specification paragraph [0069] shows that minimizing the MSE using gradient descent involves calculating the given equation).
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim does not recite any additional elements to the abstract ideas discussed above. Since the claim does not include additional elements, the abstract ideas are not integrated into a practical application.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with regard to integration of the abstract ideas into a practical application, the claim does not recite any additional elements to the abstract idea. Therefore, the claim is not patent eligible.

Regarding Claim 18,
Claim 18 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 18 is directed to a method of training a neural network, which is directed to a process, one of the statutory categories.
Step 2A Prong One Analysis: The limitations:
“wherein the MSE is minimized using a least mean square (LMS) algorithm”
As drafted, under its broadest reasonable interpretation, covers mathematical concepts (mathematical relationships, mathematical formulas or equations, mathematical calculations) but for the recitation of mere instructions to apply language (See MPEP 2106.05(f)) and insignificant extra-solution activity (See MPEP 2106.05(g)). The above limitation in the context of this claim encompasses minimizing the mean square error (MSE) by using a least mean square (LMS) algorithm (corresponds to mathematical calculations).
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim does not recite any additional elements to the abstract ideas discussed above. Since the claim does not include additional elements, the abstract ideas are not integrated into a practical application.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with regard to integration of the abstract ideas into a practical application, the claim does not recite any additional elements to the abstract idea. Therefore, the claim is not patent eligible.

Regarding Claim 19,
Claim 19 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 19 is directed to a method of training a neural network, which is directed to a process, one of the statutory categories.
Step 2A Prong One Analysis: The limitations:
“wherein the MSE is defined in accordance with the following equation:
[Equation image: media_image18.png]
wherein n is the number of tuples in the sample data set D, Q(si, a*)T ω is the training target and Q (si, ai) is the policy value for the corresponding state-action pair in the sample data set D”
“wherein the sum is first over the states in the sample data set and then over all the actions”
As drafted, under their broadest reasonable interpretations, cover mathematical concepts (mathematical relationships, mathematical formulas or equations, mathematical calculations) but for the recitation of mere instructions to apply language (See MPEP 2106.05(f)) and insignificant extra-solution activity (See MPEP 2106.05(g)). The above limitations in the context of this claim encompass using the given mathematical equation to define the mean square error (MSE) (corresponds to mathematical equation and calculation), and how the summation within the MSE equation is performed (corresponds to mathematical calculation).
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim does not recite any additional elements to the abstract ideas discussed above. Since the claim does not include additional elements, the abstract ideas are not integrated into a practical application.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with regard to integration of the abstract ideas into a practical application, the claim does not recite any additional elements to the abstract idea. Therefore, the claim is not patent eligible.

Regarding Claim 20,
Claim 20 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 20 is directed to a method of training a neural network, which is directed to a process, one of the statutory categories.
Step 2A Prong One Analysis: The limitations:
“wherein the state of the object in the environment is sensed using one or more of cameras, LIDAR and RADAR”
“wherein the current state of the object in the environment is described by one or more of images, LIDAR measurements and RADAR measurements” 
As drafted, are part of the abstract idea of claim 12 of generating a set of policy values for each state-action pair in a sample data set. The limitations of claim 20 further limit the limitation of claim 12 by defining what the states in the sample data set comprise and how they are obtained. The above limitations in the context of this claim encompass generating a first set of policy values Q(si, ai) for each state-action pair si, ai in a sample data set D, wherein the state of the object is sensed using one or more of cameras, LIDAR and RADAR (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can use an action value function (Q function) to generate a first set of policy values for each state-action pair in the sample data set, where the state was sensed by cameras, LIDAR, and/or RADAR (i.e. a human can use the sensed state in the generating of the policy value)), wherein the current state of the object in the environment is described by one or more of images, LIDAR measurements and RADAR measurements (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can use an action value function (Q function) to generate a first set of policy values for each state-action pair in the sample data set, where the current state is described by images, LIDAR measurements, and/or RADAR measurements (i.e. a human can use the described current state in the generating of the policy value)).
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim does not recite any additional elements beyond the abstract ideas discussed above. Since the claim does not include additional elements, the abstract ideas are not integrated into a practical application.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with regard to integration of the abstract ideas into a practical application, the claim does not recite any additional elements to the abstract idea. Therefore, the claim is not patent eligible.

Regarding Claim 21,
Claim 21 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 21 is directed to a method of training a neural network, which is directed to a process, one of the statutory categories.
Step 2A Prong One Analysis: The limitations:
“wherein the action comprises any one or a combination of a steering angle for a steering unit, a throttle value for a throttle unit and braking value for a braking unit”
As drafted, is part of the abstract idea of claim 12 of generating a set of policy values for each state-action pair in a sample data set. The limitation of claim 21 further limits the limitation of claim 12 by defining what the actions in the sample data set comprise. The above limitation in the context of this claim encompasses generating a first set of policy values Q(si, ai) for each state-action pair si, ai in a sample data set D, wherein the action can comprise a steering angle for a steering unit, a throttle value for a throttle unit, and/or a braking value for a braking unit (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can use an action value function (Q function) to generate a first set of policy values for each state-action pair in the sample data set, where the action is a steering angle for a steering unit, a throttle value for a throttle unit, and/or a braking value for a braking unit (i.e. a human can use the described action in the generating of the policy value)).
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim does not recite any additional elements beyond the abstract ideas discussed above. Since the claim does not include additional elements, the abstract ideas are not integrated into a practical application.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with regard to integration of the abstract ideas into a practical application, the claim does not recite any additional elements to the abstract idea. Therefore, the claim is not patent eligible.

Regarding Claim 22,
Claim 22 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 22 is directed to a method of training a neural network, which is directed to a process, one of the statutory categories.
Step 2A Prong One Analysis: The limitations:
“wherein the object is a vehicle, robot or drone”
As drafted, is part of the abstract idea of claim 12 of generating a set of policy values for each state-action pair in a sample data set. The limitation of claim 22 further limits the limitation of claim 12 by defining what the object that the state and actions in the sample data describe/are used for can be. The above limitation in the context of this claim encompasses generating a first set of policy values Q(si, ai) for each state-action pair si, ai in a sample data set D, wherein the object of the state-action pair can be a vehicle, robot or drone (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can use an action value function (Q function) to generate a first set of policy values for each state-action pair in the sample data set, where the object of the state-action pairs is a vehicle, robot or drone (i.e. a human can use the state-action pairs for a vehicle, robot or drone in the generating of the policy value)).
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim does not recite any additional elements beyond the abstract ideas discussed above. Since the claim does not include additional elements, the abstract ideas are not integrated into a practical application.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with regard to integration of the abstract ideas into a practical application, the claim does not recite any additional elements to the abstract idea. Therefore, the claim is not patent eligible.

Regarding Claim 23,
Claim 23 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 23 is directed to a non-transitory machine readable medium, which is directed to an article of manufacture, one of the statutory categories.
Step 2A Prong One Analysis: The limitations:
“(i) generate a first set of policy values Q(si, ai) for each state-action pair si, ai in a sample data set D {(si, ai, si+1,ri)} using an action-value function denoted the Q function, wherein si is a current state of the object in the environment, ai is the action chosen for the current state, si+1 is a subsequent state of the object and the environment and ri is a reward value for taking an action, ai, in a state, si, the value of which is determined in accordance with a reward function”
“(ii) generate a second set of policy values Q (si+1, a) for each subsequent state si+1 for all tuples in the sample data set D for each action in the set of all possible actions using the Q function”
“(iii) generate an approximate action-value function, denoted the Q* function, from the first set of policy values Q(si, ai) for the current state si and the action ai selected for the current state si and the second set of policy values Q (si+1, a) for the subsequent state si+1 after the selected action ai”
“(iv) generate a training target for the neural network using the Q* function”
“(v) calculate a training error as the difference between the training target and the policy value Q (si, ai) for the corresponding state-action pair in the sample data set D”
“(vi) update at least some of the parameters of the neural network to minimize the training error”
As drafted, under their broadest reasonable interpretations, cover mental processes (concepts performed in the human mind (including an observation, evaluation, judgement, opinion)) and mathematical concepts (mathematical relationships, mathematical formulas or equations, mathematical calculations) but for the recitation of mere instructions to apply language (See MPEP 2106.05(f)) and insignificant extra-solution activity (See MPEP 2106.05(g)). The above limitations in the context of this claim encompass generating a first set of policy values (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can use an action value function (Q function) to generate a first set of policy values for each state-action pair in the sample data set), generating a second set of policy values (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can use an action value function (Q function) to generate a second set of policy values for each subsequent state in the sample data set), generating an approximate action-value function (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can use the first and second set of policy values for a selected action to generate an approximate action-value function (Q* function)), generating a training target (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can use the Q* function to generate a training target), calculating a training error (corresponds to mathematical calculation), and updating parameters of the neural network (corresponds to evaluation and judgement; in particular, a human, with the assistance of pen and paper, can update the neural network parameters so as to minimize the training error).
Step 2A Prong Two Analysis: The judicial exceptions are not integrated into a practical application. In particular, the claim recites additional elements that are mere instructions to apply (See MPEP 2106.05(f)). The limitations:
“a processor”
“a computing device”
As drafted, are additional elements that amount to no more than mere instructions to apply the exception for the abstract ideas. See MPEP 2106.05(f). Therefore, the additional elements do not integrate the abstract ideas into a practical application.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, all of the additional elements are “mere instructions to apply an exception” (i.e., the additional elements describe a processor and a computing device for applying the abstract ideas). Mere instructions to apply an exception cannot provide an inventive concept. The claim is not patent eligible.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 2, 5-13, and 16-23 are rejected under 35 U.S.C. 103 as being unpatentable over Shalev-Shwartz et al. (WO 2017120336 A2) in view of Sasaki et al. ("A Study on Vision-based Mobile Robot Learning by Deep Q-network"), and further in view of Wang et al. ("Formulation of Deep Reinforcement Learning Architecture Toward Autonomous Driving for On-Ramp Merge").
Regarding Claim 1,
Shalev-Shwartz et al. teaches a system, comprising: a processor (Fig. 1; [065]: " FIG. 1 is a block diagram representation of a system 100 consistent with the exemplary disclosed embodiments. System 100 may include various components depending on the requirements of a particular implementation. In some embodiments, system 100 may include a processing unit 110" teaches a system 100 including a processing unit 110, which may include one or more processors); 
a memory coupled to the processor (Fig. 1; [065]: "FIG. 1 is a block diagram representation of a system 100 consistent with the exemplary disclosed embodiments. System 100 may include various components depending on the requirements of a particular implementation. In some embodiments, system 100 may include a processing unit 110, an image acquisition unit 120, a position sensor 130, one or more memory units 140" teaches a memory 140 coupled to the processing unit 110 (processor)), 
the memory storing executable instructions that, when executed by the processor, cause the processor to: receive a sample data set D (Fig. 4; [072]: " Each memory 140, 150 may include software instructions that when executed by a processor …, may control operation of various aspects of system 100" teaches that the memory stores instructions for execution by the processor. [0217]: "First, using imitation, an initial policy can be constructed using the "behavior cloning" paradigm, using large real world data sets" teaches that the system may be trained using real world data sets).
Shalev-Shwartz et al. does not appear to explicitly teach a sample data set D {(si, ai, si+1,ri)}, wherein si is a current state of the object in the environment, ai is the action chosen for the current state, si+1 is a subsequent state of the object and the environment and ri is a reward value for taking an action, ai, in a state, si, the value of which is determined in accordance with a reward function; apply, to the sample data set, a multi-layer neural network, each layer in the multi-layer neural network comprising a plurality of nodes, each node in each layer having a corresponding weight, wherein the neural network is configured to: (i) generate a first set of policy values Q(si, ai) for each state-action pair si, ai in the sample data set D using an action-value function denoted the Q function; (ii) generate a second set of policy values Q (si+1,a) for each subsequent state si+1, for all tuples in the sample data set D for each action in the set of all possible actions using the Q function; (iii) generate an approximate action-value function, denoted the Q* function, from the first set of policy values Q(si,ai) for the current state si and the action ai selected for the current state si and the second set of policy values Q (si+1,a) for the subsequent state si+1 after the selected action ai; (iv) generate a training target for the neural network using the Q* function; (v) calculate a training error as the difference between the training target and the policy value Q (si,ai) for the corresponding state-action pair in the sample data set D; and (vi) update at least some of the parameters of the neural network to minimize the training error.
However, Sasaki et al. teaches a sample data set D {(si, ai, si+1,ri)}, wherein si is a current state of the object in the environment, ai is the action chosen for the current state, si+1 is a subsequent state of the object and the environment and ri is a reward value for taking an action, ai, in a state, si, the value of which is determined in accordance with a reward function (Section 2, Second paragraph: "In order to perform experience replay, the agent’s experiences et = (st, at, rt, st+1) at each time step t are stored in the data set D = {e1, ..., et}" teaches that the input data set can contain agent experiences et = (st, at, rt, st+1), where st is the current state, at is the action for the current state, st+1 is the next state, and rt is the reward for the current state. Section 3.3, Third paragraph: "The reward rt at time step t is defined by using the values of 8 distance sensors of the robot and the action selected by the robot as follows:

[Equation image: media_image19.png]
" teaches that the reward is determined based on a function based on the taken action in a given state); 
apply, to the sample data set, a multi-layer neural network, each layer in the multi-layer neural network comprising a plurality of nodes, each node in each layer having a corresponding weight (Section 2, Second paragraph: "Reinforcement learning is known to be sometimes unstable or even to diverge when the nonlinear function approximator such as neural network is used to represent the action-value function (Q-function)" teaches the Q-function in Q-learning may be implemented by a neural network. This further teaches that during learning the Q-network is applied to the sample dataset. Section 3.3, Last paragraph: "The structure of CNN is as follows: there are two hidden layers each of which has 8 × 8 weight filters and 2 × 2 max-pooling. The number of weight filters is 32 both in the first convolution layer and in the second convolution layer. The inputs are four gray-scaled images of 50 × 30 [pixels] size. The number of nodes in the hidden layer of classification part is 256. The number of nodes in the output layer is 4, which equals to the number of action candidates of the robot" teaches the CNN (neural network) is a multi-layer network consisting of nodes and corresponding weights), wherein the neural network is configured to: 
(iii) generate an approximate action-value function, denoted the Q* function, from the first set of policy values Q(si,ai) for the current state si and the action ai selected for the current state si and the second set of policy values Q (si+1,a) for the subsequent state si+1 after the selected action ai (Fig. 2; teaches that the Q^ (Q*) function is updated based on the Q function every few steps (iterations) of the algorithm. This means that, for example, after two steps, the Q^ function would be updated using the Q values (policy values) for the current/initial state st (si) and the selected action at (ai) along with the Q values for the subsequent state st+1 (si+1)); 
(iv) generate a training target for the neural network using the Q* function (Fig. 2; Section 2, Third paragraph: "compute the target yj = r + γ maxa Qˆ(s’, a’; θi-) at iteration i" teaches calculating (generating) a training target by using the Q^ (Q*) function); 
(v) calculate a training error as the difference between the training target and the policy value Q (si,ai) for the corresponding state-action pair in the sample data set D (Fig. 2; Section 2, Third and Fourth paragraphs: "Note that the term r + γ maxa Qˆ(s’, a’; θi) − Q(s, a; θi) in the above loss function means TD error. This TD error depends on the network parameters θi at iteration i" teaches that the error is the difference between the training target and the policy value Q(s,a) for θi (i.e. Q(si, ai) for the state action pair corresponding to the network parameters θi)); and 
(vi) update at least some of the parameters of the neural network to minimize the training error (Fig. 2; Section 2, Fourth paragraph: "Note that the term r + γ maxa Qˆ(s’, a’; θi) − Q(s, a; θi) in the above loss function means TD error. This TD error depends on the network parameters θi at iteration i … After the learning of action-value function based on this approach during C steps, the target network parameters θ− are updated" teaches that the neural network parameters θ are updated to minimize the error).
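For illustration only, the target and TD-error computation quoted from Sasaki et al. above can be sketched as follows. This is a minimal sketch, not code from the reference: the function names, the discount factor of 0.5, and the Q-values are hypothetical and were chosen only to make the arithmetic easy to follow.

```python
# Illustrative sketch only; names and numbers are hypothetical,
# not taken from Sasaki et al.
GAMMA = 0.5  # assumed discount factor


def td_target(reward, next_q_values, gamma=GAMMA):
    """Target y = r + gamma * max over a' of Q_hat(s', a')."""
    return reward + gamma * max(next_q_values)


def td_error(target, q_value):
    """TD error: the target minus the online estimate Q(s, a)."""
    return target - q_value


# Q_hat(s', a') from the target network for four candidate actions
# (made-up values).
next_q = [0.5, 2.0, 1.0, -1.0]

y = td_target(reward=1.0, next_q_values=next_q)
delta = td_error(y, q_value=1.5)
```

With these assumed values the target is 1.0 + 0.5 × 2.0 = 2.0, and the TD error against an online estimate of 1.5 is 0.5.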
Shalev-Shwartz et al. and Sasaki et al. are analogous to the claimed invention because they are directed to reinforcement learning for control of an object.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate a sample data set D {(si, ai, si+1,ri)}, wherein si is a current state of the object in the environment, ai is the action chosen for the current state, si+1 is a subsequent state of the object and the environment and ri is a reward value for taking an action, ai, in a state, si, the value of which is determined in accordance with a reward function; apply, to the sample data set, a multi-layer neural network, each layer in the multi-layer neural network comprising a plurality of nodes, each node in each layer having a corresponding weight, wherein the neural network is configured to: (iii) generate an approximate action-value function, denoted the Q* function, from the first set of policy values Q(si,ai) for the current state si and the action ai selected for the current state si and the second set of policy values Q (si+1,a) for the subsequent state si+1 after the selected action ai; (iv) generate a training target for the neural network using the Q* function; (v) calculate a training error as the difference between the training target and the policy value Q (si,ai) for the corresponding state-action pair in the sample data set D; and (vi) update at least some of the parameters of the neural network to minimize the training error as taught by Sasaki et al. to the disclosed invention of Shalev-Shwartz et al.
One of ordinary skill in the art would have been motivated to make this modification to increase learning performance because "the original DQN does not have good learning performance in our robot navigation problem. Then, we propose a modified method of DQN which reuses the best target network so far when the performance of learning suddenly decreases" (Sasaki et al. Section 1, Second paragraph).
Shalev-Shwartz et al. in view of Sasaki et al. does not appear to explicitly teach (i) generate a first set of policy values Q(si, ai) for each state-action pair si, ai in the sample data set D using an action-value function denoted the Q function; (ii) generate a second set of policy values Q (si+1,a) for each subsequent state si+1, for all tuples in the sample data set D for each action in the set of all possible actions using the Q function.
However, Wang et al. teaches (i) generate a first set of policy values Q(si, ai) for each state-action pair si, ai in the sample data set D using an action-value function denoted the Q function (Fig. 3; Section III. B, second paragraph: "One is the Q-value approximation for action selection (left part in Fig. 3) in which the internal state st, … is used as the input to the Q-network to get the chosen action at" teaches that the Q values (policy values) are generated for the state st (si) using the Q-network (Q function)); 
(ii) generate a second set of policy values Q (si+1,a) for each subsequent state si+1, for all tuples in the sample data set D for each action in the set of all possible actions using the Q function (Fig. 3; Section III. B, second paragraph: "One is the Q-value approximation for action selection (left part in Fig. 3) in which the internal state st, … is used as the input to the Q-network to get the chosen action at" teaches that the Q values (policy values) are generated for the state st (si) using the Q-network (Q function). Fig. 3 further teaches that the process is repeated for st+1 (si+1), so the Q values (policy values) will also be generated for Q(st+1,a)).
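For illustration only, the two sets of policy values recited in operations (i) and (ii) can be sketched as follows. This is a minimal sketch under stated assumptions: a lookup table stands in for the Q-network, and the action names, states, and values are hypothetical and do not come from Wang et al. or from the claims.

```python
# Illustrative sketch only: a lookup table stands in for the Q-network.
ACTIONS = ["brake", "accelerate"]  # hypothetical action set


def q_value(table, state, action):
    """Return Q(state, action); unseen pairs default to 0.0 (assumption)."""
    return table.get((state, action), 0.0)


def first_set(table, dataset):
    """Operation (i): Q(s_i, a_i) for each tuple (s_i, a_i, s_next, r_i)."""
    return [q_value(table, s, a) for (s, a, _s_next, _r) in dataset]


def second_set(table, dataset):
    """Operation (ii): Q(s_next, a) for every action a, for each tuple."""
    return [[q_value(table, s_next, a) for a in ACTIONS]
            for (_s, _a, s_next, _r) in dataset]


# Made-up Q-values and two made-up tuples (s_i, a_i, s_next, r_i).
table = {("s0", "brake"): 1.0, ("s1", "brake"): 0.5, ("s1", "accelerate"): 2.0}
dataset = [("s0", "brake", "s1", 1.0), ("s1", "accelerate", "s2", 0.0)]

q_first = first_set(table, dataset)    # [1.0, 2.0]
q_second = second_set(table, dataset)  # [[0.5, 2.0], [0.0, 0.0]]
```

The first set holds one value per tuple, while the second set holds one value per tuple per possible action, which is the distinction the two operations draw.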
Shalev-Shwartz et al., Sasaki et al., and Wang et al. are analogous to the claimed invention because they are directed to reinforcement learning for control of an object.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate (i) generate a first set of policy values Q(si, ai) for each state-action pair si, ai in the sample data set D using an action-value function denoted the Q function; (ii) generate a second set of policy values Q (si+1,a) for each subsequent state si+1, for all tuples in the sample data set D for each action in the set of all possible actions using the Q function as taught by Wang et al. to the disclosed invention of Shalev-Shwartz et al. in view of Sasaki et al.
One of ordinary skill in the art would have been motivated to make this modification "to incorporate the influence of historical and interactive driving behaviors on the action selection [in Deep Reinforcement Learning]" (Wang et al. Section V, First paragraph).
Regarding Claim 2,
Shalev-Shwartz et al. in view of Sasaki et al., and further in view of Wang et al. teaches the system of claim 1.
Additionally, Sasaki et al. further teaches wherein the operations (iii) to (vi) are repeated for each tuple in the sample data set D (Fig. 2; teaches that the operations are repeated using the tuple for each time t and each episode M in the sample dataset).
Shalev-Shwartz et al., Sasaki et al., and Wang et al. are analogous to the claimed invention because they are directed to reinforcement learning for control of an object.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate wherein the operations (iii) to (vi) are repeated for each tuple in the sample data set D as taught by Sasaki et al. to the disclosed invention of Shalev-Shwartz et al. in view of Wang et al.
One of ordinary skill in the art would have been motivated to make this modification to increase learning performance because "the original DQN does not have good learning performance in our robot navigation problem. Then, we propose a modified method of DQN which reuses the best target network so far when the performance of learning suddenly decreases" (Sasaki et al. Section 1, Second paragraph).
Regarding Claim 5,
Shalev-Shwartz et al. in view of Sasaki et al., and further in view of Wang et al. teaches the system of claim 1.
Additionally, Sasaki et al. further teaches wherein the neural network is configured to generate a training target by: selecting an action, a*, that results in maximum value of Q (si,a)Tω from the set of all possible actions (a* = argmaxaQ(si,a)Tω) (Fig. 2; Section 2, Third paragraph: "compute the target yj = r + γ maxa Qˆ(s’, a’; θi-) at iteration i" teaches that the calculation of the training target requires the selection of an action a' (a*) to satisfy 

[Equation image: media_image20.png]

which can be read as an action a' to satisfy argmaxaQ(si,a)Tθi (where θ is ω)); and 
setting the training target for the neural network as Q (si,a*)Tω (Section 2, Third paragraph: "compute the target yj = r + γ maxa Qˆ(s’, a’; θi-) at iteration i" teaches that the training target is set as a function of 

[Equation image: media_image21.png]

which effectively is Q(si,a')Tθi (where θ is ω)).
Shalev-Shwartz et al., Sasaki et al., and Wang et al. are analogous to the claimed invention because they are directed to reinforcement learning for control of an object.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate wherein the neural network is configured to generate a training target by: selecting an action, a*, that results in maximum value of Q (si,a)Tω from the set of all possible actions (a* = argmaxaQ(si,a)Tω); and setting the training target for the neural network as Q (si,a*)Tω as taught by Sasaki et al. to the disclosed invention of Shalev-Shwartz et al. in view of Wang et al.
One of ordinary skill in the art would have been motivated to make this modification to increase learning performance because "the original DQN does not have good learning performance in our robot navigation problem. Then, we propose a modified method of DQN which reuses the best target network so far when the performance of learning suddenly decreases" (Sasaki et al. Section 1, Second paragraph).
Regarding Claim 6,
Shalev-Shwartz et al. in view of Sasaki et al., and further in view of Wang et al. teaches the system of claim 1.
Additionally, Wang et al. further teaches wherein the at least some of the parameters of the neural network are updated using a gradient descent that minimizes a mean square error (MSE) between the training target and the policy value Q(si,ai) for the corresponding state-action pair in the sample data set D (Section IV, fifth paragraph: "The parameter update of the Q-network is conducted through gradient descent" teaches that the parameters of the Q-network (neural network) are updated using gradient descent. Section III. B, second paragraph: "The other part is the Q-network update (right part in Fig.3) where the loss between predicted Q-values and target Q-values is used to update Q-network parameters θ" teaches that the parameters are updated according to a loss (e.g. to minimize the error) between the target Q-values (training target) and the predicted Q-values (Q(si,ai))). Section III. B, sixth paragraph: "The loss function is defined by the mean square error …" teaches that the loss is defined by a mean square error).
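For illustration only, a gradient-descent update that minimizes a mean square error between fixed training targets and predicted values can be sketched as follows. This is a minimal sketch under stated assumptions: the linear model, the 1/n normalization, the learning rate, and the data are hypothetical and are not taken from Wang et al. or from the claimed equation.

```python
# Illustrative sketch only: linear predictions w . x stand in for Q-values.
def dot(weights, features):
    """Inner product of two equal-length lists."""
    return sum(w * x for w, x in zip(weights, features))


def mse(weights, feature_rows, targets):
    """Mean of squared differences between targets and predictions."""
    n = len(targets)
    return sum((t - dot(weights, x)) ** 2
               for x, t in zip(feature_rows, targets)) / n


def gradient_step(weights, feature_rows, targets, learning_rate):
    """One gradient-descent step on the MSE, with targets held fixed."""
    n = len(targets)
    grad = [0.0] * len(weights)
    for x, t in zip(feature_rows, targets):
        error = dot(weights, x) - t
        for j, x_j in enumerate(x):
            grad[j] += 2.0 * error * x_j / n
    return [w - learning_rate * g for w, g in zip(weights, grad)]


# Made-up features and targets for two samples.
feature_rows = [[1.0, 0.0], [0.0, 1.0]]
targets = [1.0, 2.0]

weights = [0.0, 0.0]
loss_before = mse(weights, feature_rows, targets)
for _ in range(200):
    weights = gradient_step(weights, feature_rows, targets, learning_rate=0.1)
loss_after = mse(weights, feature_rows, targets)
```

In this toy setting each step shrinks the error, so the weights approach the targets and the loss falls from its starting value toward zero.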
Shalev-Shwartz et al., Sasaki et al., and Wang et al. are analogous to the claimed invention because they are directed to reinforcement learning for control of an object.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate wherein the at least some of the parameters of the neural network are updated using a gradient descent that minimizes a mean square error (MSE) between the training target and the policy value Q(si,ai) for the corresponding state-action pair in the sample data set D as taught by Wang et al. to the disclosed invention of Shalev-Shwartz et al. in view of Sasaki et al.
One of ordinary skill in the art would have been motivated to make this modification "to incorporate the influence of historical and interactive driving behaviors on the action selection [in Deep Reinforcement Learning]" (Wang et al. Section V, First paragraph).
Regarding Claim 7,
Shalev-Shwartz et al. in view of Sasaki et al., and further in view of Wang et al. teaches the system of claim 6.
Additionally, Wang et al. further teaches wherein the MSE is minimized using a least mean square (LMS) algorithm (Fig. 3; Section III. B, sixth paragraph: "The loss function is defined by the mean square error between predicted Q-values QPt and target Q-values QTt, equation (3)" 

    [Equation image (media_image22.png): Wang et al., equation (3)]

teaches that the mean square error (MSE) is minimized using a least mean square algorithm).
Shalev-Shwartz et al., Sasaki et al., and Wang et al. are analogous to the claimed invention because they are directed to reinforcement learning for control of an object.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate wherein the MSE is minimized using a least mean square (LMS) algorithm as taught by Wang et al. to the disclosed invention of Shalev-Shwartz et al. in view of Sasaki et al.
One of ordinary skill in the art would have been motivated to make this modification "to incorporate the influence of historical and interactive driving behaviors on the action selection [in Deep Reinforcement Learning]" (Wang et al. Section V, First paragraph).
Regarding Claim 8,
Shalev-Shwartz et al. in view of Sasaki et al., and further in view of Wang et al. teaches the system of claim 6.
Additionally, Wang et al. further teaches wherein the MSE is defined in accordance with the following equation:

    [Equation image (media_image23.png): MSE equation recited in the claim]

wherein n is the number of tuples in the sample data set D, Q*(si,a*)Tω is the training target and Q (si,ai) is the policy value for the corresponding state-action pair in the sample data set D (Fig. 3; Section III. B, sixth paragraph: "The loss function is defined by the mean square error between predicted Q-values QPt and target Q-values QTt, equation (3). QTt is calculated by the immediate reward rt and the maximum Q-value of the next internal state st+1" 

    [Equation image (media_image22.png): Wang et al., equation (3)]

teaches that the mean square error (MSE) is minimized using a least mean square algorithm, where N is the total number of samples (tuples) in the training dataset, QT is the training target (Q*(si,a*)Tω), and QP is the predicted policy value (Q (si,ai)) as shown by the following equations:

    [Equation image (media_image24.png): equations cited from Wang et al.]
), and 
wherein the sum is first over the states in the sample data set and then over all the actions (Fig. 3; teaches that the iterative process and summation is performed for each state-action pair (st, at), where the states st are first followed by the actions at related to the states st).
Shalev-Shwartz et al., Sasaki et al., and Wang et al. are analogous to the claimed invention because they are directed to reinforcement learning for control of an object.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate wherein the MSE is defined in accordance with the following equation:
    [Equation image (media_image23.png): MSE equation recited in the claim]
wherein n is the number of tuples in the sample data set D, Q*(si,a*)Tω is the training target and Q (si,ai) is the policy value for the corresponding state-action pair in the sample data set D, and wherein the sum is first over the states in the sample data set and then over all the actions as taught by Wang et al. to the disclosed invention of Shalev-Shwartz et al. in view of Sasaki et al.
One of ordinary skill in the art would have been motivated to make this modification "to incorporate the influence of historical and interactive driving behaviors on the action selection [in Deep Reinforcement Learning]" (Wang et al. Section V, First paragraph).
Regarding Claim 9,
Shalev-Shwartz et al. in view of Sasaki et al., and further in view of Wang et al. teaches the system of claim 1.
Additionally, Shalev-Shwartz et al. further teaches wherein the state of the object in the environment is sensed using one or more of cameras, LIDAR and RADAR ([0174]: "Such tasks may rely upon input from various sensors and sensing systems associated with the host vehicle. These inputs may include images or image streams from one or more onboard cameras, GPS position information, accelerometer outputs, user feedback, or user inputs to one or more user interface devices, radar, lidar, etc. Sensing, which may include data from cameras and/or any other available sensors, along with map information, may be collected, analyzed, and formulated into a "sensed state," describing information extracted from a scene in the environment of the host vehicle" teaches that the state may be sensed by one or more of cameras, radar, and lidar), 
wherein the current state of the object in the environment is described by one or more of images, LIDAR measurements and RADAR measurements ([0174]-[0175]: "While a sensed state may be developed based on image data received from one or more cameras or image sensors associated with a host vehicle, a sensed state for use in navigation may be developed using any suitable sensor or combination of sensors" teaches that the sensed state (current state) of the vehicle in the environment may be described by image data or data from image sensors (e.g. radar and lidar measurements) associated with the vehicle (object)).
Regarding Claim 10,
Shalev-Shwartz et al. in view of Sasaki et al., and further in view of Wang et al. teaches the system of claim 1.
Additionally, Shalev-Shwartz et al. further teaches wherein the action comprises any one or a combination of a steering angle for a steering unit, a throttle value for a throttle unit and braking value for a braking unit (Fig. 2F; [0115]: "FIG. 2F is a diagrammatic representation of exemplary vehicle control systems, consistent with the disclosed embodiments. As indicated in FIG. 2F, vehicle 200 may include throttling system 220, braking system 230, and steering system 240. System 100 may provide inputs (e.g., control signals) to one or more of throttling system 220, braking system 230, and steering system 240 over one or more data links (e.g., any wired and/or wireless link or links for transmitting data). For example, based on analysis of images acquired by image capture devices 122, 124, and/or 126, system 100 may provide control signals to one or more of throttling system 220, braking system 230, and steering system 240 to navigate vehicle 200 (e.g., by causing an acceleration, a turn, a lane shift, etc.)" teaches that the controls from the system 100 issued to the vehicle 200 (e.g. actions) can comprise inputs to one or more of throttling system 220, braking system 230, and steering system 240 including acceleration (throttle value and braking value) and turning (steering angle)).
Regarding Claim 11,
Shalev-Shwartz et al. in view of Sasaki et al., and further in view of Wang et al. teaches the system of claim 1.
	Additionally, Shalev-Shwartz et al. further teaches wherein the object is a vehicle, robot or drone (Fig. 2A; [081]: "System 100, or various components thereof, may be incorporated into various different platforms. In some embodiments, system 100 may be included on a vehicle 200, as shown in FIG. 2A" teaches that the object of the system may be a vehicle 200).
Regarding Claim 12,
Shalev-Shwartz et al. teaches a method of training a neural network ([0179]: "Training of the system using reinforcement learning may involve learning a driving policy in order to map from sensed states to navigational actions" teaches that the system 100 may be trained using a reinforcement learning method. [0309]: "For example, a deep-Q-network (DQN) learning algorithm may be used" teaches that a deep-Q-network (DQN) learning algorithm may be the reinforcement learning method. [0309]: "Instead, the Q function may be approximated by some function from a parametric hypothesis class (e.g., neural networks of a certain architecture)" teaches that the Q function of the DQN may be implemented using a neural network). 
Shalev-Shwartz et al. does not appear to explicitly teach (i) generating a first set of policy values Q(si,ai) for each state-action pair si, ai in a sample data set D {(si, ai, si+1,ri)} using an action-value function denoted the Q function, wherein si is a current state of the object in the environment, ai is the action chosen for the current state, si+1 is a subsequent state of the object and the environment and ri is a reward value for taking an action, ai, in a state, si, the value of which is determined in accordance with a reward function; (ii) generating a second set of policy values Q (si+1, a) for each subsequent state si+1 for all tuples in the sample data set D for each action in the set of all possible actions using the Q function; (iii) generating an approximate action-value function, denoted the Q* function, from the first set of policy values Q(si,ai) for the current state si and the action ai selected for the current state si and the second set of policy values Q (si+1, a) for the subsequent state si+1 after the selected action ai; (iv) generating a training target for the neural network using the Q* function; (v) calculating a training error as the difference between the training target and the policy value Q (si, ai) for the corresponding state-action pair in the sample data set D; and (vi) updating at least some of the parameters of the neural network to minimize the training error.
However, Sasaki et al. teaches … a sample data set D {(si, ai, si+1,ri)}… , wherein si is a current state of the object in the environment, ai is the action chosen for the current state, si+1 is a subsequent state of the object and the environment and ri is a reward value for taking an action, ai, in a state, si, the value of which is determined in accordance with a reward function (Section 2, Second paragraph: "In order to perform experience replay, the agent’s experiences et = (st, at, rt, st+1) at each time step t are stored in the data set D = {e1, ..., et}" teaches that the input data set can contain agent experiences et = (st, at, rt, st+1), where st is the current state, at is the action for the current state, st+1 is the next state, and rt is the reward for the current state. Section 3.3, Third paragraph: "The reward rt at time step t is defined by using the values of 8 distance sensors of the robot and the action selected by the robot as follows:

    [Equation image (media_image19.png): reward function for rt, Sasaki et al., Section 3.3]
" teaches that the reward is determined based on a function based on the taken action in a given state); 
(iii) generating an approximate action-value function, denoted the Q* function, from the first set of policy values Q(si,ai) for the current state si and the action ai selected for the current state si and the second set of policy values Q (si+1, a) for the subsequent state si+1 after the selected action ai (Fig. 2; teaches that the Q^ (Q*) function is updated based on the Q function every few steps (iterations) of the algorithm. This means that, for example, after two steps, the Q^ function would be updated using the Q values (policy values) for the current/initial state st (si) and the selected action at (ai) along with the Q values for the subsequent state st+1 (si+1));  
(iv) generating a training target for the neural network using the Q* function (Fig. 2; Section 2, Third paragraph: "compute the target yj = r + γ maxa Qˆ(s’, a’; θi-) at iteration i" teaches calculating (generating) a training target by using the Q^ (Q*) function); 
(v) calculating a training error as the difference between the training target and the policy value Q (si, ai) for the corresponding state-action pair in the sample data set D (Fig. 2; Section 2, Third and Fourth paragraphs: "Note that the term r + γ maxa Qˆ(s’, a’; θi) − Q(s, a; θi) in the above loss function means TD error. This TD error depends on the network parameters θi at iteration i" teaches that the error is the difference between the training target and the policy value Q(s,a) for θi (i.e. Q(si, ai) for the state action pair corresponding to the network parameters θi)); and 
(vi) updating at least some of the parameters of the neural network to minimize the training error (Fig. 2; Section 2, Fourth paragraph: "Note that the term r + γ maxa Qˆ(s’, a’; θi) − Q(s, a; θi) in the above loss function means TD error. This TD error depends on the network parameters θi at iteration i … After the learning of action-value function based on this approach during C steps, the target network parameters θ− are updated" teaches that the neural network parameters θ are updated to minimize the error).
Shalev-Shwartz et al. and Sasaki et al. are analogous to the claimed invention because they are directed to reinforcement learning for control of an object.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate … a sample data set D {(si, ai, si+1,ri)}… , wherein si is a current state of the object in the environment, ai is the action chosen for the current state, si+1 is a subsequent state of the object and the environment and ri is a reward value for taking an action, ai, in a state, si, the value of which is determined in accordance with a reward function; (iii) generating an approximate action-value function, denoted the Q* function, from the first set of policy values Q(si,ai) for the current state si and the action ai selected for the current state si and the second set of policy values Q (si+1, a) for the subsequent state si+1 after the selected action ai; (iv) generating a training target for the neural network using the Q* function; (v) calculating a training error as the difference between the training target and the policy value Q (si, ai) for the corresponding state-action pair in the sample data set D; and (vi) updating at least some of the parameters of the neural network to minimize the training error as taught by Sasaki et al. to the disclosed invention of Shalev-Shwartz et al. 
One of ordinary skill in the art would have been motivated to make this modification to increase learning performance because "the original DQN does not have good learning performance in our robot navigation problem. Then, we propose a modified method of DQN which reuses the best target network so far when the performance of learning suddenly decreases" (Sasaki et al. Section 1, Second paragraph).
Shalev-Shwartz et al. in view of Sasaki et al. does not appear to explicitly teach (i) generating a first set of policy values Q(si,ai) for each state-action pair si, ai in a sample data set D {(si, ai, si+1,ri)} using an action-value function denoted the Q function, and (ii) generating a second set of policy values Q (si+1, a) for each subsequent state si+1 for all tuples in the sample data set D for each action in the set of all possible actions using the Q function.
However, Wang et al. teaches (i) generating a first set of policy values Q(si,ai) for each state-action pair si, ai in a sample data set … using an action-value function denoted the Q function (Fig. 3; Section III. B, second paragraph: "One is the Q-value approximation for action selection (left part in Fig. 3) in which the internal state st, … is used as the input to the Q-network to get the chosen action at" teaches that the Q values (policy values) are generated for the state st (si) using the Q-network (Q function)), and 
(ii) generating a second set of policy values Q (si+1, a) for each subsequent state si+1 for all tuples in the sample data set D for each action in the set of all possible actions using the Q function (Fig. 3; Section III. B, second paragraph: "One is the Q-value approximation for action selection (left part in Fig. 3) in which the internal state st, … is used as the input to the Q-network to get the chosen action at" teaches that the Q values (policy values) are generated for the state st (si) using the Q-network (Q function). Fig. 3 further teaches that the process is repeated for st+1 (si+1), so the Q values (policy values) will also be generated for Q(st+1,a)).
Shalev-Shwartz et al., Sasaki et al., and Wang et al. are analogous to the claimed invention because they are directed to reinforcement learning for control of an object.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate (i) generating a first set of policy values Q(si,ai) for each state-action pair si, ai in a sample data set … using an action-value function denoted the Q function, and (ii) generating a second set of policy values Q (si+1, a) for each subsequent state si+1 for all tuples in the sample data set D for each action in the set of all possible actions using the Q function as taught by Wang et al. to the disclosed invention of Shalev-Shwartz et al. in view of Sasaki et al.
One of ordinary skill in the art would have been motivated to make this modification "to incorporate the influence of historical and interactive driving behaviors on the action selection [in Deep Reinforcement Learning]" (Wang et al. Section V, First paragraph).
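For illustration only, operations (iv) to (vi) as mapped above to the passages quoted from Sasaki et al. (target yj = r + γ max Qˆ, TD error, parameter update) can be sketched as follows. This is a minimal sketch under stated assumptions and not the algorithm of any cited reference: a lookup table stands in for the neural network, the action set `ACTIONS`, the discount `GAMMA`, the learning rate, and the function names are all hypothetical, and tuples are ordered (si, ai, si+1, ri) as in the claimed sample data set D.

```python
GAMMA = 0.9          # assumed discount factor
ACTIONS = (0, 1)     # assumed set of all possible actions


def td_targets(q_hat, dataset, gamma=GAMMA):
    """Training target per tuple: r_i + gamma * max_a Q_hat(s_{i+1}, a)."""
    return [
        r + gamma * max(q_hat.get((s_next, a), 0.0) for a in ACTIONS)
        for (_s, _a, s_next, r) in dataset
    ]


def td_errors(q, dataset, targets):
    """Training error per tuple: target minus current Q(s_i, a_i)."""
    return [
        y - q.get((s, a), 0.0)
        for (s, a, _s_next, _r), y in zip(dataset, targets)
    ]


def update(q, dataset, errors, learning_rate=0.5):
    """Move each Q(s_i, a_i) toward its target to reduce the training error."""
    new_q = dict(q)
    for (s, a, _s_next, _r), e in zip(dataset, errors):
        new_q[(s, a)] = new_q.get((s, a), 0.0) + learning_rate * e
    return new_q
```

In this sketch `q_hat` plays the role of the separate target function (Qˆ in the quoted passages) and `q` the function being trained; copying `q` into `q_hat` every few iterations would correspond to the periodic target update described for Fig. 2 of Sasaki et al.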
Regarding Claim 13,
Shalev-Shwartz et al. in view of Sasaki et al., and further in view of Wang et al. teaches the method of claim 12.
Additionally, Sasaki et al. further teaches wherein the operations (iii) to (vi) are repeated for each tuple in the sample data set D (Fig. 2; teaches that the operations are repeated using the tuple for each time t and each episode M in the sample dataset).
Shalev-Shwartz et al., Sasaki et al., and Wang et al. are analogous to the claimed invention because they are directed to reinforcement learning for control of an object.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate wherein the operations (iii) to (vi) are repeated for each tuple in the sample data set D as taught by Sasaki et al. to the disclosed invention of Shalev-Shwartz et al. in view of Wang et al.
One of ordinary skill in the art would have been motivated to make this modification to increase learning performance because "the original DQN does not have good learning performance in our robot navigation problem. Then, we propose a modified method of DQN which reuses the best target network so far when the performance of learning suddenly decreases" (Sasaki et al. Section 1, Second paragraph).
Regarding Claim 16,
Shalev-Shwartz et al. in view of Sasaki et al., and further in view of Wang et al. teaches the method of claim 12.
Additionally, Sasaki et al. further teaches wherein generating the training target comprises: selecting an action, a*, that results in maximum value of Q (si,a)Tω from the set of all possible actions (a* = argmaxaQ(si,a)Tω) (Fig. 2; Section 2, Third paragraph: "compute the target yj = r + γ maxa Qˆ(s’, a’; θi-) at iteration i" teaches that the calculation of the training target requires the selection of an action a' (a*) to satisfy 

    [Equation image (media_image20.png): expression from Sasaki et al., Section 2]

which can be read as an action a' to satisfy argmaxaQ(si,a)Tθi (where θ is ω)); and 
setting the training target for the neural network as Q (si,a*)Tω (Section 2, Third paragraph: "compute the target yj = r + γ maxa Qˆ(s’, a’; θi-) at iteration i" teaches that the training target is set as a function of 

    [Equation image (media_image21.png): expression from Sasaki et al., Section 2]

which effectively is Q(si,a')Tθi (where θ is ω)).
Shalev-Shwartz et al., Sasaki et al., and Wang et al. are analogous to the claimed invention because they are directed to reinforcement learning for control of an object.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate wherein generating the training target comprises: selecting an action, a*, that results in maximum value of Q (si,a)Tω from the set of all possible actions (a* = argmaxaQ(si,a)Tω); and setting the training target for the neural network as Q (si,a*)Tω as taught by Sasaki et al. to the disclosed invention of Shalev-Shwartz et al. in view of Wang et al.
One of ordinary skill in the art would have been motivated to make this modification to increase learning performance because "the original DQN does not have good learning performance in our robot navigation problem. Then, we propose a modified method of DQN which reuses the best target network so far when the performance of learning suddenly decreases" (Sasaki et al. Section 1, Second paragraph).
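For illustration only, the selection of a maximizing action and its use in the training target, as discussed above, can be sketched as follows. This is a minimal sketch, not code from Sasaki et al. or the application: the Q values are assumed to be supplied as a dictionary keyed by action, the target follows the form quoted from Sasaki et al. (reward plus discounted maximum Q value), and the names `greedy_action` and `training_target` are hypothetical.

```python
def greedy_action(q_values):
    """Return a* = argmax_a Q(s, a) for a dict mapping action -> Q value."""
    return max(q_values, key=q_values.get)


def training_target(reward, next_q_values, gamma):
    """Target of the quoted form r + gamma * max_a Q(s', a)."""
    a_star = greedy_action(next_q_values)
    return reward + gamma * next_q_values[a_star]
```

For example, with Q values {"left": 0.2, "right": 0.7}, `greedy_action` selects "right", and the target is formed from the Q value of that action.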
Regarding Claim 17,
Shalev-Shwartz et al. in view of Sasaki et al., and further in view of Wang et al. teaches the method of claim 12.
Additionally, Wang et al. further teaches wherein the at least some of the parameters of the neural network are updated using a gradient descent that minimizes a mean square error (MSE) between the training target and the policy value Q(si,ai) for the corresponding state-action pair in the sample data set D (Section IV, fifth paragraph: "The parameter update of the Q-network is conducted through gradient descent" teaches that the parameters of the Q-network (neural network) are updated using gradient descent. Section III. B, second paragraph: "The other part is the Q-network update (right part in Fig.3) where the loss between predicted Q-values and target Q-values is used to update Q-network parameters θ" teaches that the parameters are updated according to a loss (e.g. to minimize the error) between the target Q-values (training target) and the predicted Q-values (Q(si,ai)). Section III. B, sixth paragraph: "The loss function is defined by the mean square error …" teaches that the loss is defined by a mean square error).
Shalev-Shwartz et al., Sasaki et al., and Wang et al. are analogous to the claimed invention because they are directed to reinforcement learning for control of an object.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate wherein the at least some of the parameters of the neural network are updated using a gradient descent that minimizes a mean square error (MSE) between the training target and the policy value Q(si,ai) for the corresponding state-action pair in the sample data set D as taught by Wang et al. to the disclosed invention of Shalev-Shwartz et al. in view of Sasaki et al.
One of ordinary skill in the art would have been motivated to make this modification "to incorporate the influence of historical and interactive driving behaviors on the action selection [in Deep Reinforcement Learning]" (Wang et al. Section V, First paragraph).
Regarding Claim 18,
Shalev-Shwartz et al. in view of Sasaki et al., and further in view of Wang et al. teaches the method of claim 17.
Additionally, Wang et al. further teaches wherein the MSE is minimized using a least mean square (LMS) algorithm (Fig. 3; Section III. B, sixth paragraph: "The loss function is defined by the mean square error between predicted Q-values QPt and target Q-values QTt, equation (3)" 

    [Equation image (media_image22.png): Wang et al., equation (3)]

teaches that the mean square error (MSE) is minimized using a least mean square algorithm).
Shalev-Shwartz et al., Sasaki et al., and Wang et al. are analogous to the claimed invention because they are directed to reinforcement learning for control of an object.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate wherein the MSE is minimized using a least mean square (LMS) algorithm as taught by Wang et al. to the disclosed invention of Shalev-Shwartz et al. in view of Sasaki et al.
One of ordinary skill in the art would have been motivated to make this modification "to incorporate the influence of historical and interactive driving behaviors on the action selection [in Deep Reinforcement Learning]" (Wang et al. Section V, First paragraph).
Regarding Claim 19,
Shalev-Shwartz et al. in view of Sasaki et al., and further in view of Wang et al. teaches the method of claim 17.
Additionally, Wang et al. further teaches wherein the MSE is defined in accordance with the following equation: 

    [Equation image (media_image25.png): MSE equation recited in the claim]

wherein n is the number of tuples in the sample data set D, Q(si,a*)Tω is the training target and Q (si,ai) is the policy value for the corresponding state-action pair in the sample data set D (Fig. 3; Section III. B, sixth paragraph: "The loss function is defined by the mean square error between predicted Q-values QPt and target Q-values QTt, equation (3). QTt is calculated by the immediate reward rt and the maximum Q-value of the next internal state st+1" 

    [Equation image (media_image22.png): Wang et al., equation (3)]

teaches that the mean square error (MSE) is minimized using a least mean square algorithm, where N is the total number of samples (tuples) in the training dataset, QT is the training target (Q*(si,a*)Tω), and QP is the predicted policy value (Q (si,ai)) as shown by the following equations:

    [Equation image (media_image24.png): equations cited from Wang et al.]
), and 
wherein the sum is first over the states in the sample data set and then over all the actions (Fig. 3; teaches that the iterative process and summation is performed for each state-action pair (st, at), where the states st are first followed by the actions at related to the states st).
Shalev-Shwartz et al., Sasaki et al., and Wang et al. are analogous to the claimed invention because they are directed to reinforcement learning for control of an object.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate wherein the MSE is defined in accordance with the following equation: 
    [Equation image (media_image25.png): MSE equation recited in the claim]
wherein n is the number of tuples in the sample data set D, Q(si,a*)Tω is the training target and Q (si,ai) is the policy value for the corresponding state-action pair in the sample data set D, and wherein the sum is first over the states in the sample data set and then over all the actions as taught by Wang et al. to the disclosed invention of Shalev-Shwartz et al. in view of Sasaki et al.
One of ordinary skill in the art would have been motivated to make this modification "to incorporate the influence of historical and interactive driving behaviors on the action selection [in Deep Reinforcement Learning]" (Wang et al. Section V, First paragraph).
Regarding Claim 20,
Shalev-Shwartz et al. in view of Sasaki et al., and further in view of Wang et al. teaches the method of claim 12.
Additionally, Shalev-Shwartz et al. further teaches wherein the state of the object in the environment is sensed using one or more of cameras, LIDAR and RADAR ([0174]: "Such tasks may rely upon input from various sensors and sensing systems associated with the host vehicle. These inputs may include images or image streams from one or more onboard cameras, GPS position information, accelerometer outputs, user feedback, or user inputs to one or more user interface devices, radar, lidar, etc. Sensing, which may include data from cameras and/or any other available sensors, along with map information, may be collected, analyzed, and formulated into a "sensed state," describing information extracted from a scene in the environment of the host vehicle" teaches that the state may be sensed by one or more of cameras, radar, and lidar), 
wherein the current state of the object in the environment is described by one or more of images, LIDAR measurements and RADAR measurements ([0174]-[0175]: "While a sensed state may be developed based on image data received from one or more cameras or image sensors associated with a host vehicle, a sensed state for use in navigation may be developed using any suitable sensor or combination of sensors" teaches that the sensed state (current state) of the vehicle in the environment may be described by image data or data from image sensors (e.g. radar and lidar measurements) associated with the vehicle (object)).
Regarding Claim 21,
Shalev-Shwartz et al. in view of Sasaki et al., and further in view of Wang et al. teaches the method of claim 12.
	Additionally, Shalev-Shwartz et al. further teaches wherein the action comprises any one or a combination of a steering angle for a steering unit, a throttle value for a throttle unit and braking value for a braking unit (Fig. 2F; [0115]: "FIG. 2F is a diagrammatic representation of exemplary vehicle control systems, consistent with the disclosed embodiments. As indicated in FIG. 2F, vehicle 200 may include throttling system 220, braking system 230, and steering system 240. System 100 may provide inputs (e.g., control signals) to one or more of throttling system 220, braking system 230, and steering system 240 over one or more data links (e.g., any wired and/or wireless link or links for transmitting data). For example, based on analysis of images acquired by image capture devices 122, 124, and/or 126, system 100 may provide control signals to one or more of throttling system 220, braking system 230, and steering system 240 to navigate vehicle 200 (e.g., by causing an acceleration, a turn, a lane shift, etc.)" teaches that the controls from the system 100 issued to the vehicle 200 (e.g. actions) can comprise inputs to one or more of throttling system 220, braking system 230, and steering system 240 including acceleration (throttle value and braking value) and turning (steering angle)).
Regarding Claim 22,
Shalev-Shwartz et al. in view of Sasaki et al., and further in view of Wang et al. teaches the method of claim 12.
	Additionally, Shalev-Shwartz et al. further teaches wherein the object is a vehicle, robot or drone (Fig. 2A; [081]: "System 100, or various components thereof, may be incorporated into various different platforms. In some embodiments, system 100 may be included on a vehicle 200, as shown in FIG. 2A" teaches that the object of the system may be a vehicle 200).
Regarding Claim 23,
Shalev-Shwartz et al. teaches a non-transitory machine readable medium having tangibly stored thereon executable instructions for execution by a processor of a computing device ([024]: "Consistent with other disclosed embodiments, non-transitory computer-readable storage media may store program instructions, which are executed by at least one processing device and perform any of the methods described herein" teaches a non-transitory computer-readable storage media containing instructions that may be executed by one or more processors. Fig. 1; [065]: "FIG. 1 is a block diagram representation of a system 100 consistent with the exemplary disclosed embodiments. System 100 may include various components depending on the requirements of a particular implementation. In some embodiments, system 100 may include a processing unit 110" teaches that the one or more processors may be implemented on a system 100 (computing device)),
wherein the executable instructions, when executed by the processor of the computing device, cause the computing device to: … (Fig. 4; [072]: "Each memory 140, 150 may include software instructions that when executed by a processor …, may control operation of various aspects of system 100" teaches that the memory stores instructions for execution by the processor of the system 100 (computing device)). 
Shalev-Shwartz et al. does not appear to explicitly teach … (i) generating a first set of policy values Q(si,ai) for each state-action pair si, ai in a sample data set D {(si, ai, si+1,ri)} using an action-value function denoted the Q function, wherein si is a current state of the object in the environment, ai is the action 5chosen for the current state, si+1 is a subsequent state of the object and the environment and ri is a reward value for taking an action, ai, in a state, si, the value of which is determined in accordance with a reward function; (ii) generating a second set of policy values Q (si+1, a) for each subsequent state si+1 for all tuples in the sample data set D for each action in the set of all possible 10actions using the Q function; (iii) generating an approximate action-value function, denoted the Q* function, from the first set of policy values Q(si,ai) for the current state si and the action ai selected for the current state si and the second set of policy values Q (si+1, a) for the subsequent state si+1 after the selected action ai;  15(iv) generating a training target for the neural network using the Q* function; (v) calculating a training error as the difference between the training target and the policy value Q (si, ai) for the corresponding state-action pair in the sample data set D; and (vi) updating at least some of the parameters of the neural network to minimize 20the training error.
However, Sasaki et al. teaches … a sample data set D {(si, ai, si+1,ri)}… , wherein si is a current state of the object in the environment, ai is the action chosen for the current state, si+1 is a subsequent state of the object and the environment and ri is a reward value for taking an action, ai, in a state, si, the value of which is determined in accordance with a reward function (Section 2, Second paragraph: "In order to perform experience replay, the agent’s experiences et = (st, at, rt, st+1) at each time step t are stored in the data set D = {e1, ..., et}" teaches that the input data set can contain agent experiences et = (st, at, rt, st+1), where st is the current state, at is the action for the current state, st+1 is the next state, and rt is the reward for the current state. Section 3.3, Third paragraph: "The reward rt at time step t is defined by using the values of 8 distance sensors of the robot and the action selected by the robot as follows: [equation image: media_image19.png]" teaches that the reward is determined by a function of the action taken in a given state); 
(iii) generating an approximate action-value function, denoted the Q* function, from the first set of policy values Q(si,ai) for the current state si and the action ai selected for the current state si and the second set of policy values Q (si+1, a) for the subsequent state si+1 after the selected action ai (Fig. 2; teaches that the Q^ (Q*) function is updated based on the Q function every few steps (iterations) of the algorithm. This means that, for example, after two steps, the Q^ function would be updated using the Q values (policy values) for the current/initial state st (si) and the selected action at (ai) along with the Q values for the subsequent state st+1 (si+1));  
(iv) generating a training target for the neural network using the Q* function (Fig. 2; Section 2, Third paragraph: "compute the target yj = r + γ maxa Qˆ(s’, a’; θi-) at iteration i" teaches calculating (generating) a training target by using the Q^ (Q*) function); 
(v) calculating a training error as the difference between the training target and the policy value Q (si, ai) for the corresponding state-action pair in the sample data set D (Fig. 2; Section 2, Third and Fourth paragraphs: "Note that the term r + γ maxa Qˆ(s’, a’; θi) − Q(s, a; θi) in the above loss function means TD error. This TD error depends on the network parameters θi at iteration i" teaches that the error is the difference between the training target and the policy value Q(s,a) for θi (i.e. Q(si, ai) for the state action pair corresponding to the network parameters θi)); and 
(vi) updating at least some of the parameters of the neural network to minimize the training error (Fig. 2; Section 2, Fourth paragraph: "Note that the term r + γ maxa Qˆ(s’, a’; θi) − Q(s, a; θi) in the above loss function means TD error. This TD error depends on the network parameters θi at iteration i … After the learning of action-value function based on this approach during C steps, the target network parameters θ− are updated" teaches that the neural network parameters θ are updated to minimize the error).
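For illustration only, the relationship between the training target of step (iv) and the training error of step (v) can be sketched in Python as follows. This is a minimal sketch and not code from Sasaki et al. or from the application: the function name `td_targets_and_errors`, the callables `q_online` and `q_target`, and the default discount factor are hypothetical, and tuples are assumed to follow the claimed order (si, ai, si+1, ri).

```python
import numpy as np

def td_targets_and_errors(q_online, q_target, batch, gamma=0.9):
    """Compute training targets and TD errors for a batch of tuples.

    q_online, q_target: hypothetical callables mapping a state to an array
        of Q-values, one entry per action (the Q and Q^ functions).
    batch: iterable of (s, a, s_next, r) tuples in the claimed order.
    """
    targets, errors = [], []
    for s, a, s_next, r in batch:
        # Step (iv): target y = r + gamma * max_a' Q^(s', a')
        y = r + gamma * np.max(q_target(s_next))
        targets.append(y)
        # Step (v): error = target minus the current estimate Q(s, a)
        errors.append(y - q_online(s)[a])
    return np.array(targets), np.array(errors)
```

Step (vi) would then adjust the parameters of the online network to reduce these errors, for example by gradient descent on their squared value.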
Shalev-Shwartz et al. and Sasaki et al. are analogous to the claimed invention because they are directed to reinforcement learning for control of an object.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate … a sample data set D {(si, ai, si+1,ri)}… , wherein si is a current state of the object in the environment, ai is the action chosen for the current state, si+1 is a subsequent state of the object and the environment and ri is a reward value for taking an action, ai, in a state, si, the value of which is determined in accordance with a reward function; (iii) generating an approximate action-value function, denoted the Q* function, from the first set of policy values Q(si,ai) for the current state si and the action ai selected for the current state si and the second set of policy values Q (si+1, a) for the subsequent state si+1 after the selected action ai; (iv) generating a training target for the neural network using the Q* function; (v) calculating a training error as the difference between the training target and the policy value Q (si, ai) for the corresponding state-action pair in the sample data set D; and (vi) updating at least some of the parameters of the neural network to minimize the training error as taught by Sasaki et al. to the disclosed invention of Shalev-Shwartz et al. 
One of ordinary skill in the art would have been motivated to make this modification to increase learning performance because "the original DQN does not have good learning performance in our robot navigation problem. Then, we propose a modified method of DQN which reuses the best target network so far when the performance of learning suddenly decreases" (Sasaki et al. Section 1, Second paragraph).
Shalev-Shwartz et al. in view of Sasaki et al. does not appear to explicitly teach (i) generating a first set of policy values Q(si,ai) for each state-action pair si, ai in a sample data set D {(si, ai, si+1,ri)} using an action-value function denoted the Q function, and (ii) generating a second set of policy values Q (si+1, a) for each subsequent state si+1 for all tuples in the sample data set D for each action in the set of all possible actions using the Q function.
However, Wang et al. teaches … (i) generating a first set of policy values Q(si,ai) for each state-action pair si, ai in a sample data set … using an action-value function denoted the Q function (Fig. 3; Section B, second paragraph: "One is the Q-value approximation for action selection (left part in Fig. 3) in which the internal state st, … is used as the input to the Q-network to get the chosen action at " teaches that the Q values (policy values) are generated for the state st (si) using the Q-network (Q function)), and 
(ii) generating a second set of policy values Q (si+1, a) for each subsequent state si+1 for all tuples in the sample data set D for each action in the set of all possible actions using the Q function (Fig. 3; Section III. B, second paragraph: "One is the Q-value approximation for action selection (left part in Fig. 3) in which the internal state st, … is used as the input to the Q-network to get the chosen action at" teaches that the Q values (policy values) are generated for the state st (si) using the Q-network (Q function). Fig. 3 further teaches that the process is repeated for st+1 (si+1), so the Q values (policy values) will also be generated for Q(st+1,a)).
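For illustration only, the two sets of policy values recited in steps (i) and (ii) can be sketched in Python as follows. This is a minimal sketch under stated assumptions and not code from Wang et al.: `policy_value_sets` and `q_fn` are hypothetical names, `q_fn` is assumed to return one Q-value per possible action for a given state, and tuples follow the claimed order (si, ai, si+1, ri).

```python
import numpy as np

def policy_value_sets(q_fn, dataset):
    """Return the first and second sets of policy values for a data set D.

    q_fn: hypothetical callable mapping a state to an array of Q-values,
        one entry per possible action.
    dataset: sequence of (s, a, s_next, r) tuples.
    """
    # Step (i): Q(s_i, a_i) for the action actually taken in each tuple
    first = np.array([q_fn(s)[a] for s, a, s_next, r in dataset])
    # Step (ii): Q(s_{i+1}, a) for every possible action a in each tuple
    second = np.array([q_fn(s_next) for s, a, s_next, r in dataset])
    return first, second
```

The first result has one value per tuple; the second has one row per tuple and one column per action.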
Shalev-Shwartz et al., Sasaki et al., and Wang et al. are analogous to the claimed invention because they are directed to reinforcement learning for control of an object.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate (i) generating a first set of policy values Q(si,ai) for each state-action pair si, ai in a sample data set … using an action-value function denoted the Q function, and (ii) generating a second set of policy values Q (si+1, a) for each subsequent state si+1 for all tuples in the sample data set D for each action in the set of all possible actions using the Q function as taught by Wang et al. to the disclosed invention of Shalev-Shwartz et al. in view of Sasaki et al.
One of ordinary skill in the art would have been motivated to make this modification "to incorporate the influence of historical and interactive driving behaviors on the action selection [in Deep Reinforcement Learning]" (Wang et al. Section V, First paragraph).

Claims 3, 4, 14, and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Shalev-Shwartz et al. (WO 2017120336 A2) in view of Sasaki et al. ("A Study on Vision-based Mobile Robot Learning by Deep Q-network") in view of Wang et al. ("Formulation of Deep Reinforcement Learning Architecture Toward Autonomous Driving for On-Ramp Merge"), and further in view of Yao et al. ("Approximate Policy Iteration with Linear Action Models").
Regarding Claim 3,
Shalev-Shwartz et al. in view of Sasaki et al., and further in view of Wang et al. teaches the system of claim 1.
Additionally, Wang et al. further teaches for each tuple in the sample data set D: selecting an action, a*, that results in maximum value of Q (si+1,a) from the set of all possible actions (a* = argmaxaQ (si+1,a)) (Fig. 3; Section III. B, sixth paragraph: "QTt is calculated by the immediate reward rt and the maximum Q-value of the next internal state st+1" teaches that the generation of the training target (Qt) using the Q* function involves finding an action a' (a*) that maximizes [expression image: media_image26.png]).
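For illustration only, the selection a* = argmaxaQ (si+1,a) can be sketched in Python as follows. This is a minimal sketch and not code from Wang et al.: the function name `greedy_next_actions` is hypothetical, and its input is assumed to be an array with one row per tuple in D and one column per possible action.

```python
import numpy as np

def greedy_next_actions(q_next):
    """Select a* = argmax_a Q(s_{i+1}, a) for each tuple.

    q_next: array-like of shape (n_tuples, n_actions) holding Q(s_{i+1}, a).
    Returns the index of the maximizing action for each row.
    """
    q_next = np.asarray(q_next, dtype=float)
    return np.argmax(q_next, axis=1)
```

When several actions tie for the maximum, `np.argmax` returns the first of them.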
Shalev-Shwartz et al., Sasaki et al., and Wang et al. are analogous to the claimed invention because they are directed to reinforcement learning for control of an object.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate for each tuple in the sample data set D: selecting an action, a*, that results in maximum value of Q (si+1,a) from the set of all possible actions (a* = argmaxaQ (si+1,a)) as taught by Wang et al. to the disclosed invention of Shalev-Shwartz et al. in view of Sasaki et al.
One of ordinary skill in the art would have been motivated to make this modification "to incorporate the influence of historical and interactive driving behaviors on the action selection [in Deep Reinforcement Learning]" (Wang et al. Section V, First paragraph).
Shalev-Shwartz et al. in view of Sasaki et al., and further in view of Wang et al. does not appear to explicitly teach wherein the neural network is configured to generate the Q* function by: initializing a matrix A and a vector b; and updating the value of the matrix A and the vector b using the following equations

[equation image: media_image27.png]

wherein γ is a discount factor between 0 and 1; and calculating a weight vector ω according to the following equation:

[equation image: media_image28.png]

However, Yao et al. teaches wherein the neural network is configured to generate the Q* function by: initializing a matrix A and a vector b (Background section: Sixth paragraph: "LSTD aims at finding a linear-in-the-features approximation … to the value function … For this, it builds a matrix d × d matrix A and d-dimensional vector b … The matrix A and vector b can be built incrementally" teaches that matrix A and vector b are initialized as part of finding (generating) an approximation to the value function (Q* function)); and 
updating the value of the matrix A and the vector b using the following equations

[equation image: media_image27.png]

wherein γ is a discount factor between 0 and 1 (Algorithm 2: teaches the matrix A and the vector b are updated using the following equations:

[equation image: media_image29.png]

(where Φi is Q(si,ai) and Φ~i+1 is Q(si+1,a*)). Background section, second Paragraph: "γ ∈ (0, 1) is a discount factor" teaches that γ is a discount factor between 0 and 1); and 
calculating a weight vector ω according to the following equation:

[equation image: media_image28.png]

(Algorithm 2: teaches that the weight vector θ (ω) is calculated using the equation: [equation image: media_image30.png]).
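For illustration only, an incremental build of a matrix A and a vector b followed by a solve for the weight vector can be sketched in Python as follows. The equations relied on above are images and are not reproduced in this text, so the sketch assumes the commonly used least-squares temporal difference form (A accumulates φi(φi − γφ~i+1)ᵀ, b accumulates φi ri, and the weights solve Aω = b); it is not asserted to match the claimed equations or Algorithm 2 of Yao et al. term by term. The function name `lstd_weights` and the `ridge` initialization are hypothetical.

```python
import numpy as np

def lstd_weights(phi, phi_next, rewards, gamma=0.9, ridge=1e-6):
    """Accumulate A and b over a data set and solve A w = b.

    phi: array (n, d) of features for (s_i, a_i).
    phi_next: array (n, d) of features for (s_{i+1}, a*).
    rewards: array (n,) of rewards r_i.
    ridge: assumed small diagonal initialization of A to keep it invertible.
    """
    phi = np.asarray(phi, dtype=float)
    phi_next = np.asarray(phi_next, dtype=float)
    d = phi.shape[1]
    A = ridge * np.eye(d)   # initialize matrix A
    b = np.zeros(d)         # initialize vector b
    for f, f_next, r in zip(phi, phi_next, rewards):
        A += np.outer(f, f - gamma * f_next)  # assumed standard LSTD update of A
        b += f * r                            # assumed standard LSTD update of b
    return np.linalg.solve(A, b)              # weight vector w with A w = b
```

With a single constant feature, a reward of 1 at every step, and γ = 0.5, the solved weight is 1 / (1 − γ) = 2, which is the discounted return of that constant reward stream.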
Shalev-Shwartz et al., Sasaki et al., Wang et al., and Yao et al. are analogous to the claimed invention because they are directed to reinforcement learning for control of an object.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate wherein the neural network is configured to generate the Q* function by: initializing a matrix A and a vector b; and updating the value of the matrix A and the vector b using the following equations
[equation image: media_image27.png]
wherein γ is a discount factor between 0 and 1; and calculating a weight vector ω according to the following equation:
[equation image: media_image28.png]
as taught by Yao et al. to the disclosed invention of Shalev-Shwartz et al. in view of Sasaki et al., and further in view of Wang et al.
One of ordinary skill in the art would have been motivated to make this modification to "reduce the variance of the value function estimates" (Yao et al. Conclusion section, First paragraph).
Regarding Claim 4,
Shalev-Shwartz et al. in view of Sasaki et al., and further in view of Wang et al. teaches the system of claim 2.
Shalev-Shwartz et al. in view of Sasaki et al., and further in view of Wang et al. does not appear to explicitly teach wherein the weight vector ω represents the weights of the nodes of the output layer of the neural network.
However, Yao et al. teaches wherein the weight vector ω represents the weights of the nodes of the output layer of the neural network (Algorithm 2: teaches that θ (ω) is a weight vector representing the output weights (i.e. weights of the output layer of the neural network)).
Shalev-Shwartz et al., Sasaki et al., Wang et al., and Yao et al. are analogous to the claimed invention because they are directed to reinforcement learning for control of an object.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate wherein the weight vector ω represents the weights of the nodes of the output layer of the neural network as taught by Yao et al. to the disclosed invention of Shalev-Shwartz et al. in view of Sasaki et al., and further in view of Wang et al.
One of ordinary skill in the art would have been motivated to make this modification to "reduce the variance of the value function estimates" (Yao et al. Conclusion section, First paragraph).
Regarding Claim 14,
Shalev-Shwartz et al. in view of Sasaki et al., and further in view of Wang et al. teaches the method of claim 12.
Additionally, Wang et al. further teaches for each tuple in the sample data set D: selecting an action, a*, that results in maximum value of Q (si+1,a) from the set of all possible actions (a* = argmaxaQ (si+1,a)) (Fig. 3; Section III. B, sixth paragraph: "QTt is calculated by the immediate reward rt and the maximum Q-value of the next internal state st+1" teaches that the generation of the training target (Qt) using the Q* function involves finding an action a' (a*) that maximizes [expression image: media_image26.png]).
Shalev-Shwartz et al., Sasaki et al., and Wang et al. are analogous to the claimed invention because they are directed to reinforcement learning for control of an object.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate for each tuple in the sample data set D: selecting an action, a*, that results in maximum value of Q (si+1,a) from the set of all possible actions (a* = argmaxaQ (si+1,a)) as taught by Wang et al. to the disclosed invention of Shalev-Shwartz et al. in view of Sasaki et al.
One of ordinary skill in the art would have been motivated to make this modification "to incorporate the influence of historical and interactive driving behaviors on the action selection [in Deep Reinforcement Learning]" (Wang et al. Section V, First paragraph).
Shalev-Shwartz et al. in view of Sasaki et al., and further in view of Wang et al. does not appear to explicitly teach wherein generating the Q* function comprises: initializing a matrix A and a vector b; and updating the value of the matrix A and the vector b using the following equations
[equation image: media_image31.png]
wherein γ is a discount factor between 0 and 1; and calculating a weight vector ω according to the following equation:
[equation image: media_image32.png]

However, Yao et al. teaches wherein generating the Q* function comprises: initializing a matrix A and a vector b (Background section: Sixth paragraph: "LSTD aims at finding a linear-in-the-features approximation … to the value function … For this, it builds a matrix d × d matrix A and d-dimensional vector b … The matrix A and vector b can be built incrementally" teaches that matrix A and vector b are initialized as part of finding (generating) an approximation to the value function (Q* function)); and 
updating the value of the matrix A and the vector b using the following equations

[equation image: media_image31.png]

wherein γ is a discount factor between 0 and 1 (Algorithm 2: teaches the matrix A and the vector b are updated using the following equations:

[equation image: media_image29.png]

(where Φi is Q(si,ai) and Φ~i+1 is Q(si+1,a*)). Background section, second Paragraph: "γ ∈ (0, 1) is a discount factor" teaches that γ is a discount factor between 0 and 1); and 
calculating a weight vector ω according to the following equation:

[equation image: media_image32.png]

(Algorithm 2: teaches that the weight vector θ (ω) is calculated using the equation: [equation image: media_image30.png]).
Shalev-Shwartz et al., Sasaki et al., Wang et al., and Yao et al. are analogous to the claimed invention because they are directed to reinforcement learning for control of an object.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate wherein generating the Q* function comprises: initializing a matrix A and a vector b; and updating the value of the matrix A and the vector b using the following equations
[equation image: media_image31.png]
wherein γ is a discount factor between 0 and 1; and calculating a weight vector ω according to the following equation:
[equation image: media_image32.png]
as taught by Yao et al. to the disclosed invention of Shalev-Shwartz et al. in view of Sasaki et al., and further in view of Wang et al.
One of ordinary skill in the art would have been motivated to make this modification to "reduce the variance of the value function estimates" (Yao et al. Conclusion section, First paragraph).
Regarding Claim 15,
Shalev-Shwartz et al. in view of Sasaki et al., in view of Wang et al., and further in view of Yao et al. teaches the method of claim 14.
Additionally, Yao et al. further teaches wherein the weight vector ω represents the weights of the nodes of the output layer of the neural network (Algorithm 2: teaches that θ (ω) is a weight vector representing the output weights (i.e. weights of the output layer of the neural network)).
Shalev-Shwartz et al., Sasaki et al., Wang et al., and Yao et al. are analogous to the claimed invention because they are directed to reinforcement learning for control of an object.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to incorporate wherein the weight vector ω represents the weights of the nodes of the output layer of the neural network as taught by Yao et al. to the disclosed invention of Shalev-Shwartz et al. in view of Sasaki et al., and further in view of Wang et al.
One of ordinary skill in the art would have been motivated to make this modification to "reduce the variance of the value function estimates" (Yao et al. Conclusion section, First paragraph).
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to BRIAN J HALES whose telephone number is (571)272-0878. The examiner can normally be reached M-Th 8:00am - 5:00pm and F 8:00am - 2:30pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kamran Afshar can be reached on (571) 272-7796. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/BRIAN J HALES/Examiner, Art Unit 2125                                                                                                                                                                                                        

/KAMRAN AFSHAR/Supervisory Patent Examiner, Art Unit 2125