DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art.  The broadest reasonable interpretation of a claim element (also commonly referred to as a claim limitation) is limited by the description in the specification when 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is invoked. 
As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:
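(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and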
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. 
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier.  Such claim limitation(s) is/are:
“data storage device that stores real world data...”
in claims 10 & 19.
A review of the specification shows that the following appears to be the corresponding structure for the above limitation described in the specification: (see at least Applicant Specification, para. [0041]: As will be appreciated, the data storage device 32 may be part of the controller 34, separate from the controller 34, or part of the controller 34 and part of a separate system.)
Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may:  (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
A claim that recites an abstract idea, a law of nature, or a natural phenomenon is directed to a judicial exception.  Abstract ideas include the following groupings of subject matter, when recited as such in a claim limitation: (a) Mathematical concepts – mathematical relationships, mathematical formulas or equations, mathematical calculations; (b) Certain methods of organizing human activity – fundamental economic principles or practices (including hedging, insurance, mitigating risk); commercial or legal interactions (including agreements in the form of contracts; legal obligations; advertising, marketing or sales activities or behaviors; business relations); managing personal behavior or relationships or interactions between people (including social activities, teaching, and following rules or instructions); and (c) Mental processes – concepts performed in the human mind (including an observation, evaluation, judgment, opinion). See the 2019 Revised Patent Subject Matter Eligibility Guidance.
Even when a judicial exception is recited in the claim, an additional claim element(s) that integrates the judicial exception into a practical application of that exception renders the claim eligible under §101.  A claim that integrates a judicial exception into a practical application will apply, rely on, or use the judicial exception in a manner that imposes a meaningful limit on the judicial exception, such that the claim is more than a drafting effort designed to monopolize the judicial exception. Examples in which the judicial exception has been integrated into a practical application include:
the additional element(s) reflects an improvement in the functioning of a computer, or an improvement to other technology or technical field; 
the additional element(s) applies or uses a judicial exception to effect a particular treatment or prophylaxis for a disease or medical condition; 
the additional element(s) implements a judicial exception with, or uses a judicial exception in conjunction with, a particular machine or manufacture that is integral to the claim; 
the additional element(s) effects a transformation or reduction of a particular article to a different state or thing; and 
the additional element(s) applies or uses the judicial exception in some other meaningful way beyond generally linking the use of the judicial exception to a particular technological environment, such that the claim as a whole is more than a drafting effort designed to monopolize the exception.  
Examples in which the judicial exception has not been integrated into a practical application include:
the additional element(s) merely recites the words ‘‘apply it’’ (or an equivalent) with the judicial exception, or merely includes instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea; 
the additional element(s) adds insignificant extra-solution activity to the judicial exception; and
the additional element does no more than generally link the use of a judicial exception to a particular technological environment or field of use.
See the 2019 Revised Patent Subject Matter Eligibility Guidance.
Claims 1, 10, & 19 recite storing real world data including a sequence of images of a road environment, the sequence of images generated based on a vehicle traversing the road environment. This limitation, as drafted, is a device, system, & process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer elements. The claim is practically able to be performed in the mind. For example, but for the “A method of training an autonomous vehicle, an autonomous vehicle, a data storage device, deep reinforcement learning agent, processor, and a training system” language, “the sequence of images generated based on a vehicle traversing the road environment” in the context of this claim encompasses the user taking pictures of the road environment around him. 
The limitation of processing, in an offline simulation environment, the sequence of images associated with a control feature to obtain an optimized set of control policies, as drafted, is a device, system & process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of the system. The claim is practically able to be performed in the mind. For example, but for the “A method of training an autonomous vehicle, an autonomous vehicle, a data storage device, deep reinforcement learning agent, processor, and a training system,” language, “processing, in an offline simulation environment, the sequence of images associated with a control feature to obtain an optimized set of control policies” in the context of this claim encompasses the user simulating the images in their head to obtain certain controls for the vehicle to avoid collision and maintain the vehicle on the path. 
 If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.
This judicial exception is not integrated into a practical application. In particular, the claim recites only the following additional elements: “A method of training an autonomous vehicle, an autonomous vehicle, a data storage device, deep reinforcement learning agent, processor, and a training system”. The devices are recited at a high level of generality (i.e., a device configured to train an autonomous vehicle) such that they amount to no more than mere instructions to apply the exception using generic computer components. Accordingly, these additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.
The claim(s) do not include additional elements that are sufficient to amount to significantly more than the judicial exception because, as discussed above with respect to integration of the abstract idea into a practical application, the additional elements of using “A method of training an autonomous vehicle, an autonomous vehicle, a data storage device, deep reinforcement learning agent, processor, and a training system” amount to no more than mere instructions to apply the exception using generic computer components. Mere instructions to apply an exception using generic computer components cannot provide an inventive concept. The claim is not patent eligible.

Similarly for claims 2, 11, & 20, obtaining a first image from the sequence of images and processing the first image to obtain an action, modifying a next image from the sequence of images based on the action, and determining the optimized set of control policies based on the modified next image, is a device & process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components. For example, “obtaining a first image from the sequence of images and processing the first image to obtain an action, modifying a next image from the sequence of images based on the action, and determining the optimized set of control policies based on the modified next image” in the context of this claim encompasses the user understanding and processing a first image that shows the surroundings, adjusting the vehicle based on the surroundings, and then analyzing the next surroundings as the vehicle travels in order to determine further actions. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.
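This judicial exception is not integrated into a practical application. In particular, the claim only recites additional elements. The claim(s) do not include additional elements that are sufficient to amount to significantly more than the judicial exception. The devices are recited at a high level of generality (i.e., device configured to train an autonomous vehicle) such that it amounts to no more than mere instructions to apply the exception using generic computer components. Accordingly, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea. Mere instructions to apply an exception using generic computer components cannot provide an inventive concept. The claim is not patent eligible.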

Likewise for claims 3 & 12, further comprising determining whether the modified next image depicts an unwanted driving behavior, and when the modified next image does not depict an unwanted driving behavior, processing the modified next image to obtain a next action, and when the modified next image does depict an unwanted driving behavior, processing the first image to obtain the next action, is a device & process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components. For example, “determining whether the modified next image depicts an unwanted driving behavior, and when the modified next image does not depict an unwanted driving behavior, processing the modified next image to obtain a next action, and when the modified next image does depict an unwanted driving behavior, processing the first image to obtain the next action” in the context of this claim encompasses the user processing a new image if there is no hazardous action happening and reusing the previous image or data to find a better action if the new image leads to an unsafe action. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.


Also for claims 4 & 13, computing a reward based on the modified next image, and wherein the processing the modified next image is based on the reward, is a device & process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components. For example, “computing a reward based on the modified next image, and wherein the processing the modified next image is based on the reward” in the context of this claim encompasses a user maintaining the same train of thought about what action to take based on the surroundings detected, so as to remain crash free and drive more efficiently. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.


Equally for claims 5 & 14, wherein the unwanted driving behavior comprises steering off the road, is a device & process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components. For example, “wherein the unwanted driving behavior comprises steering off the road” in the context of this claim encompasses classifying steering off the road as unsafe driving. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.
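This judicial exception is not integrated into a practical application. In particular, the claim only recites additional elements. The claim(s) do not include additional elements that are sufficient to amount to significantly more than the judicial exception. The devices are recited at a high level of generality (i.e., device configured to train an autonomous vehicle) such that it amounts to no more than mere instructions to apply the exception using generic computer components. Accordingly, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea. Mere instructions to apply an exception using generic computer components cannot provide an inventive concept. The claim is not patent eligible.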

Equally for claims 6 & 15, wherein the unwanted driving behavior comprises steering into an object, is a device & process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components. For example, “wherein the unwanted driving behavior comprises steering into an object” in the context of this claim encompasses classifying steering into an object as unsafe driving. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.
This judicial exception is not integrated into a practical application. In particular, the claim only recites additional elements. The claim(s) do not include additional elements that are sufficient to amount to significantly more than the judicial exception. The devices are recited at a high level of generality (i.e., device configured to train an autonomous vehicle) such that it amounts to no more than mere instructions to apply the exception using generic computer components. Accordingly, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea. Mere instructions to apply an exception using generic computer components cannot provide an inventive concept. The claim is not patent eligible.

 If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.
This judicial exception is not integrated into a practical application. In particular, the claim only recites additional elements. The claim(s) do not include additional elements that are sufficient to amount to significantly more than the judicial exception. The devices are recited at a high level of generality (i.e., device configured to train an autonomous vehicle) such that it amounts to no more than mere instructions to apply the exception using generic computer components. Accordingly, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea. Mere instructions to apply an exception using generic computer components cannot provide an inventive concept. The claim is not patent eligible.

Equally for claims 8 & 17, wherein the control feature includes steering control, is a device & process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.
This judicial exception is not integrated into a practical application. In particular, the claim only recites additional elements. The claim(s) do not include additional elements that are sufficient to amount to significantly more than the judicial exception. The devices are recited at a high level of generality (i.e., device configured to train an autonomous vehicle) such that it amounts to no more than mere instructions to apply the exception using generic computer components. Accordingly, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea. Mere instructions to apply an exception using generic computer components cannot provide an inventive concept. The claim is not patent eligible.

Equally for claims 9 & 18, wherein the action is associated with a steering angle of a steering system, is a device & process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components. For example, “wherein the action is associated with a steering angle of a steering system” in the context of this claim encompasses the action implemented on the vehicle being one that steers the vehicle. If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.
This judicial exception is not integrated into a practical application. In particular, the claim only recites additional elements. The claim(s) do not include additional elements that are sufficient to amount to significantly more than the judicial exception. The devices are recited at a high level of generality (i.e., device configured to train an autonomous vehicle) such that it amounts to no more than mere instructions to apply the exception using generic computer components. Accordingly, this additional element does not integrate the abstract idea into a practical application because it does not impose any meaningful limits on practicing the abstract idea. Mere instructions to apply an exception using generic computer components cannot provide an inventive concept. The claim is not patent eligible.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
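The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –
(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.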


(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claim(s) 1, 8-10, & 17-19 is/are rejected under 35 U.S.C. 102(a)(2) as being anticipated by US 2020/0065665A1 (“Nageshrao”).
As per claim 1, Nageshrao discloses
A method of training an autonomous vehicle, comprising:
storing (see at least Nageshrao, para. [0019]: The computing device 115 also includes nonvolatile memory such as is known. Computing device 115 can log information by storing the information in nonvolatile memory for later retrieval and transmittal via the vehicle communication network and a vehicle to infrastructure (V-to-I) interface 111 to a server computer 120 or user mobile device 160.), 
real world data including a sequence of images of a road environment (see at least Nageshrao, para. [0026]: A CNN for locating and identifying objects in color images 200 can include a plurality of convolutional layers interspersed with pooling layers followed by a plurality of convolutional layer interspersed with un-pooling layers to restore resolution using skip connections to convolutional layers while maintaining determined location and identity in intermediate data,...A CNN can be trained to determine locations and identities for connected regions of pixels in a color image 200, a process called image segmentation, by training the CNN using recorded images and ground truth regarding the locations and identities of objects and regions in color image 200. & para. [0028]: The simulated traffic scenes can be selected to reproduce a plurality of roadway configurations, traffic, lighting and weather conditions likely to be found in real-world environments, for example. An example of a software program that can be used to produce simulated traffic scenes is TORCS, available at torcs.sourceforge.net as of the date of filing this application. Because the color images 200 included in the simulated data include information from a near-realistic simulated environment, CNN processes the color images 200 as if they included real data from a real-world environment.), 
the sequence of images generated based on a vehicle traversing the road environment (see at least Nageshrao, para. [0024-0025]: The sensors 116 may further be used to collect data including dynamic vehicle 110 data related to operations of the vehicle 110 such as velocity, yaw rate, steering angle, engine speed, brake pressure, oil pressure, the power level applied to controllers 112, 113, 114 in the vehicle 110, connectivity between components, and accurate and timely performance of components of the vehicle 110. Color image 200 can be acquired by sensors 116 including video sensors. Color image 200 can be input by computing device 115 and processed to yield information regarding the real world nearby vehicle 110 for use in operating vehicle 110.);
processing, in an offline simulation environment (see at least Nageshrao, para. [0012]: The first NN can be adapted by retraining the first NN based on information from the periodically retrained second NN. The initial training of the first NN can be based on both safe buffer (state, safe action) pairs and unsafe buffer (state, unsafe action) pairs collected during offline simulation and training of the second NN is based on both safe buffer (state, safe action) pairs and unsafe buffer (state, unsafe action) collected during offline simulation and collected during driving. Operating a vehicle can be based on a vehicle action output from the first NN including determining a path polynomial based on the vehicle action output.), 
the sequence of images with a deep reinforcement learning agent associated with a control feature of the autonomous vehicle to obtain an optimized set of control policies (see at least Nageshrao, para. [0038]: FIG. 5 is a diagram of an example deep reinforcement learning (DRL) system 500 that incorporates DNN 400 in a system that can be trained to output 510 vehicle transition states in response to input 514 vehicle state values in an improved fashion by determining safe operation of vehicle 110 using a safety agent block 508 (SA). DRL system 500 includes a safety agent 508 that inputs the predicted vehicle transition states output from DNN block 400 and evaluates them for safety violations using a short horizon safety check as discussed above and, in examples where the vehicle transition states correspond to safety violations, replace the vehicle transition states with termination vehicle states in output 510.); and
training the autonomous vehicle based on the optimized set of control policies (see at least Nageshrao, para. [0033]: FIG. 4 is a diagram of an example DNN 400 that can calculate and output vehicle transition states 430 of output state layer 428 based on input vehicle states 404 of input state layer 402 and, during training, rewards 408 of rewards layer 406. DNN 400 includes hidden layers 416, 422 which respectively include nodes 418, 424 that are fully connected via interconnections 420, 426, 432 with input vehicle states 404, rewards 408 and output vehicle transition states 430. Interconnections 420, 426, 432 are means for transferring data to, from and between nodes 418, 428, where DNN 400 calculations occur. Each node 418 of hidden layer 416 can access all input vehicle states 404 and, during training, all rewards 408 for use in calculating intermediate states to be provided to nodes 424 of hidden layer 422 via interconnections 426. All nodes 424 of hidden layer 422 can access all intermediate states via interconnections 426 for use in calculating and transmitting output vehicle transition states 430 via interconnections 432 to output state layer 428. A vehicle transition state 430 is a set of data providing values describing a vehicle trajectory, e.g., a transition state can include predicted 3D pose, speed, and lateral and longitudinal acceleration data, and can be output to software programs to create a path polynomial 330 for operation of vehicle 110, for example.).


As per claim 8 Nageshrao discloses
wherein the control feature includes steering control of the autonomous vehicle (see at least Nageshrao, para. [0031]: Computing device 115 can operate vehicle 110 based on hierarchical layers of independent software programs that range from high level programs that determine high level tasks like “pick up occupant and transport to destination” or “return to service area” down through mid-level tasks like “turn right at next intersection” or “move to right lane” down to low-level tasks like “turn steering wheel a degrees, release brakes and apply b power for c seconds”. The output path polynomial 330 can be used by computing device 115 to operate vehicle 110 by controlling vehicle steering, brakes, and powertrain via controllers 112, 113, 114 to cause vehicle 110 to travel along path polynomial 330.).

As per claim 9 Nageshrao discloses
wherein the action is associated with a steering angle of a steering system of the autonomous vehicle (see at least Nageshrao, para. [0031]: Computing device 115 can operate vehicle 110 based on hierarchical layers of independent software programs that range from high level programs that determine high level tasks like “pick up occupant and transport to destination” or “return to service area” down through mid-level tasks like “turn right at next intersection” or “move to right lane” down to low-level tasks like “turn steering wheel a degrees, release brakes and apply b power for c seconds”. The output path polynomial 330 can be used by computing device 115 to operate vehicle 110 by controlling vehicle steering, brakes, and powertrain via controllers 112, 113, 114 to cause vehicle 110 to travel along path polynomial 330.).

As per claim 10 Nageshrao discloses
a data storage device (see at least Nageshrao, para. [0019]: The computing device 115 also includes nonvolatile memory such as is known. Computing device 115 can log information by storing the information in nonvolatile memory for later retrieval and transmittal via the vehicle communication network and a vehicle to infrastructure (V-to-I) interface 111 to a server computer 120 or user mobile device 160.) 
that stores real world data including a sequence of images of a road environment (see at least Nageshrao, para. [0026]: A CNN for locating and identifying objects in color images 200 can include a plurality of convolutional layers interspersed with pooling layers followed by a plurality of convolutional layer interspersed with un-pooling layers to restore resolution using skip connections to convolutional layers while maintaining determined location and identity in intermediate data,...A CNN can be trained to determine locations and identities for connected regions of pixels in a color image 200, a process called image segmentation, by training the CNN using recorded images and ground truth regarding the locations and identities of objects and regions in color image 200. & para. [0028]: The simulated traffic scenes can be selected to reproduce a plurality of roadway configurations, traffic, lighting and weather conditions likely to be found in real-world environments, for example. An example of a software program that can be used to produce simulated traffic scenes is TORCS, available at torcs.sourceforge.net as of the date of filing this application. Because the color images 200 included in the simulated data include information from a near-realistic simulated environment, CNN processes the color images 200 as if they included real data from a real-world environment.), 
the sequence of images generated based on a vehicle traversing the road environment (see at least Nageshrao, para. [0024]: The sensors 116 may be used to sense the environment in which the vehicle 110 is operating, e.g., sensors 116 can detect phenomena such as weather conditions (precipitation, external ambient temperature, etc.), the grade of a road, the location of a road (e.g., using road edges, lane markings, etc.), or locations of target objects such as neighboring vehicles 110. The sensors 116 may further be used to collect data including dynamic vehicle 110 data related to operations of the vehicle 110 such as velocity, yaw rate, steering angle, engine speed, brake pressure, oil pressure, the power level applied to controllers 112, 113, 114 in the vehicle 110, connectivity between components, and accurate and timely performance of components of the vehicle 110.);
a processor configured to process, in an offline simulation environment (see at least Nageshrao, para. [0012]: The first NN can be adapted by retraining the first NN based on information from the periodically retrained second NN. The initial training of the first NN can be based on both safe buffer (state, safe action) pairs and unsafe buffer (state, unsafe action) pairs collected during offline simulation and training of the second NN is based on both safe buffer (state, safe action) pairs and unsafe buffer (state, unsafe action) collected during offline simulation and collected during driving. Operating a vehicle can be based on a vehicle action output from the first NN including determining a path polynomial based on the vehicle action output.), 
the sequence of images with a deep reinforcement learning agent associated with a control feature of the autonomous vehicle to obtain an optimized set of control policies (see at least Nageshrao, para. [0038]: FIG. 5 is a diagram of an example deep reinforcement learning (DRL) system 500 that incorporates DNN 400 in a system that can be trained to output 510 vehicle transition states in response to input 514 vehicle state values in an improved fashion by determining safe operation of vehicle 110 using a safety agent block 508 (SA). DRL system 500 includes a safety agent 508 that inputs the predicted vehicle transition states output from DNN block 400 and evaluates them for safety violations using a short horizon safety check as discussed above and, in examples where the vehicle transition states correspond to safety violations, replace the vehicle transition states with termination vehicle states in output 510.), and 
train the autonomous vehicle based on the optimized set of control policies (see at least Nageshrao, para. [0038]: FIG. 5 is a diagram of an example deep reinforcement learning (DRL) system 500 that incorporates DNN 400 in a system that can be trained to output 510 vehicle transition states in response to input 514 vehicle state values in an improved fashion by determining safe operation of vehicle 110 using a safety agent block 508 (SA). DRL system 500 includes a safety agent 508 that inputs the predicted vehicle transition states output from DNN block 400 and evaluates them for safety violations using a short horizon safety check as discussed above and, in examples where the vehicle transition states correspond to safety violations, replace the vehicle transition states with termination vehicle states in output 510.).

As per claim 17 Nageshrao discloses
wherein the control feature includes steering control of the autonomous vehicle (see at least Nageshrao, para. [0031]: Computing device 115 can operate vehicle 110 based on hierarchical layers of independent software programs that range from high level programs that determine high level tasks like “pick up occupant and transport to destination” or “return to service area” down through mid-level tasks like “turn right at next intersection” or “move to right lane” down to low-level tasks like “turn steering wheel a degrees, release brakes and apply b power for c seconds”. The output path polynomial 330 can be used by computing device 115 to operate vehicle 110 by controlling vehicle steering, brakes, and powertrain via controllers 112, 113, 114 to cause vehicle 110 to travel along path polynomial 330.).

As per claim 18 Nageshrao discloses
wherein the action is associated with a steering angle of a steering system of the autonomous vehicle (see at least Nageshrao, para. [0031]: Computing device 115 can operate vehicle 110 based on hierarchical layers of independent software programs that range from high level programs that determine high level tasks like “pick up occupant and transport to destination” or “return to service area” down through mid-level tasks like “turn right at next intersection” or “move to right lane” down to low-level tasks like “turn steering wheel a degrees, release brakes and apply b power for c seconds”. The output path polynomial 330 can be used by computing device 115 to operate vehicle 110 by controlling vehicle steering, brakes, and powertrain via controllers 112, 113, 114 to cause vehicle 110 to travel along path polynomial 330.).

As per claim 19 Nageshrao discloses
one or more sensors that sense a road environment (see at least Nageshrao, para. [0024]: The sensors 116 may be used to sense the environment in which the vehicle 110 is operating, e.g., sensors 116 can detect phenomena such as weather conditions (precipitation, external ambient temperature, etc.), the grade of a road, the location of a road (e.g., using road edges, lane markings, etc.), or locations of target objects such as neighboring vehicles 110. The sensors 116 may further be used to collect data including dynamic vehicle 110 data related to operations of the vehicle 110 such as velocity, yaw rate, steering angle, engine speed, brake pressure, oil pressure, the power level applied to controllers 112, 113, 114 in the vehicle 110, connectivity between components, and accurate and timely performance of components of the vehicle 110.); and
(see at least Nageshrao, para. [0012]: The first NN can be adapted by retraining the first NN based on information from the periodically retrained second NN. The initial training of the first NN can be based on both safe buffer (state, safe action) pairs and unsafe buffer (state, unsafe action) pairs collected during offline simulation and training of the second NN is based on both safe buffer (state, safe action) pairs and unsafe buffer (state, unsafe action) collected during offline simulation and collected during driving. Operating a vehicle can be based on a vehicle action output from the first NN including determining a path polynomial based on the vehicle action output.):
a data storage device (see at least Nageshrao, para. [0019]: The computing device 115 also includes nonvolatile memory such as is known. Computing device 115 can log information by storing the information in nonvolatile memory for later retrieval and transmittal via the vehicle communication network and a vehicle to infrastructure (V-to-I) interface 111 to a server computer 120 or user mobile device 160.)
that stores real world data including a sequence of images of the road environment (see at least Nageshrao, para. [0026]: A CNN for locating and identifying objects in color images 200 can include a plurality of convolutional layers interspersed with pooling layers followed by a plurality of convolutional layer interspersed with un-pooling layers to restore resolution using skip connections to convolutional layers while maintaining determined location and identity in intermediate data,...A CNN can be trained to determine locations and identities for connected regions of pixels in a color image 200, a process called image segmentation, by training the CNN using recorded images and ground truth regarding the locations and identities of objects and regions in color image 200. & para. [0028]: The simulated traffic scenes can be selected to reproduce a plurality of roadway configurations, traffic, lighting and weather conditions likely to be found in real-world environments, for example. An example of a software program that can be used to produce simulated traffic scenes is TORCS, available at torcs.sourceforge.net as of the date of filing this application. Because the color images 200 included in the simulated data include information from a near-realistic simulated environment, CNN processes the color images 200 as if they included real data from a real-world environment.), 
the sequence of images generated based on the autonomous vehicle traversing the road environment (see at least Nageshrao, para. [0024]: The sensors 116 may be used to sense the environment in which the vehicle 110 is operating, e.g., sensors 116 can detect phenomena such as weather conditions (precipitation, external ambient temperature, etc.), the grade of a road, the location of a road (e.g., using road edges, lane markings, etc.), or locations of target objects such as neighboring vehicles 110. The sensors 116 may further be used to collect data including dynamic vehicle 110 data related to operations of the vehicle 110 such as velocity, yaw rate, steering angle, engine speed, brake pressure, oil pressure, the power level applied to controllers 112, 113, 114 in the vehicle 110, connectivity between components, and accurate and timely performance of components of the vehicle 110.);
a processor configured to process offline (see at least Nageshrao, para. [0012]: The first NN can be adapted by retraining the first NN based on information from the periodically retrained second NN. The initial training of the first NN can be based on both safe buffer (state, safe action) pairs and unsafe buffer (state, unsafe action) pairs collected during offline simulation and training of the second NN is based on both safe buffer (state, safe action) pairs and unsafe buffer (state, unsafe action) collected during offline simulation and collected during driving. Operating a vehicle can be based on a vehicle action output from the first NN including determining a path polynomial based on the vehicle action output.) 
(see at least Nageshrao, para. [0038]: FIG. 5 is a diagram of an example deep reinforcement learning (DRL) system 500 that incorporates DNN 400 in a system that can be trained to output 510 vehicle transition states in response to input 514 vehicle state values in an improved fashion by determining safe operation of vehicle 110 using a safety agent block 508 (SA). DRL system 500 includes a safety agent 508 that inputs the predicted vehicle transition states output from DNN block 400 and evaluates them for safety violations using a short horizon safety check as discussed above and, in examples where the vehicle transition states correspond to safety violations, replace the vehicle transition states with termination vehicle states in output 510.), and 
train the autonomous vehicle based on the optimized set of control policies (see at least Nageshrao, para. [0033]: FIG. 4 is a diagram of an example DNN 400 that can calculate and output vehicle transition states 430 of output state layer 428 based on input vehicle states 404 of input state layer 402 and, during training, rewards 408 of rewards layer 406. DNN 400 includes hidden layers 416, 422 which respectively include nodes 418, 424 that are fully connected via interconnections 420, 426, 432 with input vehicle states 404, rewards 408 and output vehicle transition states 430. Interconnections 420, 426, 432 are means for transferring data to, from and between nodes 418, 428, where DNN 400 calculations occur. Each node 418 of hidden layer 416 can access all input vehicle states 404 and, during training, all rewards 408 for use in calculating intermediate states to be provided to nodes 424 of hidden layer 422 via interconnections 426. All nodes 424 of hidden layer 422 can access all intermediate states via interconnections 426 for use in calculating and transmitting output vehicle transition states 430 via interconnections 432 to output state layer 428. A vehicle transition state 430 is a set of data providing values describing a vehicle trajectory, e.g., a transition state can include predicted 3D pose, speed, and lateral and longitudinal acceleration data, and can be output to software programs to create a path polynomial 330 for operation of vehicle 110, for example.).
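For illustration only, the fully connected arrangement described in Nageshrao, para. [0033] (input vehicle states and, during training, rewards feeding two hidden layers that produce an output vehicle transition state) can be sketched as below. This is a minimal sketch and not code from the reference: the function names, the ReLU activation, the omission of bias terms, and the weight matrices are all assumptions made for the example.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def dense(inputs: Sequence[float], weights: Matrix, relu: bool = True) -> List[float]:
    """One fully connected layer: every output node sees every input value."""
    outputs = [sum(w * x for w, x in zip(row, inputs)) for row in weights]
    # ReLU is an assumed activation; the reference does not name one.
    return [max(0.0, v) for v in outputs] if relu else outputs


def predict_transition_state(
    vehicle_states: Sequence[float],
    rewards: Sequence[float],
    w_hidden1: Matrix,
    w_hidden2: Matrix,
    w_out: Matrix,
) -> List[float]:
    """Hypothetical forward pass: states and rewards -> two hidden layers -> output."""
    x = list(vehicle_states) + list(rewards)  # both inputs reach the first hidden layer
    h1 = dense(x, w_hidden1)                  # stands in for hidden layer 416
    h2 = dense(h1, w_hidden2)                 # stands in for hidden layer 422
    return dense(h2, w_out, relu=False)       # stands in for output state layer 428


if __name__ == "__main__":
    out = predict_transition_state(
        [1.0, 2.0], [0.5],
        w_hidden1=[[1.0, 0.0, 0.0], [0.0, 1.0, 2.0]],
        w_hidden2=[[1.0, 1.0], [1.0, -1.0]],
        w_out=[[0.5, 1.0]],
    )
    print(out)
```

The sketch shows only the data flow between layers; how the weights would be trained is outside its scope.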

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
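3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.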

Claims 2-4, 6-7, 11-13, 15-16 & 20 are rejected under 35 U.S.C. 103 as being unpatentable over Nageshrao, further in view of US 2019/0258938A1 (“Minh”).
As per claim 2 Nageshrao discloses
wherein the processing the sequence of images comprises:
obtaining a first image from the sequence of images and processing the first image with the deep reinforcement learning agent to obtain an action (see at least Nageshrao, para. [0038]: FIG. 5 is a diagram of an example deep reinforcement learning (DRL) system 500 that incorporates DNN 400 in a system that can be trained to output 510 vehicle transition states in response to input 514 vehicle state values in an improved fashion by determining safe operation of vehicle 110 using a safety agent block 508 (SA). DRL system 500 includes a safety agent 508 that inputs the predicted vehicle transition states output from DNN block 400 and evaluates them for safety violations using a short horizon safety check as discussed above and, in examples where the vehicle transition states correspond to safety violations, replace the vehicle transition states with termination vehicle states in output 510.);
Nageshrao does not explicitly disclose
modifying a next image from the sequence of images based on the action; and
determining the optimized set of control policies based on the modified next image.
Minh teaches
modifying a next image from the sequence of images based on the action (see at least Minh, para. [0016]:  training an action selection policy neural network using a first reinforcement learning technique, wherein the action selection policy neural network has a plurality of network parameters and is used in selecting actions to be performed by an agent interacting with an environment, wherein the action selection policy neural network is configured to receive an input comprising an observation input and to process the input in accordance with the network parameters to generate an action selection policy output, and wherein training the action selection policy neural network comprises adjusting values of the action selection policy network parameters; during the training of the action selection neural network using the first reinforcement learning technique: training a reward prediction neural network on interactions of the agent with the environment during the training of the action selection neural network, wherein the reward prediction neural network has reward prediction parameters and is configured to: receive one or more intermediate outputs generated by the action selection policy neural network that characterize a sequence of observation images received as a result of the interactions of the agent with the environment, and process the one or more intermediate outputs in accordance with the reward prediction parameters to generate a predicted reward that is an estimate of a reward that will be received with a next observation image that follows a last observation image in the sequence; & para. [0047]: Each of the auxiliary control neural networks is associated with one or more respective auxiliary task rewards. The auxiliary task rewards of the pixel control neural network 118 are derived from changes in the pixels in one or more regions from a given observation image 104 to a next observation image received as a result of the agent 110 performing an action 110 in response to the given observation 104. The auxiliary task rewards of the feature control neural network 120 are derived from changes in the activations generated by one or more units in a particular hidden layer of the action selection policy neural network 112 between processing a given observation 104 and processing of a next observation received as a result of the agent 108 performing an action 110 in response to the given observation.); and
determining the optimized set of control policies based on the modified next image (see at least Minh, para. [0016]: training a reward prediction neural network on interactions of the agent with the environment during the training of the action selection neural network, wherein the reward prediction neural network has reward prediction parameters and is configured to: receive one or more intermediate outputs generated by the action selection policy neural network that characterize a sequence of observation images received as a result of the interactions of the agent with the environment, and process the one or more intermediate outputs in accordance with the reward prediction parameters to generate a predicted reward that is an estimate of a reward that will be received with a next observation image that follows a last observation image in the sequence & para. [0047]: Each of the auxiliary control neural networks is associated with one or more respective auxiliary task rewards. The auxiliary task rewards of the pixel control neural network 118 are derived from changes in the pixels in one or more regions from a given observation image 104 to a next observation image received as a result of the agent 110 performing an action 110 in response to the given observation 104. The auxiliary task rewards of the feature control neural network 120 are derived from changes in the activations generated by one or more units in a particular hidden layer of the action selection policy neural network 112 between processing a given observation 104 and processing of a next observation received as a result of the agent 108 performing an action 110 in response to the given observation.).
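It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Nageshrao to incorporate the teaching of modifying a next image from the sequence of images based on the action and determining the optimized set of control policies based on the modified next image of Minh in order to allow more efficient use of computational resources in training when using reinforcement learning (see at least Minh, para. [0028]).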

As per claim 3 Nageshrao discloses
further comprising determining whether the image depicts an unwanted driving behavior (see at least Nageshrao, para. [0054]: Process 700 begins at block 702, in which a computing device 115 included in a vehicle 110 can input vehicle sensor data into a CNN to determine vehicle state information, including vehicle location, speed and direction with regard to map data, & para. [0056]: At block 706 of process 700, safety agent block 508 determines whether or not output vehicle transition states represent a safety violation. Safety agent block 508 can be a rule-based machine learning software program that determines safety violations by comparing the vehicle transition states to empirically determined states encoded as rules in a rule-based machine learning software, where rules can be based on empirically determined probabilities related to movement and future positions of other vehicles 314, 316, 318, 320, 322, for example.), 
when the image does depict an unwanted driving behavior (see at least Nageshrao, para. [0059]: At block 712 safety agent 508 outputs 510 vehicle termination states to avoid outputting vehicle transition states having a high probability of including a short-horizon safety violation to computing device 115 to use to determine a path polynomial 330 for vehicle 110 operation. Safety agent 508 outputs safety violation information and vehicle transition states to transition function/reward function block 512 to determine an error function ε based on the vehicle transition function and an input 514 next observation of vehicle states. If the error function ε is greater than or equal to an empirically determined threshold, process 700 passes to block 716. If the error function ε is less than the empirically determined threshold, process 700 ends. & para. [0061]: At block 716 process 700 stores vehicle transition state, safety violation information (unsafe action) and a reward function determined based on the error function ε in buffer B2 518. Process 700 then passes to block 718.), 
processing the first image with the deep reinforcement learning agent to obtain the next action (see at least Nageshrao, para. [0062]: At block 718 process 700 periodically uploads buffers B1 518 and B2 516 to a server computer 120, wherein server computer 120 re-trains a copy of DRL system 500 based on the uploaded buffers. Server computer 120 periodically downloads a re-trained copy of DRL system 500 to vehicle 110 to update DRL system 500. Following block 718 process 700 ends.).
Nageshrao does not explicitly disclose
a modified next image,
when the modified next image does not depict an unwanted driving behavior, processing the modified next image with the deep reinforcement learning agent to obtain a next action.
Minh teaches
a modified next image (see at least Minh, para. [0047]: The auxiliary task rewards of the pixel control neural network 118 are derived from changes in the pixels in one or more regions from a given observation image 104 to a next observation image received as a result of the agent 110 performing an action 110 in response to the given observation 104. The auxiliary task rewards of the feature control neural network 120 are derived from changes in the activations generated by one or more units in a particular hidden layer of the action selection policy neural network 112 between processing a given observation 104 and processing of a next observation received as a result of the agent 108 performing an action 110 in response to the given observation.),
when the modified next image does not depict an unwanted driving behavior (see at least Minh, para. [0016]:  training an action selection policy neural network using a first reinforcement learning technique, wherein the action selection policy neural network has a plurality of network parameters and is used in selecting actions to be performed by an agent interacting with an environment, wherein the action selection policy neural network is configured to receive an input comprising an observation input and to process the input in accordance with the network parameters to generate an action selection policy output, and wherein training the action selection policy neural network comprises adjusting values of the action selection policy network parameters;), 
processing the modified next image with the deep reinforcement learning agent to obtain a next action (see at least Minh, para. [0047]: Each of the auxiliary control neural networks is associated with one or more respective auxiliary task rewards. The auxiliary task rewards of the pixel control neural network 118 are derived from changes in the pixels in one or more regions from a given observation image 104 to a next observation image received as a result of the agent 110 performing an action 110 in response to the given observation 104. The auxiliary task rewards of the feature control neural network 120 are derived from changes in the activations generated by one or more units in a particular hidden layer of the action selection policy neural network 112 between processing a given observation 104 and processing of a next observation received as a result of the agent 108 performing an action 110 in response to the given observation.).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Nageshrao to incorporate the teaching of a modified next image, when the modified next image does not depict an unwanted driving behavior, processing the modified next image with the deep reinforcement learning agent to obtain a next action of Minh in order to allow more efficient use of computational resources in training when using reinforcement learning (see at least Minh, para. [0028]).
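For illustration only, the safety-agent flow quoted above from Nageshrao, paras. [0056]-[0061] (check a predicted vehicle transition state for a safety violation, substitute a termination state when one is found, and store the unsafe case with a reward when the error meets a threshold) can be sketched as follows. This is a minimal sketch under stated assumptions and not code from the reference: the class and field names, the rule function, the maximum-absolute-difference error measure, and the use of the negated error as the reward are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

State = Tuple[float, ...]


@dataclass
class SafetyAgent:
    # Hypothetical rule: returns True when a predicted transition state is unsafe.
    is_violation: Callable[[State], bool]
    termination_state: State
    error_threshold: float
    # Stands in for the unsafe buffer of (state, reward) entries.
    unsafe_buffer: List[Tuple[State, float]] = field(default_factory=list)

    def check(self, predicted: State, observed_next: State) -> State:
        """Return the state to output: the prediction if safe, else the termination state."""
        if not self.is_violation(predicted):
            return predicted
        # Assumed error measure: largest absolute gap between prediction and observation.
        error = max(abs(p - o) for p, o in zip(predicted, observed_next))
        if error >= self.error_threshold:
            # Assumed reward: the negated error, so larger errors are penalised more.
            self.unsafe_buffer.append((predicted, -error))
        return self.termination_state


if __name__ == "__main__":
    # Toy rule: a first component (e.g. gap to the lead vehicle) below 1.0 is unsafe.
    agent = SafetyAgent(lambda s: s[0] < 1.0, (0.0, 0.0), error_threshold=0.5)
    print(agent.check((5.0, 2.0), (5.0, 2.0)))  # safe prediction passes through
    print(agent.check((0.5, 2.0), (2.0, 2.0)))  # unsafe prediction is replaced
    print(agent.unsafe_buffer)
```

The sketch omits the periodic upload and retraining described in para. [0062]; it is limited to the per-step check.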

As per claim 4 Nageshrao does not explicitly disclose
further comprising computing a reward based on the modified next image, and wherein the processing the modified next image is based on the reward.
Minh teaches
further comprising computing a reward based on the modified next image, and wherein the processing the modified next image is based on the reward (see at least Minh, para. [0016]: training a reward prediction neural network on interactions of the agent with the environment during the training of the action selection neural network, wherein the reward prediction neural network has reward prediction parameters and is configured to: receive one or more intermediate outputs generated by the action selection policy neural network that characterize a sequence of observation images received as a result of the interactions of the agent with the environment, and process the one or more intermediate outputs in accordance with the reward prediction parameters to generate a predicted reward that is an estimate of a reward that will be received with a next observation image that follows a last observation image in the sequence & para. [0047]: Each of the auxiliary control neural networks is associated with one or more respective auxiliary task rewards. The auxiliary task rewards of the pixel control neural network 118 are derived from changes in the pixels in one or more regions from a given observation image 104 to a next observation image received as a result of the agent 110 performing an action 110 in response to the given observation 104. The auxiliary task rewards of the feature control neural network 120 are derived from changes in the activations generated by one or more units in a particular hidden layer of the action selection policy neural network 112 between processing a given observation 104 and processing of a next observation received as a result of the agent 108 performing an action 110 in response to the given observation.).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Nageshrao to incorporate the teaching of computing a reward based on the modified next image, and wherein the processing the modified next image is based on the reward of Minh in order to allow more efficient use of computational resources in training when using reinforcement learning (see at least Minh, para. [0028]).
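For illustration only, the pixel-based auxiliary reward quoted above from Minh, para. [0047] (a reward derived from changes in the pixels of one or more regions between a given observation image and the next observation image) can be sketched as follows. This is a minimal sketch and not code from the reference: the function name, the square grid of regions, and the use of the mean absolute pixel change as the reward are assumptions made for the example.

```python
from typing import List

Image = List[List[float]]  # grayscale image as rows of pixel intensities


def pixel_change_rewards(obs: Image, next_obs: Image, cell: int) -> List[List[float]]:
    """Mean absolute pixel change in each cell-by-cell region (an assumed reward)."""
    rows, cols = len(obs), len(obs[0])
    rewards: List[List[float]] = []
    for r0 in range(0, rows, cell):
        row_rewards: List[float] = []
        for c0 in range(0, cols, cell):
            # Collect per-pixel changes inside this region, clipping at the image edge.
            diffs = [
                abs(next_obs[r][c] - obs[r][c])
                for r in range(r0, min(r0 + cell, rows))
                for c in range(c0, min(c0 + cell, cols))
            ]
            row_rewards.append(sum(diffs) / len(diffs))
        rewards.append(row_rewards)
    return rewards


if __name__ == "__main__":
    before = [[0.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.0]]
    after = [[1.0, 1.0, 0.0, 0.0],
             [1.0, 1.0, 0.0, 2.0]]
    print(pixel_change_rewards(before, after, cell=2))  # [[1.0, 0.5]]
```

In the example, the left region changes by 1.0 at every pixel and the right region changes at one pixel only, so the per-region rewards differ accordingly.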

As per claim 6 Nageshrao discloses
wherein the unwanted driving behavior comprises steering into an object (see at least Nageshrao, para. [0041]: Short-horizon safety violations can include collisions and near-collisions with other vehicles or pedestrians, or vehicle 110 movement that would require another vehicle or pedestrian to stop or alter direction that would occur during the time frame represented by the operation of vehicle 110 to travel to a predicted 3D location, for example.).

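As per claim 7 Nageshrao does not explicitly disclose
further comprising iteratively processing a next image of the vision sequence with the deep reinforcement learning agent based on a computed reward associated with the next image.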
Minh teaches
further comprising iteratively processing a next image of the vision sequence with the deep reinforcement learning agent based on a computed reward associated with the next image (see at least Minh, para. [0016]: training a reward prediction neural network on interactions of the agent with the environment during the training of the action selection neural network, wherein the reward prediction neural network has reward prediction parameters and is configured to: receive one or more intermediate outputs generated by the action selection policy neural network that characterize a sequence of observation images received as a result of the interactions of the agent with the environment, and process the one or more intermediate outputs in accordance with the reward prediction parameters to generate a predicted reward that is an estimate of a reward that will be received with a next observation image that follows a last observation image in the sequence & para. [0047]: Each of the auxiliary control neural networks is associated with one or more respective auxiliary task rewards. The auxiliary task rewards of the pixel control neural network 118 are derived from changes in the pixels in one or more regions from a given observation image 104 to a next observation image received as a result of the agent 110 performing an action 110 in response to the given observation 104. The auxiliary task rewards of the feature control neural network 120 are derived from changes in the activations generated by one or more units in a particular hidden layer of the action selection policy neural network 112 between processing a given observation 104 and processing of a next observation received as a result of the agent 108 performing an action 110 in response to the given observation.).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Nageshrao to incorporate the teaching of iteratively processing a next image of the vision sequence with the deep reinforcement learning agent based on a computed reward associated with the next image of Minh in order to allow more efficient use of computational resources in training when using reinforcement learning (see at least Minh, para. [0028]).

As per claim 11, Nageshrao discloses
wherein the processor is configured to process the sequence of images by:
obtaining a first image from the sequence of images and processing the first image with the deep reinforcement learning agent to obtain an action (see at least Nageshrao, para. [0038]: FIG. 5 is a diagram of an example deep reinforcement learning (DRL) system 500 that incorporates DNN 400 in a system that can be trained to output 510 vehicle transition states in response to input 514 vehicle state values in an improved fashion by determining safe operation of vehicle 110 using a safety agent block 508 (SA). DRL system 500 includes a safety agent 508 that inputs the predicted vehicle transition states output from DNN block 400 and evaluates them for safety violations using a short horizon safety check as discussed above and, in examples where the vehicle transition states correspond to safety violations, replace the vehicle transition states with termination vehicle states in output 510.);
Nageshrao does not explicitly disclose
modifying a next image from the sequence of images based on the action; and
determining the optimized set of control policies based on the modified next image.
Minh teaches
modifying a next image from the sequence of images based on the action
(see at least Minh, para. [0016]:  training an action selection policy neural network using a first reinforcement learning technique, wherein the action selection policy neural network has a plurality of network parameters and is used in selecting actions to be performed by an agent interacting with an environment, wherein the action selection policy neural network is configured to receive an input comprising an observation input and to process the input in accordance with the network parameters to generate an action selection policy output, and wherein training the action selection policy neural network comprises adjusting values of the action selection policy network parameters; during the training of the action selection neural network using the first reinforcement learning technique: training a reward prediction neural network on interactions of the agent with the environment during the training of the action selection neural network, wherein the reward prediction neural network has reward prediction parameters and is configured to: receive one or more intermediate outputs generated by the action selection policy neural network that characterize a sequence of observation images received as a result of the interactions of the agent with the environment, and process the one or more intermediate outputs in accordance with the reward prediction parameters to generate a predicted reward that is an estimate of a reward that will be received with a next observation image that follows a last observation image in the sequence; & para. [0047]: Each of the auxiliary control neural networks is associated with one or more respective auxiliary task rewards. The auxiliary task rewards of the pixel control neural network 118 are derived from changes in the pixels in one or more regions from a given observation image 104 to a next observation image received as a result of the agent 110 performing an action 110 in response to the given observation 104. 
The auxiliary task rewards of the feature control neural network 120 are derived from changes in the activations generated by one or more units in a particular hidden layer of the action selection policy neural network 112 between processing a given observation 104 and processing of a next observation received as a result of the agent 108 performing an action 110 in response to the given observation.); and
determining the optimized set of control policies based on the modified next image (see at least Minh, para. [0016]: training a reward prediction neural network on interactions of the agent with the environment during the training of the action selection neural network, wherein the reward prediction neural network has reward prediction parameters and is configured to: receive one or more intermediate outputs generated by the action selection policy neural network that characterize a sequence of observation images received as a result of the interactions of the agent with the environment, and process the one or more intermediate outputs in accordance with the reward prediction parameters to generate a predicted reward that is an estimate of a reward that will be received with a next observation image that follows a last observation image in the sequence & para. [0047]: Each of the auxiliary control neural networks is associated with one or more respective auxiliary task rewards. The auxiliary task rewards of the pixel control neural network 118 are derived from changes in the pixels in one or more regions from a given observation image 104 to a next observation image received as a result of the agent 110 performing an action 110 in response to the given observation 104. The auxiliary task rewards of the feature control neural network 120 are derived from changes in the activations generated by one or more units in a particular hidden layer of the action selection policy neural network 112 between processing a given observation 104 and processing of a next observation received as a result of the agent 108 performing an action 110 in response to the given observation.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Nageshrao to incorporate the teaching of modifying a next image from the sequence of images based on the action; and determining the optimized set of control policies based on the modified next image of Minh in order to allow more efficient use of computational resources in training when using reinforcement learning (see at least Minh, para. [0028]).

As per claim 12, Nageshrao discloses
wherein the processor is configured to determine whether the modified next image depicts an unwanted driving behavior (see at least Nageshrao, para. [0054]: Process 700 begins at block 702, in which a computing device 115 included in a vehicle 110 can input vehicle sensor data into a CNN to determine vehicle state information, including vehicle location, speed and direction with regard to map data, & para. [0056]: At block 706 of process 700, safety agent block 508 determines whether or not output vehicle transition states represent a safety violation. Safety agent block 508 can be a rule-based machine learning software program that determines safety violations by comparing the vehicle transition states to empirically determined states encoded as rules in a rule-based machine learning software, where rules can be based on empirically determined probabilities related to movement and future positions of other vehicles 314, 316, 318, 320, 322, for example.), 
when the image does depict an unwanted driving behavior (see at least Nageshrao, para. [0059]: At block 712 safety agent 508 outputs 510 vehicle termination states to avoid outputting vehicle transition states having a high probability of including a short-horizon safety violation to computing device 115 to use to determine a path polynomial 330 for vehicle 110 operation. Safety agent 508 outputs safety violation information and vehicle transition states to transition function/reward function block 512 to determine an error function ε based on the vehicle transition function and an input 514 next observation of vehicle states. If the error function ε is greater than or equal to an empirically determined threshold, process 700 passes to block 716. If the error function ε is less than the empirically determined threshold, process 700 ends. & para. [0061]: At block 716 process 700 stores vehicle transition state, safety violation information (unsafe action) and a reward function determined based on the error function ε in buffer B2 518. Process 700 then passes to block 718.), 
process the first image with the deep learning reinforcement agent to obtain the next action (see at least Nageshrao, para. [0062]: At block 718 process 700 periodically uploads buffers B1 518 and B2 516 to a server computer 120, wherein server computer 120 re-trains a copy of DRL system 500 based on the uploaded buffers. Server computer 120 periodically downloads a re-trained copy of DRL system 500 to vehicle 110 to update DRL system 500. Following block 718 process 700 ends.).
Nageshrao does not explicitly disclose
a modified next image,
when the modified next image does not depict an unwanted driving behavior, processing the modified next image with the deep reinforcement learning agent to obtain a next action.
Minh teaches
a modified next image (see at least Minh, para. [0047]: The auxiliary task rewards of the pixel control neural network 118 are derived from changes in the pixels in one or more regions from a given observation image 104 to a next observation image received as a result of the agent 110 performing an action 110 in response to the given observation 104. The auxiliary task rewards of the feature control neural network 120 are derived from changes in the activations generated by one or more units in a particular hidden layer of the action selection policy neural network 112 between processing a given observation 104 and processing of a next observation received as a result of the agent 108 performing an action 110 in response to the given observation.),
when the modified next image does not depict an unwanted driving behavior (see at least Minh, para. [0016]:  training an action selection policy neural network using a first reinforcement learning technique, wherein the action selection policy neural network has a plurality of network parameters and is used in selecting actions to be performed by an agent interacting with an environment, wherein the action selection policy neural network is configured to receive an input comprising an observation input and to process the input in accordance with the network parameters to generate an action selection policy output, and wherein training the action selection policy neural network comprises adjusting values of the action selection policy network parameters;), 
process the modified next image with the deep reinforcement learning agent to obtain a next action (see at least Minh, para. [0047]: Each of the auxiliary control neural networks is associated with one or more respective auxiliary task rewards. The auxiliary task rewards of the pixel control neural network 118 are derived from changes in the pixels in one or more regions from a given observation image 104 to a next observation image received as a result of the agent 110 performing an action 110 in response to the given observation 104. The auxiliary task rewards of the feature control neural network 120 are derived from changes in the activations generated by one or more units in a particular hidden layer of the action selection policy neural network 112 between processing a given observation 104 and processing of a next observation received as a result of the agent 108 performing an action 110 in response to the given observation.).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Nageshrao to incorporate the teaching of a modified next image, when the modified next image does not depict an unwanted driving behavior, process the modified next image with the deep reinforcement learning agent to obtain a next action of Minh in order to allow more efficient use of computational resources in training when using reinforcement learning (see at least Minh, para. [0028]).

As per claim 13, Nageshrao does not explicitly disclose
wherein the processor is configured to compute a reward based on the modified next image, and wherein the processing the modified next image is based on the reward.
Minh discloses
wherein the processor is configured to compute a reward based on the modified next image, and wherein the processing the modified next image is based on the reward (see at least Minh, para. [0016]: training a reward prediction neural network on interactions of the agent with the environment during the training of the action selection neural network, wherein the reward prediction neural network has reward prediction parameters and is configured to: receive one or more intermediate outputs generated by the action selection policy neural network that characterize a sequence of observation images received as a result of the interactions of the agent with the environment, and process the one or more intermediate outputs in accordance with the reward prediction parameters to generate a predicted reward that is an estimate of a reward that will be received with a next observation image that follows a last observation image in the sequence & para. [0047]: Each of the auxiliary control neural networks is associated with one or more respective auxiliary task rewards. The auxiliary task rewards of the pixel control neural network 118 are derived from changes in the pixels in one or more regions from a given observation image 104 to a next observation image received as a result of the agent 110 performing an action 110 in response to the given observation 104. The auxiliary task rewards of the feature control neural network 120 are derived from changes in the activations generated by one or more units in a particular hidden layer of the action selection policy neural network 112 between processing a given observation 104 and processing of a next observation received as a result of the agent 108 performing an action 110 in response to the given observation.).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Nageshrao to incorporate Minh's teaching that the processor is configured to compute a reward based on the modified next image and that the processing of the modified next image is based on the reward, in order to allow more efficient use of computational resources in training when using reinforcement learning (see at least Minh, para. [0028]).

As per claim 15, Nageshrao discloses
wherein the unwanted driving behavior comprises steering into an object (see at least Nageshrao, para. [0041]: Short-horizon safety violations can include collisions and near-collisions with other vehicles or pedestrians, or vehicle 110 movement that would require another vehicle or pedestrian to stop or alter direction that would occur during the time frame represented by the operation of vehicle 110 to travel to a predicted 3D location, for example.).

As per claim 16, Nageshrao does not explicitly disclose
wherein the processor is configured to iteratively process a next image of the vision sequence with the deep reinforcement learning agent based on a computed reward associated with the next image.
Minh teaches
wherein the processor is configured to iteratively process a next image of the vision sequence with the deep reinforcement learning agent based on a computed reward associated with the next image (see at least Minh, para. [0016]: training a reward prediction neural network on interactions of the agent with the environment during the training of the action selection neural network, wherein the reward prediction neural network has reward prediction parameters and is configured to: receive one or more intermediate outputs generated by the action selection policy neural network that characterize a sequence of observation images received as a result of the interactions of the agent with the environment, and process the one or more intermediate outputs in accordance with the reward prediction parameters to generate a predicted reward that is an estimate of a reward that will be received with a next observation image that follows a last observation image in the sequence & para. [0047]: Each of the auxiliary control neural networks is associated with one or more respective auxiliary task rewards. The auxiliary task rewards of the pixel control neural network 118 are derived from changes in the pixels in one or more regions from a given observation image 104 to a next observation image received as a result of the agent 110 performing an action 110 in response to the given observation 104. The auxiliary task rewards of the feature control neural network 120 are derived from changes in the activations generated by one or more units in a particular hidden layer of the action selection policy neural network 112 between processing a given observation 104 and processing of a next observation received as a result of the agent 108 performing an action 110 in response to the given observation.).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Nageshrao to incorporate Minh's teaching that the processor is configured to iteratively process a next image of the vision sequence with the deep reinforcement learning agent based on a computed reward associated with the next image, in order to allow more efficient use of computational resources in training when using reinforcement learning (see at least Minh, para. [0028]).

As per claim 20, Nageshrao discloses
wherein the processor is configured to process the sequence of images by:
obtaining a first image from the sequence of images and processing the first image with the deep reinforcement learning agent to obtain an action (see at least Nageshrao, para. [0038]: FIG. 5 is a diagram of an example deep reinforcement learning (DRL) system 500 that incorporates DNN 400 in a system that can be trained to output 510 vehicle transition states in response to input 514 vehicle state values in an improved fashion by determining safe operation of vehicle 110 using a safety agent block 508 (SA). DRL system 500 includes a safety agent 508 that inputs the predicted vehicle transition states output from DNN block 400 and evaluates them for safety violations using a short horizon safety check as discussed above and, in examples where the vehicle transition states correspond to safety violations, replace the vehicle transition states with termination vehicle states in output 510.);
Nageshrao does not explicitly disclose
modifying a next image from the sequence of images based on the action; and
determining the optimized set of control policies based on the modified next image.
Minh teaches
modifying a next image from the sequence of images based on the action (see at least Minh, para. [0016]:  training an action selection policy neural network using a first reinforcement learning technique, wherein the action selection policy neural network has a plurality of network parameters and is used in selecting actions to be performed by an agent interacting with an environment, wherein the action selection policy neural network is configured to receive an input comprising an observation input and to process the input in accordance with the network parameters to generate an action selection policy output, and wherein training the action selection policy neural network comprises adjusting values of the action selection policy network parameters; during the training of the action selection neural network using the first reinforcement learning technique: training a reward prediction neural network on interactions of the agent with the environment during the training of the action selection neural network, wherein the reward prediction neural network has reward prediction parameters and is configured to: receive one or more intermediate outputs generated by the action selection policy neural network that characterize a sequence of observation images received as a result of the interactions of the agent with the environment, and process the one or more intermediate outputs in accordance with the reward prediction parameters to generate a predicted reward that is an estimate of a reward that will be received with a next observation image that follows a last observation image in the sequence; & para. [0047]: Each of the auxiliary control neural networks is associated with one or more respective auxiliary task rewards. 
The auxiliary task rewards of the pixel control neural network 118 are derived from changes in the pixels in one or more regions from a given observation image 104 to a next observation image received as a result of the agent 110 performing an action 110 in response to the given observation 104. The auxiliary task rewards of the feature control neural network 120 are derived from changes in the activations generated by one or more units in a particular hidden layer of the action selection policy neural network 112 between processing a given observation 104 and processing of a next observation received as a result of the agent 108 performing an action 110 in response to the given observation.); and
determining the optimized set of control policies based on the modified next image (see at least Minh, para. [0016]: training a reward prediction neural network on interactions of the agent with the environment during the training of the action selection neural network, wherein the reward prediction neural network has reward prediction parameters and is configured to: receive one or more intermediate outputs generated by the action selection policy neural network that characterize a sequence of observation images received as a result of the interactions of the agent with the environment, and process the one or more intermediate outputs in accordance with the reward prediction parameters to generate a predicted reward that is an estimate of a reward that will be received with a next observation image that follows a last observation image in the sequence & para. [0047]: Each of the auxiliary control neural networks is associated with one or more respective auxiliary task rewards. The auxiliary task rewards of the pixel control neural network 118 are derived from changes in the pixels in one or more regions from a given observation image 104 to a next observation image received as a result of the agent 110 performing an action 110 in response to the given observation 104. The auxiliary task rewards of the feature control neural network 120 are derived from changes in the activations generated by one or more units in a particular hidden layer of the action selection policy neural network 112 between processing a given observation 104 and processing of a next observation received as a result of the agent 108 performing an action 110 in response to the given observation.).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Nageshrao to incorporate the teaching of modifying a next image from the sequence of images based on the action; and determining the optimized set of control policies based on the modified next image of Minh in order to allow more efficient use of computational resources in training when using reinforcement learning (see at least Minh, para. [0028]).

Claims 5 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Nageshrao in view of Minh, and further in view of 2020/0090514A1 (“Sakaguchi”).
As per claim 5, Nageshrao does not explicitly disclose
wherein the unwanted driving behavior comprises steering off the road.
Sakaguchi teaches
wherein the unwanted driving behavior comprises steering off the road (see at least Sakaguchi, para. [0105-0106]: The control unit 3 determines whether the host vehicle has deviated from the lane of the scheduled travel route by using the route information and the surrounding information of the host vehicle acquired by the information acquiring unit 2. The notification unit 4 notifies the driver of the host vehicle when the control unit 3 determines lane deviation of the host vehicle.).
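It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Nageshrao to incorporate the teaching of the unwanted driving behavior comprises steering off the road of Sakaguchi in order to increase the probability of avoiding hazards (see at least Sakaguchi, para. [0006]).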

As per claim 14, Nageshrao does not explicitly disclose
wherein the unwanted driving behavior comprises steering off the road.
Sakaguchi teaches
wherein the unwanted driving behavior comprises steering off the road (see at least Sakaguchi, para. [0105-0106]: The control unit 3 determines whether the host vehicle has deviated from the lane of the scheduled travel route by using the route information and the surrounding information of the host vehicle acquired by the information acquiring unit 2. The notification unit 4 notifies the driver of the host vehicle when the control unit 3 determines lane deviation of the host vehicle.).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Nageshrao to incorporate the teaching of the unwanted driving behavior comprises steering off the road of Sakaguchi in order to increase the probability of avoiding hazards (see at least Sakaguchi, para. [0006]).
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MOHAMED ABDO ALGEHAIM whose telephone number is (571)272-3628. The examiner can normally be reached Monday-Friday 8-5PM EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) form.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Fadey Jabr, can be reached at 571-272-1516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/M.A.A./Examiner, Art Unit 3668         
/Fadey S. Jabr/Supervisory Patent Examiner, Art Unit 3668