DETAILED ACTION
This communication is a Final Office Action rejection on the merits. Claims 1-20 are currently pending and have been addressed below.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments

Applicant's arguments filed on 09/23/2022 (related to the 101 Rejection) have been fully considered and are persuasive. The 101 Rejection has been withdrawn.
Applicant's arguments filed on 09/23/2022 (related to the 103 Rejection) have been fully considered but are moot in view of the new grounds of rejection. Applicant's amendments necessitated the new ground(s) of rejection presented in this Office action. A rejection based on the newly cited reference(s) follows.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):

(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


The term "level of similarity" in claims 1, 9, and 16 is a relative term which renders the claim indefinite.  The term "level of similarity" is not defined by the claim, the specification does not provide a standard for ascertaining the requisite degree, and one of ordinary skill in the art would not be reasonably apprised of the scope of the invention.  Applicant’s specification states similarity indices resulting from pattern matching indicate a level of similarity (Paragraph 0022). For examination purposes the term “level of similarity” has been construed to be a match between a data signal and a template stored in a database.
Claims 2-8, 10-15, and 17-20 are also rejected because they depend from one of independent claims 1, 9, or 16.


Claims 1, 9, and 16 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being incomplete for omitting essential structural cooperative relationships of elements, such omission amounting to a gap between the necessary structural connections. See MPEP § 2172.01. The omitted structural cooperative relationship is the relationship between the long/short term predictor scores and the discount factor. Applicant’s specification states that discount factors may decrease the rewards based on the sensor pattern matching (Paragraphs 0049 & 0072). However, it is not clear how the long/short term predictor scores are calculated, how the system determines whether the prediction is a long/short term prediction, and how those scores are used to adjust a discount factor. Examiner recommends that Applicant further elaborate on how the sensor pattern matching is related to the long/short term predictor scores and how those scores are used to adjust the discount factor.
Claims 2-8, 10-15, and 17-20 are also rejected because they depend from one of independent claims 1, 9, or 16.


Patent Subject Matter Eligibility
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 1-20 are not rejected under 35 U.S.C. 101 because the claimed invention includes an additional element that integrates the abstract idea into a practical application. 
Claims 1, 9, and 16 are eligible. The additional element of “reinforced learning” is used to determine rewards resulting from the action (Paragraph 0016). In this case, observations regarding the results of the actions are provided continuously to the reinforcement learning, wherein the reinforcement learning uses the new observations to update the weights between the nodes in the neural network based on the observations and achieved rewards relative to the possible rewards. Therefore, the additional element of “reinforced learning” integrates the abstract idea into a practical application because the supervised learning model using a feedback loop applies the judicial exception in a meaningful way beyond generally linking the use of the judicial exception to a particular technological environment, such that the claim as a whole is more than a drafting effort designed to monopolize the exception, as discussed in MPEP 2106.05(e) and the Vanda Memo issued in June 2018.
Claims 2-8, 10-15, and 17-20 are also eligible because they depend from one of independent claims 1, 9, or 16.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The text of those sections of Title 35, U.S. Code not included in this action can be found in a prior Office action.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.



Claims 1-4, 7-11, 14-17, and 19-20 are rejected under 35 U.S.C. 103 as being unpatentable over Levihn et al. (US 11,243,532 B1), in view of Zhang et al. (US 10,434,935 B1), in further view of Bennett et al. (US 2015/0019241 A1).
Regarding claim 1 (Currently Amended), Levihn et al. discloses a computer program product for selecting an action based on reinforced learning (Figure 13, item 9025, Executable instructions; Column 12, lines 52-67, The encodings of the (state, action) pairs may be included in the input data sets of respective instances 330 (such as instances 330A-330K) of a trained deep neural network based reinforcement learning model which has been deployed at the vehicle in the depicted embodiment. The different instances 330 may be executed at least partly in parallel with one another in some embodiments, e.g., using GPU-based devices (or other computing devices supporting parallelism) incorporated within the vehicle. Each of the instances or executions of the model may generate a quality metric termed a Q-value 370 in the depicted embodiment, indicative of the goodness of the corresponding action given the current state and one or more reward functions associated with vehicle trajectories anticipated to be achieved as a result of implementing the actions; Column 13, lines 5-12, Decision-making components of the vehicle may compare the quality metrics or Q-values 370 associated with the different actions, and identify the “best” action 375 given the current set of information available, based at least in part on the quality metrics. 
After the best action is selected, a concrete motion plan 380 may be generated to implement the action, and corresponding directives may be sent to the motion-control subsystems of the vehicle; See provisional application # 62/564,165, filed on 09/27/17, Paragraphs 0043-0044), the computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by a processor to cause the processor to (Column 4, lines 37-42, According to at least one embodiment, a non-transitory computer-accessible storage medium may store program instructions that when executed on one or more processors cause the one or more processors to identify, corresponding to a state of an environment of a vehicle, a set of proposed actions; See provisional application # 62/564,165, filed on 09/27/17, Paragraph 0010): 
	collect a set of conditions associated with an operating context in which to apply a neural network, the neural network comprising a multi-layered matrix of nodes corresponding to actions having various weights (Column 14, lines 19-29, FIG. 5 illustrates an example neural network architecture which may be used to generate value metrics for respective encodings of state and action combinations, according to at least some embodiments. A convolutional neural network is illustrated by way of example, comprising an input layer 502 to which an encoding of (state, action) combinations denoted as (s, a.sub.j) may be provided, one or more convolutional network layer groups 510 such as 510A and 510B, and a fully-connected layer 530 at which a single Q-value 540 denoted as Q(s, a.sub.j) associated with a particular action a.sub.j may be generated. Values of various weights, biases and/or other parameters at the different layers may be adjusted using back-propagation during the training phase of the model; Examiner interprets the state and action pairs/combinations as the set of conditions associated with an operating context, wherein each state and action pair is derived from sensor data (as seen in Figure 3); See provisional application # 62/564,165, filed on 09/27/17, Paragraph 0048); 
create a first training set comprising the collected set of conditions (see Figure 11 and related text in Column 19, lines 45-67, The data may be aggregated at one or more primary model training data centers 1120 in the depicted embodiment. The data centers may comprise numerous computing platforms, storage platforms and the like, from which some number of training platforms 1122 may be selected to train and evaluate neural network-based models 1150 using any of a variety of machine learning algorithms of a library 1124 (e.g., including algorithms which rely on simulations of driver behavior and/or autonomous vehicle behavior as discussed earlier). Trained models 1150, which may for example the types of DNN-based reinforcement learning models discussed earlier, may be transmitted to autonomous vehicles 1172 (e.g., AV 1172A-1172C) of fleets 1170 in the depicted embodiment. The trained models may be executed using local computing resources at the autonomous vehicle and data collected by local sensors of the autonomous vehicles, e.g., to predict vehicle environment states, evaluate and select actions, generate motion control directives to achieve vehicle trajectories which meet safety, efficiency and other desired criteria, and so on. At least a subset of the decisions made at the vehicle, as well as the local sensor data collected, may be transmitted back to the data centers as part of the ongoing data collection approach, and uses to improve and update the models in various embodiments. In some embodiments, updated versions of the models may be transmitted to the autonomous vehicle fleet from the data centers periodically, e.g., as improvements in the model accuracy and/or efficiency are achieved; Examiner interprets the data collected by local sensors of the autonomous vehicles as the collected set of conditions; See provisional application # 62/564,165, filed on 09/27/17, Paragraph 0069); 
train the neural network by applying the reinforced learning using the first training set until the neural network produces results within a satisfactory range for operational use, wherein the reinforced learning improves the neural network after a plurality of training cycles, each training cycle comprising determining an action based on the first training set (see Figure 11 and related text in Column 19, lines 45-67, The data may be aggregated at one or more primary model training data centers 1120 in the depicted embodiment. The data centers may comprise numerous computing platforms, storage platforms and the like, from which some number of training platforms 1122 may be selected to train and evaluate neural network-based models 1150 using any of a variety of machine learning algorithms of a library 1124 (e.g., including algorithms which rely on simulations of driver behavior and/or autonomous vehicle behavior as discussed earlier). Trained models 1150, which may for example the types of DNN-based reinforcement learning models discussed earlier, may be transmitted to autonomous vehicles 1172 (e.g., AV 1172A-1172C) of fleets 1170 in the depicted embodiment. The trained models may be executed using local computing resources at the autonomous vehicle and data collected by local sensors of the autonomous vehicles, e.g., to predict vehicle environment states, evaluate and select actions, generate motion control directives to achieve vehicle trajectories which meet safety, efficiency and other desired criteria, and so on. At least a subset of the decisions made at the vehicle, as well as the local sensor data collected, may be transmitted back to the data centers as part of the ongoing data collection approach, and uses to improve and update the models in various embodiments. 
In some embodiments, updated versions of the models may be transmitted to the autonomous vehicle fleet from the data centers periodically, e.g., as improvements in the model accuracy and/or efficiency are achieved; Examiner notes that the model uses the decisions made by the vehicle to further reevaluate/retrain the neural network. Therefore, improving the neural network over time; See provisional application # 62/564,165, filed on 09/27/17, Paragraph 0069), determining a reward corresponding to the action (Column 15, lines 17-51, As shown, the simulation model 620 may also indicate various parameters or elements of a reward function 614 which may be used to assign values to the attained states and the corresponding actions in the depicted embodiment. The reward associated with a given state or (state, action) sequence may be based on a set of parameters that include, for example, the progress made towards the destination of the journey in progress, the probability of avoiding a collision, the extent to which a set of applicable traffic rules is obeyed by the vehicle, a comfort level of one or more occupants of the vehicle, and/or an anticipated social interaction of an occupant of the first vehicle with one or more individuals outside the first vehicle. For example, if a particular action (such as suddenly switching a lane to move in front of another vehicle, or passing a bicycle with a very small clearance) is likely to induce negative feelings or negative social reactions among individuals outside (or inside) the vehicle, a lower reward value may be associated with that action and the resulting state (all other factors being equal) than to a state resulting from another action which is less likely to induce negative feelings. In various embodiments, the simulation model 612 may be trained at least in part on recorded observations obtained from a large number of real-world journeys of autonomous or non-autonomous vehicles. 
Training iterations of the form indicated in FIG. 6 may be performed in various embodiments until a convergence criterion is met—e.g., until the action selected in various scenarios by the agent model during the course of a journey results in near-optimal reward function values. After the training is complete, the trained agent model may be deployed as the evaluation model to a fleet of vehicles. In some embodiments, the evaluation model(s) deployed to the fleet may comprise elements or all of the simulation model as well as the agent model; See provisional application # 62/564,165, filed on 09/27/17, Paragraph 0051), determining a difference between the reward and an optimal reward, and updating the various weights of the neural network based on the difference (Column 14, lines 19-29, FIG. 5 illustrates an example neural network architecture which may be used to generate value metrics for respective encodings of state and action combinations, according to at least some embodiments. A convolutional neural network is illustrated by way of example, comprising an input layer 502 to which an encoding of (state, action) combinations denoted as (s, a.sub.j) may be provided, one or more convolutional network layer groups 510 such as 510A and 510B, and a fully-connected layer 530 at which a single Q-value 540 denoted as Q(s, a.sub.j) associated with a particular action a.sub.j may be generated. Values of various weights, biases and/or other parameters at the different layers may be adjusted using back-propagation during the training phase of the model; Column 15, lines 17-51, As shown, the simulation model 620 may also indicate various parameters or elements of a reward function 614 which may be used to assign values to the attained states and the corresponding actions in the depicted embodiment. 
The reward associated with a given state or (state, action) sequence may be based on a set of parameters that include, for example, the progress made towards the destination of the journey in progress, the probability of avoiding a collision, the extent to which a set of applicable traffic rules is obeyed by the vehicle, a comfort level of one or more occupants of the vehicle, and/or an anticipated social interaction of an occupant of the first vehicle with one or more individuals outside the first vehicle. For example, if a particular action (such as suddenly switching a lane to move in front of another vehicle, or passing a bicycle with a very small clearance) is likely to induce negative feelings or negative social reactions among individuals outside (or inside) the vehicle, a lower reward value may be associated with that action and the resulting state (all other factors being equal) than to a state resulting from another action which is less likely to induce negative feelings. In various embodiments, the simulation model 612 may be trained at least in part on recorded observations obtained from a large number of real-world journeys of autonomous or non-autonomous vehicles. Training iterations of the form indicated in FIG. 6 may be performed in various embodiments until a convergence criterion is met—e.g., until the action selected in various scenarios by the agent model during the course of a journey results in near-optimal reward function values. After the training is complete, the trained agent model may be deployed as the evaluation model to a fleet of vehicles. 
In some embodiments, the evaluation model(s) deployed to the fleet may comprise elements or all of the simulation model as well as the agent model; Examiner notes that “training until a convergence criterion is met” includes the step of “determining the difference between the reward and an optimal reward” as it’s comparing the reward with the optimal reward to find the actions that result in near-optimal reward function values; See provisional application # 62/564,165, filed on 09/27/17, Paragraphs 0048 & 0051); 
	receive a data signal (Figure 3, item 304, Objects/entities derived from sensor data; Column 12, lines 20-28, Raw input collected via a variety of sensors (e.g., similar to the sensors of local sensor collection 112, discussed in the context of FIG. 1) may be processed (e.g., using a perception subsystem similar to subsystem 113 of FIG. 1) to generate representations 304 of nearby objects and entities. In addition, in at least some embodiments, map data 306 may be obtained or stored at the vehicle, and such map data may be used in combination with the sensor-derived data to generate a descriptor of the current state 312; See provisional application # 62/564,165, filed on 09/27/17, Paragraph 0041); 
compare the data signal to one or more predefined patterns to determine one or more long/short term predictor scores ... between a data signal and the one or more predefined patterns (Figure 1, item 113, Perception subsystem, generate abstractions from raw sensor data; Column 6, lines 31-49, According to some embodiments, at various points in time during the course of a journey of the vehicle 110, one or more decision making components 116 (such as the behavior planner 117) may determine the current state of the environment of the vehicle (e.g., its current location and speed, the locations and speeds of other vehicles or objects, and so on). For example, the state may be determined based at least in part on data collected at a local sensor collection 112 and processed at the perception subsystem 113. Corresponding to any given state, a set of feasible or proposed actions may be identified (e.g., by the behavior planner 117 in the depicted embodiment). A given action may be described or represented using a combination of numerous attributes or dimensions, such as a target lane segment which the vehicle may enter, a target speed in that lane segment, relative positioning with respect to other vehicles in the target lane segment (e.g., a position ahead of or behind another vehicle) and so on; Column 12, lines 20-28, Raw input collected via a variety of sensors (e.g., similar to the sensors of local sensor collection 112, discussed in the context of FIG. 1) may be processed (e.g., using a perception subsystem similar to subsystem 113 of FIG. 1) to generate representations 304 of nearby objects and entities. 
In addition, in at least some embodiments, map data 306 may be obtained or stored at the vehicle, and such map data may be used in combination with the sensor-derived data to generate a descriptor of the current state 312; Column 12, lines 52-66, The encodings of the (state, action) pairs may be included in the input data sets of respective instances 330 (such as instances 330A-330K) of a trained deep neural network based reinforcement learning model which has been deployed at the vehicle in the depicted embodiment. The different instances 330 may be executed at least partly in parallel with one another in some embodiments, e.g., using GPU-based devices (or other computing devices supporting parallelism) incorporated within the vehicle. Each of the instances or executions of the model may generate a quality metric termed a Q-value 370 in the depicted embodiment, indicative of the goodness of the corresponding action given the current state and one or more reward functions associated with vehicle trajectories anticipated to be achieved as a result of implementing the actions; Examiner interprets “generate a descriptor of the current state based on map data 306 stored at the vehicle” as “comparing the data signal to one or more predefined patterns.” Also, Examiner notes that Figure 3 reflects a short-term plan and Figure 4 reflects a long-term plan comprising sequences of conditional actions and states which may be reached as a result of the actions. See provisional application # 62/564,165, filed on 09/27/17, Paragraphs 0025, 0041, & 0043); 
[applying] a discount factor ... the long/short term predictor scores (Column 18, lines 49-65, In at least one embodiment, the ability to generate the V(s) values using a single instance of a DNN model may help to simplify or shorten the training time of the models used for Q(s, a) estimations. For example, in a simplified representation, the value iteration update that is used in learning Q(s, a) values may be formulated as follows: $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \cdot \left( r_t + \gamma \cdot \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right)$ (2); In formulation (2), $\alpha$ represents the learning rate, $r_t$ is the reward at some time step t, $\gamma$ is the discount factor, and $\max_a Q(s_{t+1}, a)$ is the estimate of optimal future value. By definition, a model for V(s) would learn to output the same value as $\max_a Q(s, a)$. The V(s) model may have to be executed just once to obtain the estimated optimal future value, and this fact may be used to reduce the overall amount of computation required for training iterations of the Q model in various embodiments; See provisional application # 62/564,165, filed on 09/27/17, Paragraphs 0064-0065);
generate a set of expected rewards corresponding to an action set specific to the data signal using the neural network (see Figure 3 and related text in Column 12, lines 52-67, The encodings of the (state, action) pairs may be included in the input data sets of respective instances 330 (such as instances 330A-330K) of a trained deep neural network based reinforcement learning model which has been deployed at the vehicle in the depicted embodiment. The different instances 330 may be executed at least partly in parallel with one another in some embodiments, e.g., using GPU-based devices (or other computing devices supporting parallelism) incorporated within the vehicle. Each of the instances or executions of the model may generate a quality metric termed a Q-value 370 in the depicted embodiment, indicative of the goodness of the corresponding action given the current state and one or more reward functions associated with vehicle trajectories anticipated to be achieved as a result of implementing the actions. In some embodiments, at least some of the computations of the model instances may be performed using resources that are not incorporated within the vehicle itself—e.g., resources at data centers may be used. In at least one embodiment, at least some instances may not be executed in parallel with one another; See provisional application # 62/564,165, filed on 09/27/17, Paragraph 0043); 
adjust the set of expected rewards based on the discount factor (Column 12, lines 60-66, Each of the instances or executions of the model may generate a quality metric termed a Q-value 370 in the depicted embodiment, indicative of the goodness of the corresponding action given the current state and one or more reward functions associated with vehicle trajectories anticipated to be achieved as a result of implementing the actions; Column 15, lines 29-37, For example, if a particular action (such as suddenly switching a lane to move in front of another vehicle, or passing a bicycle with a very small clearance) is likely to induce negative feelings or negative social reactions among individuals outside (or inside) the vehicle, a lower reward value may be associated with that action and the resulting state (all other factors being equal) than to a state resulting from another action which is less likely to induce negative feelings; Column 18, lines 57-65, In formulation (2), $\alpha$ represents the learning rate, $r_t$ is the reward at some time step t, $\gamma$ is the discount factor, and $\max_a Q(s_{t+1}, a)$ is the estimate of optimal future value. By definition, a model for V(s) would learn to output the same value as $\max_a Q(s, a)$. The V(s) model may have to be executed just once to obtain the estimated optimal future value, and this fact may be used to reduce the overall amount of computation required for training iterations of the Q model in various embodiments; Examiner notes that in Figure 3 lower or higher rewards are given to different actions. Therefore, based on the broadest reasonable interpretation in light of the specification, Levihn et al. discloses a “discount factor” because it can give a lower reward to actions that induce negative feelings and social reactions among individuals outside or inside a vehicle. See provisional application # 62/564,165, filed on 09/27/17, Paragraphs 0043, 0051, and 0064-0065);
select a selected action from the action set based on the set of expected rewards (see Figure 3 and related text in Column 13, lines 5-12, Decision-making components of the vehicle may compare the quality metrics or Q-values 370 associated with the different actions, and identify the “best” action 375 given the current set of information available, based at least in part on the quality metrics. After the best action is selected, a concrete motion plan 380 may be generated to implement the action, and corresponding directives may be sent to the motion-control subsystems of the vehicle; See provisional application # 62/564,165, filed on 09/27/17, Paragraph 0044); 
and initiate the selected action (see Figure 3 and related text in Column 13, lines 5-12, Decision-making components of the vehicle may compare the quality metrics or Q-values 370 associated with the different actions, and identify the “best” action 375 given the current set of information available, based at least in part on the quality metrics. After the best action is selected, a concrete motion plan 380 may be generated to implement the action, and corresponding directives may be sent to the motion-control subsystems of the vehicle; See provisional application # 62/564,165, filed on 09/27/17, Paragraph 0044).
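For illustration only, the value iteration update of formulation (2) and the comparison of Q-values cited above (Column 13, lines 5-12) can be sketched as follows. This is a minimal tabular sketch and not the deep neural network implementation of Levihn et al.; the function names, the state and action labels, the learning rate, the discount factor, and the reward values are all assumptions chosen for the example.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha=0.5, gamma=0.9):
    """Apply one update of formulation (2) to the table q, in place."""
    # Estimate of optimal future value: max over actions of Q(s_{t+1}, a).
    best_next = max(q[(next_state, a)] for a in actions)
    # Term in parentheses in formulation (2).
    td_error = reward + gamma * best_next - q[(state, action)]
    q[(state, action)] += alpha * td_error
    return q[(state, action)]


def select_best_action(q, state, actions):
    """Compare the Q-values of the candidate actions and return the highest-valued one."""
    return max(actions, key=lambda a: q[(state, a)])


# Hypothetical action set and rewards (assumed values, not from the reference).
ACTIONS = ["keep_lane", "change_lane"]
q_table = defaultdict(float)  # every Q-value starts at 0.0

q_update(q_table, "s0", "keep_lane", reward=1.0, next_state="s1", actions=ACTIONS)
q_update(q_table, "s0", "change_lane", reward=-1.0, next_state="s1", actions=ACTIONS)

best = select_best_action(q_table, "s0", ACTIONS)
print(best, q_table[("s0", "keep_lane")], q_table[("s0", "change_lane")])
```

With these assumed values the action carrying the higher Q-value is the one selected, which mirrors the identification of the “best” action 375 from the Q-values 370.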
Levihn et al. discloses all the limitations above and a perception subsystem (e.g. to generate abstractions from raw sensor data). Examiner notes that in the field of autonomous vehicles, sensor data abstraction is known as a module that provides a mapping from the actual sensor space to the observation space. Although Levihn et al. discloses “sensor data abstraction” (Figure 1, item 113, perception subsystem) and “mapping sensor data to data stored in a database” (Figure 3, item 306, map data), Levihn et al. does not specifically disclose indicating a level of similarity between a data signal and the one or more predefined patterns.
However, Zhang et al. discloses compare the data signal to one or more predefined patterns … indicating a level of similarity between a data signal and the one or more predefined patterns (Column 38, lines 33-47, The HMI module 5006 may use one or more methodologies, techniques, or technologies of motion detection and confirmation for translating an external object's state information into an acknowledgement. For example, with respect to a pedestrian, sensor (e.g., the sensor 1360 of FIG. 1) data (e.g., images, LiDAR data, etc.) may be compared to templates wherein a template correlates to an acknowledgement. For example, the vehicle may include one or more classifiers trained to recognize gestures, movements, and/or body positions and determine an acknowledgement based on state information associated with an external object. For example, a gesture recognition classifier, such as a Dynamic Time Warping (DTW) algorithm may be used to determine whether a received gesture signal matches a gesture template to identify the gesture; Examiner interprets the match as the level of similarity between a data signal and the one or more predefined patterns. Also, Examiner notes that the Applicant specifies that the pattern may be a template, see Paragraph 0022 of Applicant’s specification).
It would have been obvious to one of ordinary skill in the art at the time the invention was filed to modify the computer program product for selecting an action based on reinforced learning, wherein the action is selected based on a particular state of the environment (e.g. objects or feelings identified by the perception subsystem or by the sensors) of the invention of Levihn et al. to further incorporate wherein the state is identified by comparing the data signal to one or more predefined patterns of the invention of Zhang et al. because doing so would allow the program to determine whether a received signal matches a template to identify the gesture (Column 38, lines 33-47, Paragraph 0017). Further, the claimed invention is merely a combination of old elements, and in combination each element would have performed the same function as it did separately, and one of ordinary skill in the art would have recognized that the results of the combination were predictable.
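For illustration only, the template comparison attributed to Zhang et al. (a Dynamic Time Warping algorithm used to determine whether a received signal matches a template) can be sketched as follows. Zhang et al. names the algorithm but the passage cited does not give an implementation, so this is the textbook DTW recurrence on one-dimensional sequences; the function names, the template names and values, and the `max_distance` threshold are assumptions made for the example.

```python
def dtw_distance(signal, template):
    """Classic Dynamic Time Warping distance between two 1-D sequences."""
    n, m = len(signal), len(template)
    inf = float("inf")
    # cost[i][j] is the cheapest alignment of signal[:i] with template[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(signal[i - 1] - template[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def match_template(signal, templates, max_distance):
    """Return (name, distance) of the closest template, or (None, distance) if none is close enough."""
    distances = {name: dtw_distance(signal, t) for name, t in templates.items()}
    best = min(distances, key=distances.get)
    if distances[best] <= max_distance:
        return best, distances[best]
    return None, distances[best]


# Hypothetical gesture templates and a received signal (assumed values).
TEMPLATES = {"wave": [0, 1, 0, 1, 0], "stop": [0, 1, 1, 1, 0]}
received = [0, 1, 1, 0, 1, 0]  # a "wave" with one sample held longer

print(match_template(received, TEMPLATES, max_distance=0.5))
```

A smaller distance corresponds to a closer match, so the distance can serve as one possible measure of how similar the data signal is to a stored template; the threshold decides whether the result is treated as a match at all.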
Although Levihn et al. discloses applying a discount factor to the long/short term predictor scores (Column 18, lines 49-65, γ is the discount factor), the combination of Levihn et al. and Zhang et al. does not specifically disclose wherein a different discount factor is applied based on the long/short term predictor scores.
However, Bennett et al. discloses adjusting a discount factor based on the long/short term predictor scores (Paragraph 0027, The treatment of time—whether it is continuous or discrete, and (if the latter) how time units are determined—is a critical aspect in any modeling effort, as are the trade-offs between solution quality and solution time. Problems may be either finite-horizon or infinite-horizon. In either case, utilities/rewards of various decisions may be undiscounted or discounted, where discounting increases the importance of short-term utilities/rewards over long-term ones, see A. J. Schaefer, M. D. Bailey, S. M. Shechter, and M. S. Roberts, Modeling Medical Treatment Using Markov Decision Processes, in: M. L. Brandeau, F. Sainfort, and W. P. Pierskalla, eds., Operations Research and Health Care, (Kluwer Academic Publishers, Boston, 2005) 593-612, the disclosures of which are incorporated by reference herein; Examiner notes that a higher discount is applied when predicting short-term rewards and a lower discount is applied when predicting long-term rewards).
It would have been obvious to one of ordinary skill in the art at the time the invention was filed to modify the computer program product for selecting an action based on reinforced learning, wherein the expected reward for each action is discounted based on a discount factor of the invention of Levihn et al. to further incorporate adjusting a discount factor based on the long/short term predictor scores of the invention of Bennett et al. because doing so would allow the program to discount rewards of various decisions to increase the importance of short-term utilities/rewards over long-term ones (see Bennett et al., Paragraph 0027). Further, the claimed invention is merely a combination of old elements, and in combination each element would have performed the same function as it did separately, and one of ordinary skill in the art would have recognized that the results of the combination were predictable.
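For illustration only, the effect described in the cited passage of Bennett et al. (discounting increases the importance of short-term rewards over long-term ones) can be shown with a short numeric sketch. It assumes the conventional formulation in which a reward received t steps in the future is weighted by γ^t; that convention, the function name, and the reward sequences are assumptions introduced here and are not taken from the cited references.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a reward sequence, with t starting at 0."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


# Two hypothetical plans: one pays off immediately, one pays more but later.
early_plan = [10.0, 0.0, 0.0, 0.0]
late_plan = [0.0, 0.0, 0.0, 12.0]

for gamma in (0.9, 0.99):
    early = discounted_return(early_plan, gamma)
    late = discounted_return(late_plan, gamma)
    preferred = "early" if early > late else "late"
    print(f"gamma={gamma}: early={early:.3f}, late={late:.3f}, preferred={preferred}")
```

Under these assumed values the early plan scores higher at γ = 0.9 and the late plan scores higher at γ = 0.99, so the choice of discount factor alone can change which action appears best.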
Regarding claim 9 (Currently Amended), Levihn et al. discloses a computer-implemented method, comprising (Figure 13, item 9000, Computing device; Column 1, lines 43-46, Various embodiments of methods and apparatus for evaluating varying-size action spaces for autonomous vehicles using neural network-based reinforcement learning models are described):
collecting a set of conditions associated with an operating context in which to apply a neural network, the neural network comprising a multi-layered matrix of nodes corresponding to actions having various weights (Column 14, lines 19-29, FIG. 5 illustrates an example neural network architecture which may be used to generate value metrics for respective encodings of state and action combinations, according to at least some embodiments. A convolutional neural network is illustrated by way of example, comprising an input layer 502 to which an encoding of (state, action) combinations denoted as (s, a_j) may be provided, one or more convolutional network layer groups 510 such as 510A and 510B, and a fully-connected layer 530 at which a single Q-value 540 denoted as Q(s, a_j) associated with a particular action a_j may be generated. Values of various weights, biases and/or other parameters at the different layers may be adjusted using back-propagation during the training phase of the model; Examiner interprets the state and action pairs/combinations as the set of conditions associated with an operating context, wherein each state and action pair is derived from sensor data (as seen in Figure 3); See provisional application # 62/564,165, filed on 09/27/17, Paragraph 0048); 
creating a first training set comprising the collected set of conditions (see Figure 11 and related text in Column 19, lines 45-67, The data may be aggregated at one or more primary model training data centers 1120 in the depicted embodiment. The data centers may comprise numerous computing platforms, storage platforms and the like, from which some number of training platforms 1122 may be selected to train and evaluate neural network-based models 1150 using any of a variety of machine learning algorithms of a library 1124 (e.g., including algorithms which rely on simulations of driver behavior and/or autonomous vehicle behavior as discussed earlier). Trained models 1150, which may for example the types of DNN-based reinforcement learning models discussed earlier, may be transmitted to autonomous vehicles 1172 (e.g., AV 1172A-1172C) of fleets 1170 in the depicted embodiment. The trained models may be executed using local computing resources at the autonomous vehicle and data collected by local sensors of the autonomous vehicles, e.g., to predict vehicle environment states, evaluate and select actions, generate motion control directives to achieve vehicle trajectories which meet safety, efficiency and other desired criteria, and so on. At least a subset of the decisions made at the vehicle, as well as the local sensor data collected, may be transmitted back to the data centers as part of the ongoing data collection approach, and uses to improve and update the models in various embodiments. In some embodiments, updated versions of the models may be transmitted to the autonomous vehicle fleet from the data centers periodically, e.g., as improvements in the model accuracy and/or efficiency are achieved; Examiner interprets the data collected by local sensors of the autonomous vehicles as the collected set of conditions; See provisional application # 62/564,165, filed on 09/27/17, Paragraph 0069); 
training the neural network by applying reinforced learning using the first training set until the neural network produces results within a satisfactory range for operational use, wherein the reinforced learning improves the neural network after a plurality of training cycles, each training cycle comprising determining an action based on the first training set (see Figure 11 and related text in Column 19, lines 45-67, The data may be aggregated at one or more primary model training data centers 1120 in the depicted embodiment. The data centers may comprise numerous computing platforms, storage platforms and the like, from which some number of training platforms 1122 may be selected to train and evaluate neural network-based models 1150 using any of a variety of machine learning algorithms of a library 1124 (e.g., including algorithms which rely on simulations of driver behavior and/or autonomous vehicle behavior as discussed earlier). Trained models 1150, which may for example the types of DNN-based reinforcement learning models discussed earlier, may be transmitted to autonomous vehicles 1172 (e.g., AV 1172A-1172C) of fleets 1170 in the depicted embodiment. The trained models may be executed using local computing resources at the autonomous vehicle and data collected by local sensors of the autonomous vehicles, e.g., to predict vehicle environment states, evaluate and select actions, generate motion control directives to achieve vehicle trajectories which meet safety, efficiency and other desired criteria, and so on. At least a subset of the decisions made at the vehicle, as well as the local sensor data collected, may be transmitted back to the data centers as part of the ongoing data collection approach, and uses to improve and update the models in various embodiments. 
In some embodiments, updated versions of the models may be transmitted to the autonomous vehicle fleet from the data centers periodically, e.g., as improvements in the model accuracy and/or efficiency are achieved; Examiner notes that the model uses the decisions made by the vehicle to further reevaluate/retrain the neural network. Therefore, improving the neural network over time; See provisional application # 62/564,165, filed on 09/27/17, Paragraph 0069), determining a reward corresponding to the action (Column 15, lines 17-51, As shown, the simulation model 620 may also indicate various parameters or elements of a reward function 614 which may be used to assign values to the attained states and the corresponding actions in the depicted embodiment. The reward associated with a given state or (state, action) sequence may be based on a set of parameters that include, for example, the progress made towards the destination of the journey in progress, the probability of avoiding a collision, the extent to which a set of applicable traffic rules is obeyed by the vehicle, a comfort level of one or more occupants of the vehicle, and/or an anticipated social interaction of an occupant of the first vehicle with one or more individuals outside the first vehicle. For example, if a particular action (such as suddenly switching a lane to move in front of another vehicle, or passing a bicycle with a very small clearance) is likely to induce negative feelings or negative social reactions among individuals outside (or inside) the vehicle, a lower reward value may be associated with that action and the resulting state (all other factors being equal) than to a state resulting from another action which is less likely to induce negative feelings. In various embodiments, the simulation model 612 may be trained at least in part on recorded observations obtained from a large number of real-world journeys of autonomous or non-autonomous vehicles. 
Training iterations of the form indicated in FIG. 6 may be performed in various embodiments until a convergence criterion is met—e.g., until the action selected in various scenarios by the agent model during the course of a journey results in near-optimal reward function values. After the training is complete, the trained agent model may be deployed as the evaluation model to a fleet of vehicles. In some embodiments, the evaluation model(s) deployed to the fleet may comprise elements or all of the simulation model as well as the agent model; See provisional application # 62/564,165, filed on 09/27/17, Paragraph 0051), determining a difference between the reward and an optimal reward, and updating the various weights of the neural network based on the difference (Column 14, lines 19-29, FIG. 5 illustrates an example neural network architecture which may be used to generate value metrics for respective encodings of state and action combinations, according to at least some embodiments. A convolutional neural network is illustrated by way of example, comprising an input layer 502 to which an encoding of (state, action) combinations denoted as (s, a_j) may be provided, one or more convolutional network layer groups 510 such as 510A and 510B, and a fully-connected layer 530 at which a single Q-value 540 denoted as Q(s, a_j) associated with a particular action a_j may be generated. Values of various weights, biases and/or other parameters at the different layers may be adjusted using back-propagation during the training phase of the model; Column 15, lines 17-51, As shown, the simulation model 620 may also indicate various parameters or elements of a reward function 614 which may be used to assign values to the attained states and the corresponding actions in the depicted embodiment. 
The reward associated with a given state or (state, action) sequence may be based on a set of parameters that include, for example, the progress made towards the destination of the journey in progress, the probability of avoiding a collision, the extent to which a set of applicable traffic rules is obeyed by the vehicle, a comfort level of one or more occupants of the vehicle, and/or an anticipated social interaction of an occupant of the first vehicle with one or more individuals outside the first vehicle. For example, if a particular action (such as suddenly switching a lane to move in front of another vehicle, or passing a bicycle with a very small clearance) is likely to induce negative feelings or negative social reactions among individuals outside (or inside) the vehicle, a lower reward value may be associated with that action and the resulting state (all other factors being equal) than to a state resulting from another action which is less likely to induce negative feelings. In various embodiments, the simulation model 612 may be trained at least in part on recorded observations obtained from a large number of real-world journeys of autonomous or non-autonomous vehicles. Training iterations of the form indicated in FIG. 6 may be performed in various embodiments until a convergence criterion is met—e.g., until the action selected in various scenarios by the agent model during the course of a journey results in near-optimal reward function values. After the training is complete, the trained agent model may be deployed as the evaluation model to a fleet of vehicles. 
In some embodiments, the evaluation model(s) deployed to the fleet may comprise elements or all of the simulation model as well as the agent model; Examiner notes that “training until a convergence criterion is met” includes the step of “determining the difference between the reward and an optimal reward” as it’s comparing the reward with the optimal reward to find the actions that result in near-optimal reward function values; See provisional application # 62/564,165, filed on 09/27/17, Paragraphs 0048 & 0051); 
receiving a data signal (Figure 3, item 304, Objects/entities derived from sensor data; Column 12, lines 20-28, Raw input collected via a variety of sensors (e.g., similar to the sensors of local sensor collection 112, discussed in the context of FIG. 1) may be processed (e.g., using a perception subsystem similar to subsystem 113 of FIG. 1) to generate representations 304 of nearby objects and entities. In addition, in at least some embodiments, map data 306 may be obtained or stored at the vehicle, and such map data may be used in combination with the sensor-derived data to generate a descriptor of the current state 312; See provisional application # 62/564,165, filed on 09/27/17, Paragraph 0041); 
comparing the data signal to one or more predefined patterns to determine one or more long/short term predictor scores ... between a data signal and the one or more predefined patterns (Figure 1, item 113, Perception subsystem, generate abstractions from raw sensor data; Column 6, lines 31-49, According to some embodiments, at various points in time during the course of a journey of the vehicle 110, one or more decision making components 116 (such as the behavior planner 117) may determine the current state of the environment of the vehicle (e.g., its current location and speed, the locations and speeds of other vehicles or objects, and so on). For example, the state may be determined based at least in part on data collected at a local sensor collection 112 and processed at the perception subsystem 113. Corresponding to any given state, a set of feasible or proposed actions may be identified (e.g., by the behavior planner 117 in the depicted embodiment). A given action may be described or represented using a combination of numerous attributes or dimensions, such as a target lane segment which the vehicle may enter, a target speed in that lane segment, relative positioning with respect to other vehicles in the target lane segment (e.g., a position ahead of or behind another vehicle) and so on; Column 12, lines 20-28, Raw input collected via a variety of sensors (e.g., similar to the sensors of local sensor collection 112, discussed in the context of FIG. 1) may be processed (e.g., using a perception subsystem similar to subsystem 113 of FIG. 1) to generate representations 304 of nearby objects and entities. 
In addition, in at least some embodiments, map data 306 may be obtained or stored at the vehicle, and such map data may be used in combination with the sensor-derived data to generate a descriptor of the current state 312; Column 12, lines 52-66, The encodings of the (state, action) pairs may be included in the input data sets of respective instances 330 (such as instances 330A-330K) of a trained deep neural network based reinforcement learning model which has been deployed at the vehicle in the depicted embodiment. The different instances 330 may be executed at least partly in parallel with one another in some embodiments, e.g., using GPU-based devices (or other computing devices supporting parallelism) incorporated within the vehicle. Each of the instances or executions of the model may generate a quality metric termed a Q-value 370 in the depicted embodiment, indicative of the goodness of the corresponding action given the current state and one or more reward functions associated with vehicle trajectories anticipated to be achieved as a result of implementing the actions; Examiner interprets “generate a descriptor of the current state based on map data 306 stored at the vehicle” as “comparing the data signal to one or more predefined patterns.” Also, Examiner notes that Figure 3 reflects a short-term plan and Figure 4 reflects a long-term plan comprising sequences of conditional actions and states which may be reached as a result of the actions. See provisional application # 62/564,165, filed on 09/27/17, Paragraphs 0025, 0041, & 0043); 
[applying] a discount factor ... the long/short term predictor scores (Column 18, lines 49-65, In at least one embodiment, the ability to generate the V(s) values using a single instance of a DNN model may help to simplify or shorten the training time of the models used for Q(s, a) estimations. For example, in a simplified representation, the value iteration update that is used in learning Q(s, a) values may be formulated as follows: Q(s_t, a_t) ← Q(s_t, a_t) + α·(r_t + γ·max_a Q(s_{t+1}, a) − Q(s_t, a_t)) (2); In formulation (2), α represents the learning rate, r_t is the reward at some time step t, γ is the discount factor, and max_a Q(s_{t+1}, a) is the estimate of optimal future value. By definition, a model for V(s) would learn to output the same value as max_a Q(s, a). The V(s) model may have to be executed just once to obtain the estimated optimal future value, and this fact may be used to reduce the overall amount of computation required for training iterations of the Q model in various embodiments; See provisional application # 62/564,165, filed on 09/27/17, Paragraphs 0064-0065); 
generating a set of expected rewards corresponding to an action set specific to the data signal using the neural network (see Figure 3 and related text in Column 12, lines 52-67, The encodings of the (state, action) pairs may be included in the input data sets of respective instances 330 (such as instances 330A-330K) of a trained deep neural network based reinforcement learning model which has been deployed at the vehicle in the depicted embodiment. The different instances 330 may be executed at least partly in parallel with one another in some embodiments, e.g., using GPU-based devices (or other computing devices supporting parallelism) incorporated within the vehicle. Each of the instances or executions of the model may generate a quality metric termed a Q-value 370 in the depicted embodiment, indicative of the goodness of the corresponding action given the current state and one or more reward functions associated with vehicle trajectories anticipated to be achieved as a result of implementing the actions. In some embodiments, at least some of the computations of the model instances may be performed using resources that are not incorporated within the vehicle itself—e.g., resources at data centers may be used. In at least one embodiment, at least some instances may not be executed in parallel with one another; See provisional application # 62/564,165, filed on 09/27/17, Paragraph 0043); 
adjusting the set of expected rewards based on the discount factor (Column 12, lines 60-66, Each of the instances or executions of the model may generate a quality metric termed a Q-value 370 in the depicted embodiment, indicative of the goodness of the corresponding action given the current state and one or more reward functions associated with vehicle trajectories anticipated to be achieved as a result of implementing the actions; Column 15, lines 29-37, For example, if a particular action (such as suddenly switching a lane to move in front of another vehicle, or passing a bicycle with a very small clearance) is likely to induce negative feelings or negative social reactions among individuals outside (or inside) the vehicle, a lower reward value may be associated with that action and the resulting state (all other factors being equal) than to a state resulting from another action which is less likely to induce negative feelings; Column 18, lines 57-65, In formulation (2), α represents the learning rate, r_t is the reward at some time step t, γ is the discount factor, and max_a Q(s_{t+1}, a) is the estimate of optimal future value. By definition, a model for V(s) would learn to output the same value as max_a Q(s, a). The V(s) model may have to be executed just once to obtain the estimated optimal future value, and this fact may be used to reduce the overall amount of computation required for training iterations of the Q model in various embodiments; Examiner notes that in Figure 3 lower or higher rewards are given to different actions. Therefore, based on broadest reasonable interpretation in light of the specification, Levihn et al. discloses a “discount factor” because it can give a lower reward to actions that induce negative feelings and social reactions among individuals outside or inside a vehicle. See provisional application # 62/564,165, filed on 09/27/17, Paragraphs 0043, 0051, and 0064-0065); 
selecting a selected action from the action set based on the set of expected rewards (see Figure 3 and related text in Column 13, lines 5-12, Decision-making components of the vehicle may compare the quality metrics or Q-values 370 associated with the different actions, and identify the “best” action 375 given the current set of information available, based at least in part on the quality metrics. After the best action is selected, a concrete motion plan 380 may be generated to implement the action, and corresponding directives may be sent to the motion-control subsystems of the vehicle; See provisional application # 62/564,165, filed on 09/27/17, Paragraph 0044); 
and initiating the selected action (see Figure 3 and related text in Column 13, lines 5-12, Decision-making components of the vehicle may compare the quality metrics or Q-values 370 associated with the different actions, and identify the “best” action 375 given the current set of information available, based at least in part on the quality metrics. After the best action is selected, a concrete motion plan 380 may be generated to implement the action, and corresponding directives may be sent to the motion-control subsystems of the vehicle; See provisional application # 62/564,165, filed on 09/27/17, Paragraph 0044).
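For illustration only, the value update of formulation (2) quoted above from Levihn et al. (Column 18), together with the comparison of Q-values used to identify the "best" action (Column 13), can be sketched as follows. This is a minimal sketch under stated assumptions: a plain dictionary stands in for the deep neural network, unseen entries default to 0.0, and the function names, state labels, and action labels are hypothetical and do not appear in the cited references.

```python
def q_update(q, state, action, reward, next_state, actions, alpha, gamma):
    """Apply one update of formulation (2) to a tabular Q estimate.

    q is a dict keyed by (state, action); alpha is the learning rate and
    gamma is the discount factor, as named in the quoted passage.
    """
    # Estimate of optimal future value: max over actions of Q(s_{t+1}, a).
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q[(state, action)]


def select_action(q, state, actions):
    """Pick the action with the highest Q-value for the given state."""
    return max(actions, key=lambda a: q.get((state, a), 0.0))


# Hypothetical usage with made-up states and actions.
actions = ["keep_lane", "change_lane"]
q_table = {}
q_update(q_table, "s0", "keep_lane", 1.0, "s1", actions, alpha=0.5, gamma=0.9)
print(select_action(q_table, "s0", actions))
```

The bracketed term in the update is the difference between the observed reward plus discounted future value and the current estimate, which is the quantity the update uses to adjust the stored value.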
Levihn et al. discloses all the limitations above and a perception subsystem (e.g. to generate abstractions from raw sensor data). Examiner notes that in the field of autonomous vehicles, sensor data abstraction is known as a module that provides a mapping from the actual sensor space to the observation space. Although Levihn et al. discloses “sensor data abstraction” (Figure 1, item 113, perception subsystem) and “mapping sensor data to data stored in a database” (Figure 3, item 306, map data), Levihn et al. does not specifically disclose indicating a level of similarity between a data signal and the one or more predefined patterns.
However, Zhang et al. discloses comparing the data signal to one or more predefined patterns … indicating a level of similarity between a data signal and the one or more predefined patterns (Column 38, lines 33-47, The HMI module 5006 may use one or more methodologies, techniques, or technologies of motion detection and confirmation for translating an external object's state information into an acknowledgement. For example, with respect to a pedestrian, sensor (e.g., the sensor 1360 of FIG. 1) data (e.g., images, LiDAR data, etc.) may be compared to templates wherein a template correlates to an acknowledgement. For example, the vehicle may include one or more classifiers trained to recognize gestures, movements, and/or body positions and determine an acknowledgement based on state information associated with an external object. For example, a gesture recognition classifier, such as a Dynamic Time Warping (DTW) algorithm may be used to determine whether a received gesture signal matches a gesture template to identify the gesture; Examiner interprets the match as the level of similarity between a data signal and the one or more predefined patterns. Also, Examiner notes that the Applicant specifies that the pattern may be a template, see Paragraph 0022 of Applicant’s specification).
It would have been obvious to one of ordinary skill in the art at the time the invention was filed to modify the computer-implemented method for selecting an action based on reinforced learning, wherein the action is selected based on a particular state of the environment (e.g. objects or feelings identified by the perception subsystem or by the sensors) of the invention of Levihn et al. to further incorporate wherein the state is identified by comparing the data signal to one or more predefined patterns of the invention of Zhang et al. because doing so would allow the program to determine whether a received signal matches a template to identify the gesture (Column 38, lines 33-47, Paragraph 0017). Further, the claimed invention is merely a combination of old elements, and in combination each element would have performed the same function as it did separately, and one of ordinary skill in the art would have recognized that the results of the combination were predictable.
Although Levihn et al. discloses applying a discount factor to the long/short term predictor scores (Column 18, lines 49-65, γ is the discount factor), the combination of Levihn et al. and Zhang et al. does not specifically disclose wherein a different discount factor is applied based on the long/short term predictor scores.
However, Bennett et al. discloses adjusting a discount factor based on the long/short term predictor scores (Paragraph 0027, The treatment of time—whether it is continuous or discrete, and (if the latter) how time units are determined—is a critical aspect in any modeling effort, as are the trade-offs between solution quality and solution time. Problems may be either finite-horizon or infinite-horizon. In either case, utilities/rewards of various decisions may be undiscounted or discounted, where discounting increases the importance of short-term utilities/rewards over long-term ones, see A. J. Schaefer, M. D. Bailey, S. M. Shechter, and M. S. Roberts, Modeling Medical Treatment Using Markov Decision Processes, in: M. L. Brandeau, F. Sainfort, and W. P. Pierskalla, eds., Operations Research and Health Care, (Kluwer Academic Publishers, Boston, 2005) 593-612, the disclosures of which are incorporated by reference herein; Examiner notes that a higher discount is applied when predicting short-term rewards and a lower discount is applied when predicting long-term rewards).
It would have been obvious to one of ordinary skill in the art at the time the invention was filed to modify the computer-implemented method for selecting an action based on reinforced learning, wherein the expected reward for each action is discounted based on a discount factor of the invention of Levihn et al. to further incorporate adjusting a discount factor based on the long/short term predictor scores of the invention of Bennett et al. because doing so would allow the program to discount rewards of various decisions to increase the importance of short-term utilities/rewards over long-term ones (see Bennett et al., Paragraph 0027). Further, the claimed invention is merely a combination of old elements, and in combination each element would have performed the same function as it did separately, and one of ordinary skill in the art would have recognized that the results of the combination were predictable.
Regarding claim 16 (Currently Amended), Levihn et al. discloses a computing device comprising (Figure 13, item 9000, Computing device; Column 1, lines 43-46, Various embodiments of methods and apparatus for evaluating varying-size action spaces for autonomous vehicles using neural network-based reinforcement learning models are described): 
a memory configured to (Figure 13, item 9020, memory): store one or more predefined patterns (Figure 1, item 113, Perception subsystem, generate abstractions from raw sensor data; Examiner notes that in the field of autonomous vehicles, sensor data abstraction is known as a module that provides a mapping from the actual sensor space to the observation space); store an action set (Figure 3, items 320A-320C, encoding of state-action pair); and store a deep neural network (Figure 3, items 330A-330K, Instance of DNN-based reinforcement learning model); 
a receiver configured to receive a data signal (Figure 3, item 304, Objects/entities derived from sensor data; Column 12, lines 20-28, Raw input collected via a variety of sensors (e.g., similar to the sensors of local sensor collection 112, discussed in the context of FIG. 1) may be processed (e.g., using a perception subsystem similar to subsystem 113 of FIG. 1) to generate representations 304 of nearby objects and entities. In addition, in at least some embodiments, map data 306 may be obtained or stored at the vehicle, and such map data may be used in combination with the sensor-derived data to generate a descriptor of the current state 312; See provisional application # 62/564,165, filed on 09/27/17, Paragraph 0041); 
and a processor coupled to the memory and the receiver (Figure 13, Processor 9010, I/O devices 9035, and Main memory 9020), the processor configured to: 
collect a set of conditions associated with an operating context in which to apply a neural network, the neural network comprising a multi-layered matrix of nodes corresponding to actions having various weights (Column 14, lines 19-29, FIG. 5 illustrates an example neural network architecture which may be used to generate value metrics for respective encodings of state and action combinations, according to at least some embodiments. A convolutional neural network is illustrated by way of example, comprising an input layer 502 to which an encoding of (state, action) combinations denoted as (s, a_j) may be provided, one or more convolutional network layer groups 510 such as 510A and 510B, and a fully-connected layer 530 at which a single Q-value 540 denoted as Q(s, a_j) associated with a particular action a_j may be generated. Values of various weights, biases and/or other parameters at the different layers may be adjusted using back-propagation during the training phase of the model; Examiner interprets the state and action pairs/combinations as the set of conditions associated with an operating context, wherein each state and action pair is derived from sensor data (as seen in Figure 3); See provisional application # 62/564,165, filed on 09/27/17, Paragraph 0048); 
create a first training set comprising the collected set of conditions (see Figure 11 and related text in Column 19, lines 45-67, The data may be aggregated at one or more primary model training data centers 1120 in the depicted embodiment. The data centers may comprise numerous computing platforms, storage platforms and the like, from which some number of training platforms 1122 may be selected to train and evaluate neural network-based models 1150 using any of a variety of machine learning algorithms of a library 1124 (e.g., including algorithms which rely on simulations of driver behavior and/or autonomous vehicle behavior as discussed earlier). Trained models 1150, which may for example the types of DNN-based reinforcement learning models discussed earlier, may be transmitted to autonomous vehicles 1172 (e.g., AV 1172A-1172C) of fleets 1170 in the depicted embodiment. The trained models may be executed using local computing resources at the autonomous vehicle and data collected by local sensors of the autonomous vehicles, e.g., to predict vehicle environment states, evaluate and select actions, generate motion control directives to achieve vehicle trajectories which meet safety, efficiency and other desired criteria, and so on. At least a subset of the decisions made at the vehicle, as well as the local sensor data collected, may be transmitted back to the data centers as part of the ongoing data collection approach, and uses to improve and update the models in various embodiments. In some embodiments, updated versions of the models may be transmitted to the autonomous vehicle fleet from the data centers periodically, e.g., as improvements in the model accuracy and/or efficiency are achieved; Examiner interprets the data collected by local sensors of the autonomous vehicles as the collected set of conditions; See provisional application # 62/564,165, filed on 09/27/17, Paragraph 0069); 
train the neural network by applying reinforced learning using the first training set until the neural network produces results within a satisfactory range for operational use, wherein the reinforced learning improves the neural network after a plurality of training cycles, each training cycle comprising determining an action based on the first training set (see Figure 11 and related text in Column 19, lines 45-67, The data may be aggregated at one or more primary model training data centers 1120 in the depicted embodiment. The data centers may comprise numerous computing platforms, storage platforms and the like, from which some number of training platforms 1122 may be selected to train and evaluate neural network-based models 1150 using any of a variety of machine learning algorithms of a library 1124 (e.g., including algorithms which rely on simulations of driver behavior and/or autonomous vehicle behavior as discussed earlier). Trained models 1150, which may for example the types of DNN-based reinforcement learning models discussed earlier, may be transmitted to autonomous vehicles 1172 (e.g., AV 1172A-1172C) of fleets 1170 in the depicted embodiment. The trained models may be executed using local computing resources at the autonomous vehicle and data collected by local sensors of the autonomous vehicles, e.g., to predict vehicle environment states, evaluate and select actions, generate motion control directives to achieve vehicle trajectories which meet safety, efficiency and other desired criteria, and so on. At least a subset of the decisions made at the vehicle, as well as the local sensor data collected, may be transmitted back to the data centers as part of the ongoing data collection approach, and uses to improve and update the models in various embodiments. 
In some embodiments, updated versions of the models may be transmitted to the autonomous vehicle fleet from the data centers periodically, e.g., as improvements in the model accuracy and/or efficiency are achieved; Examiner notes that the model uses the decisions made by the vehicle to further reevaluate/retrain the neural network. Therefore, improving the neural network over time; See provisional application # 62/564,165, filed on 09/27/17, Paragraph 0069), determining a reward corresponding to the action (Column 15, lines 17-51, As shown, the simulation model 620 may also indicate various parameters or elements of a reward function 614 which may be used to assign values to the attained states and the corresponding actions in the depicted embodiment. The reward associated with a given state or (state, action) sequence may be based on a set of parameters that include, for example, the progress made towards the destination of the journey in progress, the probability of avoiding a collision, the extent to which a set of applicable traffic rules is obeyed by the vehicle, a comfort level of one or more occupants of the vehicle, and/or an anticipated social interaction of an occupant of the first vehicle with one or more individuals outside the first vehicle. For example, if a particular action (such as suddenly switching a lane to move in front of another vehicle, or passing a bicycle with a very small clearance) is likely to induce negative feelings or negative social reactions among individuals outside (or inside) the vehicle, a lower reward value may be associated with that action and the resulting state (all other factors being equal) than to a state resulting from another action which is less likely to induce negative feelings. In various embodiments, the simulation model 612 may be trained at least in part on recorded observations obtained from a large number of real-world journeys of autonomous or non-autonomous vehicles. 
Training iterations of the form indicated in FIG. 6 may be performed in various embodiments until a convergence criterion is met—e.g., until the action selected in various scenarios by the agent model during the course of a journey results in near-optimal reward function values. After the training is complete, the trained agent model may be deployed as the evaluation model to a fleet of vehicles. In some embodiments, the evaluation model(s) deployed to the fleet may comprise elements or all of the simulation model as well as the agent model; See provisional application # 62/564,165, filed on 09/27/17, Paragraph 0051), determining a difference between the reward and an optimal reward, and updating the various weights of the neural network based on the difference (Column 14, lines 19-29, FIG. 5 illustrates an example neural network architecture which may be used to generate value metrics for respective encodings of state and action combinations, according to at least some embodiments. A convolutional neural network is illustrated by way of example, comprising an input layer 502 to which an encoding of (state, action) combinations denoted as (s, a.sub.j) may be provided, one or more convolutional network layer groups 510 such as 510A and 510B, and a fully-connected layer 530 at which a single Q-value 540 denoted as Q(s, a.sub.j) associated with a particular action a.sub.j may be generated. Values of various weights, biases and/or other parameters at the different layers may be adjusted using back-propagation during the training phase of the model; Column 15, lines 17-51, As shown, the simulation model 620 may also indicate various parameters or elements of a reward function 614 which may be used to assign values to the attained states and the corresponding actions in the depicted embodiment. 
The reward associated with a given state or (state, action) sequence may be based on a set of parameters that include, for example, the progress made towards the destination of the journey in progress, the probability of avoiding a collision, the extent to which a set of applicable traffic rules is obeyed by the vehicle, a comfort level of one or more occupants of the vehicle, and/or an anticipated social interaction of an occupant of the first vehicle with one or more individuals outside the first vehicle. For example, if a particular action (such as suddenly switching a lane to move in front of another vehicle, or passing a bicycle with a very small clearance) is likely to induce negative feelings or negative social reactions among individuals outside (or inside) the vehicle, a lower reward value may be associated with that action and the resulting state (all other factors being equal) than to a state resulting from another action which is less likely to induce negative feelings. In various embodiments, the simulation model 612 may be trained at least in part on recorded observations obtained from a large number of real-world journeys of autonomous or non-autonomous vehicles. Training iterations of the form indicated in FIG. 6 may be performed in various embodiments until a convergence criterion is met—e.g., until the action selected in various scenarios by the agent model during the course of a journey results in near-optimal reward function values. After the training is complete, the trained agent model may be deployed as the evaluation model to a fleet of vehicles. 
In some embodiments, the evaluation model(s) deployed to the fleet may comprise elements or all of the simulation model as well as the agent model; Examiner notes that “training until a convergence criterion is met” includes the step of “determining the difference between the reward and an optimal reward” as it’s comparing the reward with the optimal reward to find the actions that result in near-optimal reward function values; See provisional application # 62/564,165, filed on 09/27/17, Paragraphs 0048 & 0051);  
compare the data signal to one or more predefined patterns to determine one or more long/short term predictor scores ... between a data signal and the one or more predefined patterns (Figure 1, item 113, Perception subsystem, generate abstractions from raw sensor data; Column 6, lines 31-49, According to some embodiments, at various points in time during the course of a journey of the vehicle 110, one or more decision making components 116 (such as the behavior planner 117) may determine the current state of the environment of the vehicle (e.g., its current location and speed, the locations and speeds of other vehicles or objects, and so on). For example, the state may be determined based at least in part on data collected at a local sensor collection 112 and processed at the perception subsystem 113. Corresponding to any given state, a set of feasible or proposed actions may be identified (e.g., by the behavior planner 117 in the depicted embodiment). A given action may be described or represented using a combination of numerous attributes or dimensions, such as a target lane segment which the vehicle may enter, a target speed in that lane segment, relative positioning with respect to other vehicles in the target lane segment (e.g., a position ahead of or behind another vehicle) and so on; Column 12, lines 20-28, Raw input collected via a variety of sensors (e.g., similar to the sensors of local sensor collection 112, discussed in the context of FIG. 1) may be processed (e.g., using a perception subsystem similar to subsystem 113 of FIG. 1) to generate representations 304 of nearby objects and entities. 
In addition, in at least some embodiments, map data 306 may be obtained or stored at the vehicle, and such map data may be used in combination with the sensor-derived data to generate a descriptor of the current state 312; Column 12, lines 52-66, The encodings of the (state, action) pairs may be included in the input data sets of respective instances 330 (such as instances 330A-330K) of a trained deep neural network based reinforcement learning model which has been deployed at the vehicle in the depicted embodiment. The different instances 330 may be executed at least partly in parallel with one another in some embodiments, e.g., using GPU-based devices (or other computing devices supporting parallelism) incorporated within the vehicle. Each of the instances or executions of the model may generate a quality metric termed a Q-value 370 in the depicted embodiment, indicative of the goodness of the corresponding action given the current state and one or more reward functions associated with vehicle trajectories anticipated to be achieved as a result of implementing the actions; Examiner interprets “generate a descriptor of the current state based on map data 306 stored at the vehicle” as “comparing the data signal to one or more predefined patterns.” Also, Examiner notes that Figure 3 reflects a short-term plan and Figure 4 reflects a long-term plan comprising sequences of conditional actions and states which may be reached as a result of the actions. See provisional application # 62/564,165, filed on 09/27/17, Paragraphs 0025, 0041, & 0043); 
[applying] a discount factor ... the long/short term predictor scores (Column 18, lines 49-65, In at least one embodiment, the ability to generate the V(s) values using a single instance of a DNN model may help to simplify or shorten the training time of the models used for Q(s, a) estimations. For example, in a simplified representation, the value iteration update that is used in learning Q(s, a) values may be formulated as follows: Q(s.sub.t, a.sub.t) ← Q(s.sub.t, a.sub.t) + α · (r.sub.t + γ · max.sub.a Q(s.sub.t+1, a) − Q(s.sub.t, a.sub.t)) (2); In formulation (2), α represents the learning rate, r.sub.t is the reward at some time step t, γ is the discount factor, and max.sub.a Q(s.sub.t+1,a) is the estimate of optimal future value. By definition, a model for V(s) would learn to output the same value as max.sub.a Q(s, a). The V(s) model may have to be executed just once to obtain the estimated optimal future value, and this fact may be used to reduce the overall amount of computation required for training iterations of the Q model in various embodiments; See provisional application # 62/564,165, filed on 09/27/17, Paragraphs 0064-0065); 
generate a set of expected rewards corresponding to an action set specific to the data signal using the neural network (see Figure 3 and related text in Column 12, lines 52-67, The encodings of the (state, action) pairs may be included in the input data sets of respective instances 330 (such as instances 330A-330K) of a trained deep neural network based reinforcement learning model which has been deployed at the vehicle in the depicted embodiment. The different instances 330 may be executed at least partly in parallel with one another in some embodiments, e.g., using GPU-based devices (or other computing devices supporting parallelism) incorporated within the vehicle. Each of the instances or executions of the model may generate a quality metric termed a Q-value 370 in the depicted embodiment, indicative of the goodness of the corresponding action given the current state and one or more reward functions associated with vehicle trajectories anticipated to be achieved as a result of implementing the actions. In some embodiments, at least some of the computations of the model instances may be performed using resources that are not incorporated within the vehicle itself—e.g., resources at data centers may be used. In at least one embodiment, at least some instances may not be executed in parallel with one another; See provisional application # 62/564,165, filed on 09/27/17, Paragraph 0043); 
adjust the set of expected rewards based on the discount factor (Column 12, lines 60-66, Each of the instances or executions of the model may generate a quality metric termed a Q-value 370 in the depicted embodiment, indicative of the goodness of the corresponding action given the current state and one or more reward functions associated with vehicle trajectories anticipated to be achieved as a result of implementing the actions; Column 15, lines 29-37, For example, if a particular action (such as suddenly switching a lane to move in front of another vehicle, or passing a bicycle with a very small clearance) is likely to induce negative feelings or negative social reactions among individuals outside (or inside) the vehicle, a lower reward value may be associated with that action and the resulting state (all other factors being equal) than to a state resulting from another action which is less likely to induce negative feelings; Column 18, lines 57-65, In formulation (2), α represents the learning rate, r.sub.t is the reward at some time step t, γ is the discount factor, and max.sub.a Q(s.sub.t+1,a) is the estimate of optimal future value. By definition, a model for V(s) would learn to output the same value as max.sub.a Q(s, a). The V(s) model may have to be executed just once to obtain the estimated optimal future value, and this fact may be used to reduce the overall amount of computation required for training iterations of the Q model in various embodiments; Examiner notes that in Figure 3 lower or higher rewards are given to different actions. Therefore, based on broadest reasonable interpretation in light of the specification, Levihn et al. discloses a “discount factor” because it can give a lower reward to actions that induce negative feelings and social reactions among individuals outside or inside a vehicle. See provisional application # 62/564,165, filed on 09/27/17, Paragraphs 0043, 0051, and 0064-0065); 
select a selected action from the action set based on the set of expected rewards (see Figure 3 and related text in Column 13, lines 5-12, Decision-making components of the vehicle may compare the quality metrics or Q-values 370 associated with the different actions, and identify the “best” action 375 given the current set of information available, based at least in part on the quality metrics. After the best action is selected, a concrete motion plan 380 may be generated to implement the action, and corresponding directives may be sent to the motion-control subsystems of the vehicle; See provisional application # 62/564,165, filed on 09/27/17, Paragraph 0044); 
and initiate the selected action (see Figure 3 and related text in Column 13, lines 5-12, Decision-making components of the vehicle may compare the quality metrics or Q-values 370 associated with the different actions, and identify the “best” action 375 given the current set of information available, based at least in part on the quality metrics. After the best action is selected, a concrete motion plan 380 may be generated to implement the action, and corresponding directives may be sent to the motion-control subsystems of the vehicle; See provisional application # 62/564,165, filed on 09/27/17, Paragraph 0044).
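For illustration only, the value update of formulation (2) quoted above from Levihn et al. (Column 18, lines 49-65) can be sketched as follows. The sketch assumes a simple lookup table in place of the deep neural network described in the reference, and the function name, state labels, action labels, and numeric values are hypothetical; none of them are taken from Levihn et al. or from Applicant's specification.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha, gamma):
    """Apply one update of formulation (2) to a tabular Q estimate.

    q       : mapping from (state, action) to the current Q estimate
    alpha   : learning rate
    gamma   : discount factor
    actions : candidate actions available in next_state
    """
    # Estimate of optimal future value: max over a of Q(s_{t+1}, a)
    best_next = max(q[(next_state, a)] for a in actions)
    # Difference between the discounted target and the current estimate
    td_error = reward + gamma * best_next - q[(state, action)]
    q[(state, action)] += alpha * td_error
    return q[(state, action)]


# Hypothetical action labels and values, chosen only to make the arithmetic easy.
ACTIONS = ["accelerate", "decelerate", "change_lanes"]
q_table = defaultdict(float)
q_table[("s1", "decelerate")] = 2.0

new_value = q_update(
    q_table, "s0", "accelerate",
    reward=1.0, next_state="s1", actions=ACTIONS,
    alpha=0.5, gamma=0.5,
)
```

With these assumed values the update is 0 + 0.5 · (1.0 + 0.5 · 2.0 − 0), so the estimate for ("s0", "accelerate") moves from 0 to 1.0.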
Levihn et al. discloses all the limitations above and a perception subsystem (e.g. to generate abstractions from raw sensor data). Examiner notes that in the field of autonomous vehicles, sensor data abstraction is known as a module that provides a mapping from the actual sensor space to the observation space. Although Levihn et al. discloses “sensor data abstraction” (Figure 1, item 113, perception subsystem) and “mapping sensor data to data stored in a database” (Figure 3, item 306, map data), Levihn et al. does not specifically disclose indicating a level of similarity between a data signal and the one or more predefined patterns.
However, Zhang et al. discloses compare the data signal to one or more predefined patterns … indicating a level of similarity between a data signal and the one or more predefined patterns (Column 38, lines 33-47, The HMI module 5006 may use one or more methodologies, techniques, or technologies of motion detection and confirmation for translating an external object's state information into an acknowledgement. For example, with respect to a pedestrian, sensor (e.g., the sensor 1360 of FIG. 1) data (e.g., images, LiDAR data, etc.) may be compared to templates wherein a template correlates to an acknowledgement. For example, the vehicle may include one or more classifiers trained to recognize gestures, movements, and/or body positions and determine an acknowledgement based on state information associated with an external object. For example, a gesture recognition classifier, such as a Dynamic Time Warping (DTW) algorithm may be used to determine whether a received gesture signal matches a gesture template to identify the gesture; Examiner interprets the match as the level of similarity between a data signal and the one or more predefined patterns. Also, Examiner notes that the Applicant specifies that the pattern may be a template, see Paragraph 0022 of Applicant’s specification).
It would have been obvious to one of ordinary skill in the art at the time the invention was filed to modify the computer program product for selecting an action based on reinforced learning, wherein the action is selected based on a particular state of the environment (e.g. objects or feelings identified by the perception subsystem or by the sensors) of the invention of Levihn et al. to further incorporate wherein the state is identified by comparing the data signal to one or more predefined patterns of the invention of Zhang et al. because doing so would allow the program to determine whether a received signal matches a template to identify the gesture (Column 38, lines 33-47, Paragraph 0017). Further, the claimed invention is merely a combination of old elements, and in combination each element would have performed the same function as it did separately, and one of ordinary skill in the art would have recognized that the results of the combination were predictable.
Although Levihn et al. discloses applying a discount factor to the long/short term predictor scores (Column 18, lines 49-65, γ is the discount factor), the combination of Levihn et al. and Zhang et al. does not specifically disclose wherein a different discount factor is applied based on the long/short term predictor scores.
However, Bennett et al. discloses adjusting a discount factor based on the long/short term predictor scores (Paragraph 0027, The treatment of time—whether it is continuous or discrete, and (if the latter) how time units are determined—is a critical aspect in any modeling effort, as are the trade-offs between solution quality and solution time. Problems may be either finite-horizon or infinite-horizon. In either case, utilities/rewards of various decisions may be undiscounted or discounted, where discounting increases the importance of short-term utilities/rewards over long-term ones, see A. J. Schaefer, M. D. Bailey, S. M. Shechter, and M. S. Roberts, Modeling Medical Treatment Using Markov Decision Processes, in: M. L. Brandeau, F. Sainfort, and W. P. Pierskalla, eds., Operations Research and Health Care, (Kluwer Academic Publishers, Boston, 2005) 593-612, the disclosures of which are incorporated by reference herein; Examiner notes that a higher discount is applied when predicting short-term rewards and a lower discount is applied when predicting long-term rewards).
It would have been obvious to one of ordinary skill in the art at the time the invention was filed to modify the computer program product for selecting an action based on reinforced learning, wherein the expected reward for each action is discounted based on a discount factor of the invention of Levihn et al. to further incorporate adjusting a discount factor based on the long/short term predictor scores of the invention of Bennett et al. because doing so would allow the program to discount rewards of various decisions to increase the importance of short-term utilities/rewards over long-term ones (see Bennett et al., Paragraph 0027). Further, the claimed invention is merely a combination of old elements, and in combination each element would have performed the same function as it did separately, and one of ordinary skill in the art would have recognized that the results of the combination were predictable.
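For illustration only, the effect of discounting described in Bennett et al. (Paragraph 0027), namely that discounting increases the importance of short-term rewards over long-term ones, can be sketched with a standard discounted sum. The function name, the reward sequence, and the two discount factor values are hypothetical and are not taken from any cited reference.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, where the reward at step t is weighted by gamma ** t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


# Assumed reward sequence: the same reward is received at four successive steps.
rewards = [1.0, 1.0, 1.0, 1.0]

# A smaller discount factor shrinks the contribution of later rewards.
short_term_weighted = discounted_return(rewards, gamma=0.5)  # 1 + 0.5 + 0.25 + 0.125
# A discount factor of 1.0 leaves the rewards undiscounted.
undiscounted = discounted_return(rewards, gamma=1.0)
```

Under these assumptions the discounted sum is 1.875 and the undiscounted sum is 4.0, so the later rewards contribute progressively less when the discount factor is below 1.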
Regarding claim 2 (Currently Amended), which is dependent of claim 1, the combination of Levihn et al., Zhang et al., and Bennett et al. discloses all the limitations in claim 1. Levihn et al. further discloses wherein the selected action is an action in the action set having a highest expected reward in the set of expected rewards (Column 3, lines 1-19, The method may comprise executing multiple instances of such a model in some embodiments, and obtaining respective value metrics for respective actions from the multiple instances. For example, if four actions a1, a2, a3 and a4 are to be evaluated with respect to a given state s, four instances of the model may be executed in some embodiments. Respective encodings of (s, a1), (s, a2), (s, a3) and (s, a4) may be provided as input data sets to the four instances, and respective estimated value metrics Q(s, a1), Q(s, a2), Q(s, a3) and Q(s, a4) may be obtained from the instances. The estimated value metrics may be used to select a particular action for implementation: e.g., if Q(s, a3) corresponds to the highest of the four value metrics in the above example, a3 may be chosen for implementation; Column 13, lines 5-12, Decision-making components of the vehicle may compare the quality metrics or Q-values 370 associated with the different actions, and identify the “best” action 375 given the current set of information available, based at least in part on the quality metrics. After the best action is selected, a concrete motion plan 380 may be generated to implement the action, and corresponding directives may be sent to the motion-control subsystems of the vehicle; See provisional application # 62/564,165, filed on 09/27/17, Paragraphs 0005 & 0043-0044).
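For illustration only, the four-action example quoted from Levihn et al. (Column 3, lines 1-19) amounts to choosing the action with the highest value metric. In the sketch below the Q-values are hypothetical numbers chosen so that a3 is highest, mirroring the quoted example; the function name is likewise an assumption.

```python
def select_best_action(q_values):
    """Return the action whose estimated value metric is highest."""
    return max(q_values, key=q_values.get)


# Assumed value metrics Q(s, a1) through Q(s, a4) for a single state s.
q_values = {"a1": 0.2, "a2": 0.5, "a3": 0.9, "a4": 0.1}
best_action = select_best_action(q_values)
```

With these assumed values, a3 is selected for implementation.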
Regarding claims 3 and 10 (Original), which are dependent of claims 1 and 9, the combination of Levihn et al., Zhang et al., and Bennett et al. discloses all the limitations in claims 1 and 9. Levihn et al. discloses comparing the data signal to the predefined patterns … to determine similarity indices as long/short term predictor scores (Column 9, lines 32-41, A wide variety of sensors may be included in collection 112 in the depicted embodiment, including externally-oriented cameras, occupant-oriented sensors (which may, for example, include cameras pointed primarily towards occupants' faces, or physiological signal detectors such as heart rate detectors and the like, and may be able to provide evidence of the comfort level or stress level of the occupants), Global Positioning System (GPS) devices, radar devices, LIDAR (light detection and ranging) devices and so on; Figure 3, item 304, Objects/entities derived from sensor data; Figure 3, item 370, Q-value; Column 12, lines 20-28, Raw input collected via a variety of sensors (e.g., similar to the sensors of local sensor collection 112, discussed in the context of FIG. 1) may be processed (e.g., using a perception subsystem similar to subsystem 113 of FIG. 1) to generate representations 304 of nearby objects and entities. In addition, in at least some embodiments, map data 306 may be obtained or stored at the vehicle, and such map data may be used in combination with the sensor-derived data to generate a descriptor of the current state 312; Column 12, lines 52-66, The encodings of the (state, action) pairs may be included in the input data sets of respective instances 330 (such as instances 330A-330K) of a trained deep neural network based reinforcement learning model which has been deployed at the vehicle in the depicted embodiment. 
The different instances 330 may be executed at least partly in parallel with one another in some embodiments, e.g., using GPU-based devices (or other computing devices supporting parallelism) incorporated within the vehicle. Each of the instances or executions of the model may generate a quality metric termed a Q-value 370 in the depicted embodiment, indicative of the goodness of the corresponding action given the current state and one or more reward functions associated with vehicle trajectories anticipated to be achieved as a result of implementing the actions). 
Although Levihn et al. discloses: receiving and analyzing sensor data to identify a state (e.g. stress level, comfort level, or objects); and generating a long/short term predictor score based on the identified state from the sensors, the combination of Levihn et al., Zhang et al., and Bennett et al. does not specifically disclose applying dynamic time warping to determine similarity indices.
	However, Zhang et al. discloses wherein comparing the data signal to the predefined patterns includes applying dynamic time warping to determine similarity indices … (Column 38, lines 33-47, The HMI module 5006 may use one or more methodologies, techniques, or technologies of motion detection and confirmation for translating an external object's state information into an acknowledgement. For example, with respect to a pedestrian, sensor (e.g., the sensor 1360 of FIG. 1) data (e.g., images, LiDAR data, etc.) may be compared to templates wherein a template correlates to an acknowledgement. For example, the vehicle may include one or more classifiers trained to recognize gestures, movements, and/or body positions and determine an acknowledgement based on state information associated with an external object. For example, a gesture recognition classifier, such as a Dynamic Time Warping (DTW) algorithm may be used to determine whether a received gesture signal matches a gesture template to identify the gesture).
	It would have been obvious to one of ordinary skill in the art at the time the invention was filed to modify the computer program product for selecting an action based on reinforced learning, wherein the action is selected based on a particular state of the environment (e.g. objects or feelings identified by the perception subsystem or by the sensors) of the invention of Levihn et al. to further incorporate wherein the state is identified by comparing the data signal to the predefined patterns includes applying dynamic time warping to determine similarity indices of the invention of Zhang et al. because doing so would allow the program to determine whether a received signal matches a template to identify the gesture (Column 38, lines 33-47, Paragraph 0017). Further, the claimed invention is merely a combination of old elements, and in combination each element would have performed the same function as it did separately, and one of ordinary skill in the art would have recognized that the results of the combination were predictable.
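For illustration only, the template comparison described in Zhang et al. (Column 38, lines 33-47) can be sketched with the textbook dynamic time warping recurrence. This is a generic formulation, not the implementation of Zhang et al. or of Applicant; the one-dimensional signals, the template names, and the use of absolute difference as the local cost are all assumptions.

```python
def dtw_distance(signal, template):
    """Classic dynamic time warping distance between two numeric sequences.

    A smaller distance indicates a closer match between signal and template.
    """
    n, m = len(signal), len(template)
    inf = float("inf")
    # cost[i][j] holds the best cumulative cost aligning signal[:i] with template[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(signal[i - 1] - template[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # signal sample repeats against the same template sample
                cost[i][j - 1],      # template sample repeats against the same signal sample
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


def best_template(signal, templates):
    """Return the name of the template with the smallest DTW distance."""
    return min(templates, key=lambda name: dtw_distance(signal, templates[name]))


# Hypothetical gesture templates and a received signal with one repeated sample.
templates = {"wave": [0, 1, 2, 3], "stop": [3, 3, 3, 3]}
received = [0, 1, 1, 2, 3]
match = best_template(received, templates)
```

Under these assumptions the received signal aligns with the "wave" template at a distance of 0 despite the repeated sample, while its distance to "stop" is 8, so "wave" is reported as the match.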
Regarding claims 4, 11, and 17 (Original), which are dependent of claims 1, 9, and 16, the combination of Levihn et al., Zhang et al., and Bennett et al. discloses all the limitations in claims 1, 9, and 16. Levihn et al. further discloses wherein the program instructions are further executable by the processor to: extract quantitative data from context sources related to the data signal (Column 9, lines 17-41, Inputs may be collected at various sampling frequencies from individual sensors of the vehicle's sensor collection 112 in different embodiments via an intermediary perception subsystem 113 by the behavior planner 117, the motion selector 118 and/or the action space evaluation models 133. The perception subsystem may generate higher-level objects or abstractions derived from the raw sensor data in various embodiments, which may be more appropriate for analysis by the decision components than the raw sensor data itself. In one embodiment, an intermediary perception subsystem 113 may not be required. Different sensors may be able to update their output at different maximum rates in some embodiments, and as a result the rate at which the output derived from the sensors is obtained at the various decision making components may also vary from one sensor to another. A wide variety of sensors may be included in collection 112 in the depicted embodiment, including externally-oriented cameras, occupant-oriented sensors (which may, for example, include cameras pointed primarily towards occupants' faces, or physiological signal detectors such as heart rate detectors and the like, and may be able to provide evidence of the comfort level or stress level of the occupants), Global Positioning System (GPS) devices, radar devices, LIDAR (light detection and ranging) devices and so on); 
generate context data describing data signal context based on the quantitative data (Column 9, lines 17-41, Inputs may be collected at various sampling frequencies from individual sensors of the vehicle's sensor collection 112 in different embodiments via an intermediary perception subsystem 113 by the behavior planner 117, the motion selector 118 and/or the action space evaluation models 133. The perception subsystem may generate higher-level objects or abstractions derived from the raw sensor data in various embodiments, which may be more appropriate for analysis by the decision components than the raw sensor data itself. In one embodiment, an intermediary perception subsystem 113 may not be required. Different sensors may be able to update their output at different maximum rates in some embodiments, and as a result the rate at which the output derived from the sensors is obtained at the various decision making components may also vary from one sensor to another. A wide variety of sensors may be included in collection 112 in the depicted embodiment, including externally-oriented cameras, occupant-oriented sensors (which may, for example, include cameras pointed primarily towards occupants' faces, or physiological signal detectors such as heart rate detectors and the like, and may be able to provide evidence of the comfort level or stress level of the occupants), Global Positioning System (GPS) devices, radar devices, LIDAR (light detection and ranging) devices and so on; Examiner interprets the “inputs collected from the sensors” as the context data); 
and generate the set of expected rewards corresponding to the action set based in part on the context data (Column 9, lines 2-13, At the vehicle, input collected from local sensors 112 and communication devices 114 may be provided to the model(s) 133 (as well as to other decision making components such as the behavior planner 117 and motion selector 118). The output value metrics of the model(s) 133 may be used at the motion selector and/or the behavior planner to generate motion control directives 134 (such as the logical equivalents of commands to "apply brakes" or "accelerate") in the depicted embodiment, which may be transmitted to the vehicle motion control subsystems 120 to achieve or realize desired movements or trajectories 122; Figure 3, item 370, Q-value; Column 12, lines 20-28, Raw input collected via a variety of sensors (e.g., similar to the sensors of local sensor collection 112, discussed in the context of FIG. 1) may be processed (e.g., using a perception subsystem similar to subsystem 113 of FIG. 1) to generate representations 304 of nearby objects and entities. In addition, in at least some embodiments, map data 306 may be obtained or stored at the vehicle, and such map data may be used in combination with the sensor-derived data to generate a descriptor of the current state 312; Column 12, lines 52-66, The encodings of the (state, action) pairs may be included in the input data sets of respective instances 330 (such as instances 330A-330K) of a trained deep neural network based reinforcement learning model which has been deployed at the vehicle in the depicted embodiment. The different instances 330 may be executed at least partly in parallel with one another in some embodiments, e.g., using GPU-based devices (or other computing devices supporting parallelism) incorporated within the vehicle. 
Each of the instances or executions of the model may generate a quality metric termed a Q-value 370 in the depicted embodiment, indicative of the goodness of the corresponding action given the current state and one or more reward functions associated with vehicle trajectories anticipated to be achieved as a result of implementing the actions).
Regarding claims 7, 14, and 19 (Original), which depend from claims 4, 11, and 17, the combination of Levihn et al., Zhang et al., and Bennett et al. discloses all the limitations in claims 4, 11, and 17. Levihn et al. further discloses wherein the data signal is vehicle sensor data (Column 9, lines 17-41, Inputs may be collected at various sampling frequencies from individual sensors of the vehicle's sensor collection 112 in different embodiments via an intermediary perception subsystem 113 by the behavior planner 117, the motion selector 118 and/or the action space evaluation models 133), wherein the context sources include travel condition data (Column 11, lines 36-42, First, the dynamically changing environment of vehicle 250 may be inherently stochastic rather than deterministic, with noisy rather than full and accurate data (such as velocity, position, or heading) available with respect to other vehicles 201 and other relevant objects (such as debris in the road, potholes, signs, etc.)), and wherein the action set includes an accelerate action, a decelerate action, a constant speed action, a stop action, an emergency stop action, and a change lanes action (Column 6, lines 2-7, The term "autonomous vehicle" may be used broadly herein to refer to vehicles for which at least some motion-related decisions (e.g., whether to accelerate, slow down, change lanes, etc.) may be made, at least at some points in time, without direct input from the vehicle's occupants; Column 8, lines 1-10, The motion selector 118 may determine the content of the directives to be provided to the motion control subsystems (i.e., whether braking to slow speed by X units is required, whether acceleration by Y units is required, whether a turn or lane change is to be implemented, etc.) 
based on several inputs in the depicted embodiment, including conditional action and state sequences generated by the behavior planner 117 (as indicated by arrow 119), data obtained from sensor collection 112 via perception subsystem 113, and/or value estimates generated for various actions using models 133; Column 9, lines 2-13, At the vehicle, input collected from local sensors 112 and communication devices 114 may be provided to the model(s) 133 (as well as to other decision making components such as the behavior planner 117 and motion selector 118). The output value metrics of the model(s) 133 may be used at the motion selector and/or the behavior planner to generate motion control directives 134 (such as the logical equivalents of commands to "apply brakes" or "accelerate") in the depicted embodiment, which may be transmitted to the vehicle motion control subsystems 120 to achieve or realize desired movements or trajectories 122).
Regarding claims 8, 15, and 20 (Original), which depend from claims 4, 11, and 17, the combination of Levihn et al., Zhang et al., and Bennett et al. discloses all the limitations in claims 4, 11, and 17. Although Levihn et al. discloses receiving a data signal and selecting an action based on the expected rewards generated according to reinforced learning, the combination of Levihn et al. and Zhang et al. does not specifically disclose wherein the reinforced learning is applied in the automated medical diagnosis and treatment. 
	However, Bennett et al. discloses wherein the data signal is patient data, wherein the context sources include biometric data (Paragraph 0097, FIGS. 7 and 8 illustrate two exemplary embodiments of a computer based system of the present invention. FIG. 7 is focused on Health Care System 702, which has data collections from several sources including, but not limited to, Electronic Health Record (EHR) database 704 (representing a collection of patient specific information), PACS database 708 (representing a collection of patient specific images and related observed information), and predictive database 706 (representing a collection of population statistics such as disease indications or genetic tendencies). Other sources of information may also be useful to the above described methods and processes, and may be captured in a central location such as Health Care System 702, for example from Patient History 752 or 762, Patient Observation 730 (representing devices that monitor and provide patient data, including but not limited to pulse from a suitable sensor, blood pressure from a suitable sensor, brain wave from a suitable sensor, blood testing from a suitable sensor or a lab report, genetic testing from a micro-array or lab report, etc.), and observations from doctors or other clinicians 720 taken by a computer, tablet, smart phone, etc. Finally, Patient 750 or 760 may provide data in the form of prior or current measurements, personal feelings or observations, or other historical evidence), and wherein the action set includes a change regimen action, a continue regimen action, and a stop treatment action (Paragraph 0007, In one embodiment, autonomous AI software resides within patient monitoring computation devices and within doctor assisting computation devices. 
Information from such patient monitoring is communicated to the doctor assisting devices and may influence the doctor through a new recommendation or a change in the treatment decisions or beliefs of the doctor. Such AI software then analyzes the effects of these treatment decisions and delivers updated patient-outcome prediction results to the doctor; Paragraph 0027, The treatment of time—whether it is continuous or discrete, and (if the latter) how time units are determined—is a critical aspect in any modeling effort, as are the trade-offs between solution quality and solution time. Problems may be either finite-horizon or infinite-horizon. In either case, utilities/rewards of various decisions may be undiscounted or discounted, where discounting increases the importance of short-term utilities/rewards over long-term ones, see A. J. Schaefer, M. D. Bailey, S. M. Shechter, and M. S. Roberts, Modeling Medical Treatment Using Markov Decision Processes, in: M. L. Brandeau, F. Sainfort, and W. P. Pierskalla, eds., Operations Research and Health Care, (Kluwer Academic Publishers, Boston, 2005) 593-612, the disclosures of which are incorporated by reference herein; Paragraph 0068, The decision-making environment may be modeled as a finite-horizon, undiscounted, sequential decision-making process in which the state st from the state space S consists of a patient's health status at time t. At each time step the physician agent makes a decision to treat or stop treatment (an action at from the binary action space A={0,1}). Here time corresponds to the number of treatment sessions since the patient's first visit (typically one session=one week). The physician agent receives rewards/utilities, and is asked to pick actions in order to maximize overall utilities. Similar decision-making models were used in references [19,31,32]. In one exemplary embodiment, this decision is modeled as a dynamic decision network (DDN, a type of dynamic Bayesian network), as seen in FIG. 
4; Paragraph 0069, In FIG. 4, the graphic depicts a dynamic decision network for clinical decision-making, with the following types of nodes: a=action (e.g. treatment option, or not treat), s=state (patient's actual underlying health status, not directly observable), o=observation (patient's observed status/outcome, may be missing), c=treatment costs, CPUC=utilities/rewards (cost per unit change of some outcome measure). The subscripts represent time slices (e.g. treatment sessions); Paragraph 0072, The effects of actions on the state may be modeled using a transition model (T R) that encodes the probabilistic effects of various treatment actions; Paragraph 0073, The physician agent's performance objective is often to maximize the improvement in patient health, as measured by the change in CDOI-ORS score at the end of treatment, while minimizing cost of treatment (e.g. by stopping treatment when the probability of further improvement is low), to maximize the utility of the physician's performance; Paragraph 0074, In all cases, the treatment decision is based on the belief state rather than the true underlying state (which cannot be directly observed)—that is, a strategy π is defined as a map from belief states to actions: π:2.sup.s A. In other words, the physician agent's reasoning is performed in a space of belief states, which are probability distributions over the patient's health status, b.sub.t(s)=P(s.sub.t=s). For instance, we cannot directly observe a patient's disease state (e.g. diabetes); rather, we take measurements of symptoms (e.g. blood glucose) and attempt to classify the patient into some underlying disease or health state. Furthermore, in approximately 30% of our data, the clinician makes a treatment decision when the CDOI-ORS observation is missing (i.e. partially observable environment), and the belief state must be inferred from previous belief states (see below). 
The belief state categories are the same as those described above for the true underlying state (High Deterioration, Flatline, etc.). The determination of the treatment decision may be extended to reason optimally when integrating unobserved health factors based on their probabilistic relationship to observed clinical/demographic characteristics, as well as account for non-deterministic effects of variable treatment options).
It would have been obvious to one of ordinary skill in the art at the time the invention was filed to modify the computer program product for selecting an action based on reinforced learning of the invention of Levihn et al. to further incorporate wherein the actions include a change regimen action, a continue regimen action, and a stop treatment action of the invention of Bennett et al. because doing so would allow the program to use artificial intelligence to improve decision-making and the fundamental understanding of the healthcare system and clinical process (see Bennett et al., Paragraph 0005). Further, the claimed invention is merely a combination of old elements, and in combination each element would have performed the same function as it did separately, and one of ordinary skill in the art would have recognized that the results of the combination were predictable.


Claims 5, 12, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Levihn et al. (US 11,243,532 B1), in view of Zhang et al. (US 10,434,935 B1), in further view of Bennett et al. (US 2015/0019241 A1), Burhani et al. (US 2019/0361739 A1) and Getson et al. (US 2015/0254765 A1).
Regarding claims 5 and 12 (Original), which depend from claims 4 and 11, the combination of Levihn et al., Zhang et al., and Bennett et al. discloses all the limitations in claims 4 and 11. Although Levihn et al. discloses receiving a data signal and selecting an action based on the expected rewards generated according to reinforced learning, the combination of Levihn et al., Zhang et al., and Bennett et al. does not specifically disclose wherein the reinforced learning is applied in the automated investment trading. 
	However, Burhani et al. discloses wherein the data signal is a price indicator for a financial instrument (Paragraph 0055, Feature extraction unit 112 is configured to process input data to compute a variety of features. The input data can represent a trade order. Example features include pricing features, volume features, time features, Volume Weighted Average Price features, market spread features; See provisional application # 62/676,707, filed on 05/25/18, Paragraph 0032), wherein the context sources are financial data documents related to the price indicator (Paragraph 0048, 170 to receive input data and receive output data for storage. In some embodiments, the input data can represent trade orders, quotes, and/or other market data; See provisional application # 62/676,707, filed on 05/25/18, Paragraph 0026), and wherein the action set includes a buy action, a sell action, … (Paragraph 0084-0088, In some embodiments, the computing system 100 can configured interface application 130 with different hot keys for triggering control commands which can trigger different operations by computing system 100. One Hot Key for Buy and Sell: In some embodiments, the computing system 100 can configured interface application 130 with different hot keys for triggering control commands. An array representing one hot key encoding for Buy and Sell signals can be provided as follows: Buy: [1, 0] Sell: [0, 1]; See provisional application # 62/676,707, filed on 05/25/18, Paragraphs 0061-0063).
It would have been obvious to one of ordinary skill in the art at the time the invention was filed to modify the computer program product for selecting an action based on reinforced learning of the invention of Levihn et al. to further incorporate wherein the action set includes a buy and a sell action of the invention of Burhani et al. because doing so would allow the program to make a new decision based on the new state observation using reinforcement learning networks (see Burhani et al., Paragraph 0068). Further, the claimed invention is merely a combination of old elements, and in combination each element would have performed the same function as it did separately, and one of ordinary skill in the art would have recognized that the results of the combination were predictable.
	Although Levihn et al. and Burhani et al. disclose all the limitations above and wherein the action set includes a buy action and a sell action, the combination of Levihn et al. and Burhani et al. does not specifically disclose a hold action.
	However, Getson et al. discloses wherein the action set includes a buy action, a sell action, and a hold action (Paragraph 0026, At step 108, a decision is made based upon a relationship between past and present values at each occurrence of the first evaluation bar characteristic for the predetermined historical period of time. For example, with respect to the calculated data from steps 106 and 107, going back five years, at each 31 minute interval, a present value of the RSI is calculated. The present value of the RSI is compared to a past value of the RSI calculated 31 minutes beforehand. If a relationship of present value to past value is one of higher, lower or no change, a corresponding trading decision such as buy, sell or hold is made).
It would have been obvious to one of ordinary skill in the art at the time the invention was filed to modify the computer program product for selecting an action based on reinforced learning, wherein the action set includes a buy action and a sell action, of the invention of Levihn et al. and Burhani et al. to further incorporate a hold action of the invention of Getson et al. because doing so would allow the program to make a decision based upon a relationship between past and present values, wherein the trading decisions include to buy, sell, or hold (see Getson et al., Paragraph 0026). Further, the claimed invention is merely a combination of old elements, and in combination each element would have performed the same function as it did separately, and one of ordinary skill in the art would have recognized that the results of the combination were predictable.
Regarding claim 18 (Original), which depends from claim 17, the combination of Levihn et al., Zhang et al., and Bennett et al. discloses all the limitations in claim 17. Although Levihn et al. discloses receiving a data signal and selecting an action based on the expected rewards generated according to reinforced learning, the combination of Levihn et al. and Zhang et al. does not specifically disclose wherein the reinforced learning is applied in the automated investment trading. 
	However, Burhani et al. discloses wherein the data signal is a price indicator for a financial instrument (Paragraph 0055, Feature extraction unit 112 is configured to process input data to compute a variety of features. The input data can represent a trade order. Example features include pricing features, volume features, time features, Volume Weighted Average Price features, market spread features; See provisional application # 62/676,707, filed on 05/25/18, Paragraph 0032), wherein the context sources are financial data documents related to the price indicator (Paragraph 0048, 170 to receive input data and receive output data for storage. In some embodiments, the input data can represent trade orders, quotes, and/or other market data; See provisional application # 62/676,707, filed on 05/25/18, Paragraph 0026), and wherein the action set includes a buy action, a sell action, … (Paragraph 0084-0088, In some embodiments, the computing system 100 can configured interface application 130 with different hot keys for triggering control commands which can trigger different operations by computing system 100. One Hot Key for Buy and Sell: In some embodiments, the computing system 100 can configured interface application 130 with different hot keys for triggering control commands. An array representing one hot key encoding for Buy and Sell signals can be provided as follows: Buy: [1, 0] Sell: [0, 1] ; See provisional application # 62/676,707, filed on 05/25/18, Paragraphs 0061-0063).
It would have been obvious to one of ordinary skill in the art at the time the invention was filed to modify the computer program product for selecting an action based on reinforced learning of the invention of Levihn et al. to further incorporate wherein the action set includes a buy and a sell action of the invention of Burhani et al. because doing so would allow the program to make a new decision based on the new state observation using reinforcement learning networks (see Burhani et al., Paragraph 0068). Further, the claimed invention is merely a combination of old elements, and in combination each element would have performed the same function as it did separately, and one of ordinary skill in the art would have recognized that the results of the combination were predictable.
	Although Levihn et al. and Burhani et al. disclose all the limitations above and wherein the action set includes a buy action and a sell action, the combination of Levihn et al. and Burhani et al. does not specifically disclose a hold action, a buy to cover action, and a sell short action.
	However, Getson et al. discloses wherein the action set includes a buy action, a sell action, a hold action, a buy to cover action, and a sell short action (Paragraph 0026, At step 108, a decision is made based upon a relationship between past and present values at each occurrence of the first evaluation bar characteristic for the predetermined historical period of time. For example, with respect to the calculated data from steps 106 and 107, going back five years, at each 31 minute interval, a present value of the RSI is calculated. The present value of the RSI is compared to a past value of the RSI calculated 31 minutes beforehand. If a relationship of present value to past value is one of higher, lower or no change, a corresponding trading decision such as buy, sell or hold is made; Paragraph 0051, Technical indicators may include anything that indicate whether to buy or sell a tradable item based on activity. The activity may include any of price activity, volume activity, time activity, market activity, economic activity, or weather activity. Tradable items may include any of an index, a stock, a bond, a commodity, a sports score, an article of real estate, or another asset. Decisions supported and/or executed by the exemplary systems and methods described herein may include any of buying, selling, selling short, and/or buying to cover).
It would have been obvious to one of ordinary skill in the art at the time the invention was filed to modify the computer program product for selecting an action based on reinforced learning, wherein the action set includes a buy action and a sell action, of the invention of Levihn et al. and Burhani et al. to further incorporate a buy to cover action and a sell short action of the invention of Getson et al. because doing so would allow the program to make a decision based on activities, wherein the decisions include any of buying, selling, selling short, and/or buying to cover (see Getson et al., Paragraph 0051). Further, the claimed invention is merely a combination of old elements, and in combination each element would have performed the same function as it did separately, and one of ordinary skill in the art would have recognized that the results of the combination were predictable.

Claims 6 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Levihn et al. (US 11,243,532 B1), in view of Zhang et al. (US 10,434,935 B1), in further view of Bennett et al. (US 2015/0019241 A1) and Getson et al. (US 2015/0254765 A1).
Regarding claims 6 and 13 (Original), which depend from claims 4 and 11, the combination of Levihn et al., Zhang et al., and Bennett et al. discloses all the limitations in claims 4 and 11. Although Levihn et al. discloses receiving a data signal and selecting an action based on the expected rewards generated according to reinforced learning, the combination of Levihn et al., Zhang et al., and Bennett et al. does not specifically disclose wherein the reinforced learning is applied in the automated investment trading.
	However, Getson et al. discloses wherein the action set includes a buy to cover action and a sell short action (Paragraph 0051, Technical indicators may include anything that indicate whether to buy or sell a tradable item based on activity. The activity may include any of price activity, volume activity, time activity, market activity, economic activity, or weather activity. Tradable items may include any of an index, a stock, a bond, a commodity, a sports score, an article of real estate, or another asset. Decisions supported and/or executed by the exemplary systems and methods described herein may include any of buying, selling, selling short, and/or buying to cover).
It would have been obvious to one of ordinary skill in the art at the time the invention was filed to modify the computer program product for selecting an action based on reinforced learning of the invention of Levihn et al. to further incorporate wherein the action set includes a buy to cover action and a sell short action of the invention of Getson et al. because doing so would allow the program to make a decision based on activities, wherein the decisions include any of buying, selling, selling short, and/or buying to cover (see Getson et al., Paragraph 0051). Further, the claimed invention is merely a combination of old elements, and in combination each element would have performed the same function as it did separately, and one of ordinary skill in the art would have recognized that the results of the combination were predictable.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure.
Modayil (Modayil, J., White, A. and Sutton, R.S., 2014. Multi-timescale nexting in a reinforcement learning robot. Adaptive Behavior, 22(2), pp.146-160) – discloses a discount rate for an individual prediction to vary over time depending on the state the robot finds itself in (see Page 154, 5 Beyond simple timescales).

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to MARJORIE PUJOLS-CRUZ whose telephone number is (571)272-4668. The examiner can normally be reached Mon-Thu 7:30 AM - 5:00 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Patricia H Munson can be reached on (571)270-5396. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/M.P./
Examiner, Art Unit 3624

/PATRICIA H MUNSON/
Supervisory Patent Examiner, Art Unit 3624