DETAILED ACTION

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114

A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 12/13/2021 has been entered.

This action is responsive to the original application filed on 3/7/2018 and the Remarks and Amendments filed on 12/13/2021.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. § 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1-4, 6, 8-10, 12-15, and 17-20 are rejected under 35 U.S.C. 103 as being obvious over Shalev-Shwartz et al. (US 20180032082 A1, hereinafter “Shalev-Shwartz”) in view of Englard et al. (US 20190113927 A1, hereinafter “Englard”).

Regarding claim 1, Shalev-Shwartz discloses [a] computer-implemented method for deep reinforcement learning to control a subject device, the method comprising: (Abstract; “Systems and methods are provided for navigating an autonomous vehicle using reinforcement learning techniques”, which discloses a method for deep reinforcement learning to control a subject device or autonomous vehicle; and [0072]; “Each memory 140, 150 may include software instructions that when executed by a processor (e.g., applications processor 180 and/or image processor 190), may control operation of various aspects of system 100. These memory units may include various databases and image processing software, as well as a trained system, such as a neural network, or a deep neural network, for example”, suggesting the use of a deep system that uses a deep neural network as well as a computer to implement the method)
training, by a processor, a neural network to receive state information of a target of the subject device as an input and provide action information for the target as an output; ([0178]; “Returning to driving policy module 803, in some embodiments, a trained system trained through reinforcement learning may be used to implement driving policy module 803”; and [0179]; “Training of the system using reinforcement learning may involve learning a driving policy in order to map from sensed states to navigational actions. A driving policy is a function $\pi: S \to A$, where S is a set of states and $A \subset \mathbb{R}^2$ is the action space (e.g., desired speed, acceleration, yaw commands, etc.). The state space is $S = S_s \times S_p$, where $S_s$ is the sensing state and $S_p$ is additional information on the state saved by the policy”, which discloses the training of a system using reinforcement learning that receives state information of a target (sensed states) as input and provides action information for the target (action space) as output; and [0173]; “Furthermore, any of the modules (e.g., modules 801, 803, and 805) disclosed herein may implement techniques associated with a trained system (such as a neural network or a deep neural network) or an untrained system”, which discloses that the trained system is a trained neural network that receives state information and provides action information; and [0174]; “The sensed state may include sensed information relating to target vehicles, lane markings, pedestrians, traffic lights, road geometry, lane shape, obstacles, distances to other objects/vehicles, relative velocities, relative accelerations, among any other potential sensed information”, which discloses that the target of the subject device is a target vehicle; and [0180]; “Based on the reward feedback, the system may "learn" the policy and becomes trained in producing desirable navigational actions. For example, the learning system may observe the current state $s_t \in S$ and decide on an action $a_t \in A$”; and [0203]; “target lane”, which further discloses the target; and [0176]; “The output of driving policy module 803 may include at least one navigational action for the host vehicle and may include a desired acceleration (which may translate to an updated speed for the host vehicle), a desired yaw rate for the host vehicle, a desired trajectory, among other potential desired navigational actions”, which discloses that the choice of trajectory is one of the actions that is used as output)
inputting, by the processor, current state information of the target into the neural network to obtain current action information for the target; ([0179]; “Working in discrete time intervals, at time t, the current state $s_t \in S$ may be observed, and the policy may be applied to obtain a desired action, $a_t = \pi(s_t)$”, which discloses inputting current state information into the neural network (which is disclosed as the trained system in [0173]) to obtain current action information for the target (desired action))
correcting, by the processor, the current action information to obtain corrected action information that meets a set of constraints ([0200]; “The set of Desires, together with a set of hard constraints that are defined directly based on the sensed state, establish an optimization problem whose solution is the trajectory for the vehicle. The hard constraints may be employed to further increase the safety of the system, and the Desires can be used to provide driving comfort and human-like driving behavior of the system. The trajectory provided as a solution to the optimization problem, in turn, defines the commands that should be provided to the steering, braking, and/or engine actuators in order to accomplish the trajectory”; and [0217]; “Additionally, in some embodiments, an extra layer of safety may be provided by passing the selected actions of the trained system through one or more hard constraints implicated by a particular sensed scene in the environment of the host vehicle”, which discloses, under a broadest reasonable interpretation of the claim language and in view of the 112(b) indefiniteness rejection above, correcting (by passing the actions through a set of hard constraints) selected or current actions minimally to obtain corrected action information that meets a set of constraints; and [0219]; “For example, the machine learning system may be trained using a desired set of constraints as training guidelines and, therefore, the trained system may select an action in response to a sensed navigational state that accounts for and adheres to the limitations of applicable navigational constraints”)
performing a velocity dampening action by the subject device based on the corrected action information for the target to obtain a reward from the target ([0218]; “In some implementations, the learning algorithm is a deep learning algorithm. The desired actions may include at least one action expected to maximize an anticipated reward for a vehicle. While in some cases, the actual action taken by the vehicle may correspond to one of the desired actions, in other cases, the actual action taken may be determined based on the observed state, one or more desired actions, and non-learned, hard constraints (e.g., safety constraints) imposed on the learning navigational engine”; and [0221]; “For example, considering a reward function for which $R(s) = -r$ for trajectories that represent a rare "corner" event to be avoided (e.g., such as an accident), and $R(s) \in [-1, 1]$ for the rest of the trajectories, one goal for the learning system may be to learn to perform an overtake maneuver”; and [0252]; “In some embodiments, the at least one processing device may translate the desired navigation action directly into navigational commands using, for example, control module 805. In other embodiments, however, hard constraints may be applied such that the desired navigational action provided by the driving policy module 803 is tested against various predetermined navigational constraints that may be implicated by the scene and the desired navigational action”, which discloses that the performed action is a navigational action by the subject device or car; and [0180]; “The system may be trained through exposure to various navigational states, having the system apply the policy, providing a reward (based on a reward function designed to reward desirable navigational behavior)”; and [0220]; “At its core, the navigational system may include a learning algorithm based on a policy function that maps an observed state to one or more desired actions. In some implementations, the learning algorithm is a deep learning algorithm. The desired actions may include at least one action expected to maximize an anticipated reward for a vehicle”; and [0220]; “As discussed, the reinforcement learning objective by policy may be optimized through stochastic gradient ascent. The objective (e.g., the expected reward) may be defined as $\mathbb{E}_{s \sim P_6} R(s)$”).
Shalev-Shwartz fails to explicitly disclose but Englard discloses including a velocity constraint, wherein when the subject device and the target are closer than a fixed predetermined influence distance, the velocity constraint is activated on the subject device to maintain a fixed predetermined shorter distance between the subject device and the target than the fixed predetermined influence distance using velocity control to avoid a subject device to target collision; performing a velocity dampening action ([0223]; “At block 990, values of one or more dependent variables in the objective equation are determined by solving the objective equation (with the set independent variable values plugged into the terms) subject to a set of constraints. The dependent variables may correspond to any suitable type(s) of planned movement for the autonomous vehicle, such as changes to specific operational parameters of the vehicle (e.g., speed, braking force, or steering direction) or, in some embodiments, changes to the desired position and heading of the vehicle that may later be converted to specific operational parameter” (emphasis added), which discloses the constraints in the form of a velocity or speed constraint; and [0163]; “The optimizer 744 then solves for one or more dependent variables of the objective equation, by solving the objective equation subject to a set of constraints. The term value generator 740 and optimizer 744 may collectively be viewed as an MPC motion planner of the SDCA 700. The set of constraints may include (i) one or more constraints that are determined using a physical model of the autonomous vehicle and/or (ii) one or more constraints that are determined from driving decisions made by one or more human drivers . . .
A physical model of the autonomous vehicle may include a number of parameters that affect how the vehicle operates, such as for example: the vehicle dimensions, shape, and/or weight; the number, size, and location of the tires; the forces on the tires; limits on the vehicle (e.g., acceleration, braking, and/or steering limits);” (emphasis added), which further discloses the velocity constraint; and [0055]; “For example, one or more of the SDCAs 104 may output specific velocity and direction parameters (e.g., absolute speed and direction, or changes from current speed and direction), others may output allowed ranges of velocity and direction parameters, and others may output disallowed ranges of velocity and direction parameter”; and [0221]; “The objective equation may have terms that each correspond to a different one of a number of driving objectives/goals over a finite time horizon (e.g., eight time steps each 0.5 seconds apart, or ten time steps each 0.25 seconds apart, etc.). For example, a first term may reflect the objective of staying at least some predetermined distance (e.g., 5 m, 10 m, etc.) away from a particular vehicle (e.g., a specific vehicle that was identified at block 984), a second term may reflect the objective of staying at least some predetermined distance (e.g., 10 m, 20 m, etc.) away from another particular vehicle that is behaving erratically, a third term may reflect the objective of staying at least some predetermined distance (e.g., 0.25 m, 0.5 m, etc.) 
away from any observed lane markings, a fourth term may reflect the objective of staying under two miles per hour over the speed limit, and so on” (emphasis added), which discloses, under a broadest reasonable interpretation of the claim language, the fixed predetermined influence distance (a predetermined distance such as 20 m away from another particular vehicle that is behaving erratically), and the velocity constraint is activated on the subject device to maintain a fixed predetermined shorter distance (a predetermined distance such as 5 m away from a particular vehicle, 5 m being shorter than 20 m in the cited example) to avoid a subject (driving vehicle) to target (other vehicle on the road) collision; and Figure 9, Element 992 and Figure 16, Element 926;  the elements in the figures disclose the activation of the velocity constraint on the vehicle)
performing a velocity dampening action (Figure 9, Element 992 and Figure 16, Element 926 and [0055]).
Shalev-Shwartz and Englard are analogous art because both are concerned with intelligent vehicle navigation.  Before the effective filing date of the claimed invention, it would have been obvious to one skilled in intelligent vehicle navigation to combine the velocity constraint and dampening action of Englard with the method of Shalev-Shwartz to yield the predictable result of correcting, by the processor, the current action information to obtain corrected action information that meets a set of constraints including a velocity constraint, wherein when the subject device and the target are closer than a fixed predetermined influence distance, the velocity constraint is activated on the subject device to maintain a fixed predetermined shorter distance between the subject device and the target than the fixed predetermined influence distance using velocity control to avoid a subject device to target collision; and performing a velocity dampening action by the subject device based on the corrected action information for the target and the velocity constraint to obtain a reward from the target. The motivation for doing so would be to implement a self-driving control architecture for controlling an autonomous vehicle (Englard; Abstract).
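
For illustration only, the claimed correction step with a distance-activated velocity constraint can be sketched as follows. The sketch is not taken from Shalev-Shwartz, Englard, or the application; the class name, function name, linear damping rule, and gain are all assumptions, and the 20 m and 5 m values merely echo the example distances quoted from Englard [0221].

```python
from dataclasses import dataclass


@dataclass
class VelocityConstraint:
    # All fields are hypothetical parameters chosen for this sketch.
    influence_distance: float  # constraint is active only inside this range (m)
    min_distance: float        # shorter distance to be maintained (m)
    gain: float = 1.0          # assumed damping gain (1/s)


def correct_velocity(proposed: float, distance: float,
                     c: VelocityConstraint) -> float:
    """Return a closing speed (m/s, positive = approaching the target)
    that satisfies the constraint, changing the proposal only if needed."""
    if distance >= c.influence_distance:
        return proposed  # outside the influence distance: constraint inactive
    # Assumed linear damper: the allowed closing speed shrinks to zero at
    # min_distance and becomes negative (back away) inside it.
    limit = c.gain * (distance - c.min_distance)
    return min(proposed, limit)


constraint = VelocityConstraint(influence_distance=20.0, min_distance=5.0)
far = correct_velocity(5.0, 30.0, constraint)       # unchanged
near = correct_velocity(5.0, 6.0, constraint)       # damped
too_close = correct_velocity(5.0, 4.0, constraint)  # reversed
```

Under these assumptions the proposed action passes through unchanged beyond the influence distance and is clipped only as much as the constraint requires inside it.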


Regarding claim 12, it is a computer program product claim corresponding to the steps of claim 1, and is rejected for the same reasons as claim 1.

Regarding claim 20, it is a system claim corresponding to the steps of claim 1, and is rejected for the same reasons as claim 1.

Regarding claims 2 and 13, the rejections of claims 1 and 12 are incorporated. Shalev-Shwartz fails to explicitly disclose, but Englard discloses, calculating a cost that quantifies how much the current action information violates the set of constraints and ([0152-0153]; “In the example SDCA 600, however, the occupancy grids of the perception signals 608 and prediction signals 622 are processed by a cost map generator 640 that outputs cost maps 644.  . . . Unlike occupancy grids, however, the cells of a cost map may specify numerical values representing a “cost” of the autonomous vehicle occupying certain positions at a given point in time. A higher cost may correspond to a less desirable (e.g., riskier) location for the vehicle to be in. If another vehicle is immediately in front of the autonomous vehicle, for example, cells of a current cost map that are immediately behind the leading vehicle may be associated with a high cost, while cells that trail the leading vehicle by a larger distance may be associated with lower costs”, which discloses calculating a cost that quantifies how much the current action information violates a set of constraints, such as a distance to a target; and [0155-0159]; the paragraphs further disclose calculating a cost that quantifies how much the current action information violates the set of constraints)
updating the neural network using the cost, the reward, and a relation between the current action information and the corrected action information ([0079]; “In other embodiments, reinforcement learning is used to train the neural network 144 to select particular ones of the SDCAs 104 in particular conditions and/or situations. With reinforcement learning, at each of a number of different times (e.g., periodically, or on another suitable time basis), the neural network 144 observes the candidate decisions 106, decides to take an action (e.g., select a particular candidate decision), and potentially receives or recognizes a “reward” based on “results” of that action. Generally, the neural network 144 seeks to learn a mapping of states to actions (e.g., a mapping of candidate decision sets to final decisions) that maximizes the rewards over some suitable time interval or intervals.” (emphasis added); and [0080]; and [0198]; “In some embodiments, the selection at block 924 is made by an “arbitration” machine learning model that is trained to dynamically select from among the candidate decisions of different SDCAs based on observed or expected circumstances of the autonomous vehicle. The arbitration model may be trained using reinforcement learning, for example, using rewards for avoiding safety violations (e.g., crashes, and/or disobeying rules of the road, etc.), rewards for executing a particular driving style (e.g., aggressive/fast, or smooth with low G-force levels, etc.), and/or any other suitable type or types of rewards” (emphasis added); and Figure 18, Element 968; and Claims 1 and 2).
The motivation to combine Shalev-Shwartz and Englard is the same as discussed above with respect to claim 1.

Regarding claims 3 and 14, the rejections of claims 1, 2, 12, and 13 are incorporated and Shalev-Shwartz further discloses wherein said updating step is performed by a reinforcement learning algorithm that uses states, actions and rewards ([0180]; “The system may be trained through exposure to various navigational states, having the system apply the policy, providing a reward (based on a reward function designed to reward desirable navigational behavior). Based on the reward feedback, the system may "learn" the policy and becomes trained in producing desirable navigational actions. For example, the learning system may observe the current state $s_t \in S$ and decide on an action $a_t \in A$ based on a policy $\pi: S \to (A)$. Based on the decided action (and implementation of the action), the environment moves to the next state $s_{t+1} \in S$ for observation by the learning system. For each action developed in response to the observed state, the feedback to the learning system is a reward signal $r_1, r_2$.”).

Regarding claims 4 and 15, the rejections of claims 1 and 12 are incorporated and Shalev-Shwartz further discloses wherein the reward is based on a metric selected from a group consisting of proximity of the subject device to the target and collision avoidance by the subject device ([0221]; “For example, considering a reward function for which $R(s) = -r$ for trajectories that represent a rare "corner" event to be avoided (e.g., such as an accident), and $R(s) \in [-1, 1]$ for the rest of the trajectories, one goal for the learning system may be to learn to perform an overtake maneuver. Normally, in an accident free trajectory, R(s) would reward successful, smooth, takeovers and penalize staying in a lane without completing the takeover--hence the range [-1, 1]. If a sequence, S, represents an accident, the reward, -r, should provide a sufficiently high penalty to discourage such an occurrence”, which discloses that the reward is based on a metric selected from collision avoidance (a sequence that represents an accident) by the subject device).
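
For illustration only, the reward structure quoted from Shalev-Shwartz [0221] (a large penalty of -r for an accident trajectory, and a value within [-1, 1] otherwise) can be sketched as follows. The function name, the `maneuver_score` input, and the default penalty of 100 are assumptions introduced for this sketch; the reference does not state a specific value of r or how the in-range score is computed.

```python
def trajectory_reward(is_accident: bool, maneuver_score: float,
                      r: float = 100.0) -> float:
    """Reward for one trajectory.

    is_accident    -- True for the rare "corner" event to be avoided
    maneuver_score -- hypothetical quality score for the maneuver
    r              -- assumed accident penalty magnitude (not from the reference)
    """
    if is_accident:
        return -r  # large penalty, outside the normal range
    # Accident-free trajectories are confined to the range [-1, 1].
    return max(-1.0, min(1.0, maneuver_score))


accident = trajectory_reward(True, 0.5)
smooth_overtake = trajectory_reward(False, 2.0)
stayed_in_lane = trajectory_reward(False, -0.3)
```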

Regarding claims 6 and 17, the rejections of claims 1 and 12 are incorporated and Shalev-Shwartz further discloses wherein the subject device comprises a processor-based controllable physical object ([0177]; “Based on the output from the driving policy module 803, control module 805, which may also be implemented using processing unit 110, may develop control instructions for one or more actuators or controlled devices associated with the host vehicle”).

Regarding claims 8 and 18, the rejections of claims 1 and 12 are incorporated and Shalev-Shwartz further discloses wherein, during a training stage, the neural network is trained to learn initially from uncorrected actions and discounted rewards and subsequently learn from corrected actions and full rewards in order to optimize neural network performance during an inference stage ([0181-0182]; “It is usually assumed that at time t, there is a reward function $r_t$ which measures the instantaneous quality of being at state $s_t$ and taking action $a_t$. However, taking the action $a_t$ at time t affects the environment and therefore affects the value of the future states. As a result, when deciding on what action to take, not only should the current reward be taken into account, but future rewards should also be considered. In some instances the system should take a certain action, even though it is associated with a reward lower than another available option, when the system determines that in the future a greater reward may be realized if the lower reward option is taken now . . . Instead of restricting the time horizon to T, the future rewards may be discounted”, which discloses the discounted rewards at an initial time and subsequently learning from corrected actions through reinforcement learning in order to optimize network performance; and [0296]; “RL may be performed in a sequence of consecutive rounds. At round t, the planner (a.k.a. the agent or driving policy module 803) may observe a state, $s_t \in S$, which represents the agent as well as the environment. It then should decide on an action $a_t \in A$. After performing the action, the agent receives an immediate reward, $r_t \in \mathbb{R}$, and is moved to a new state, $s_{t+1}$. . . . The goal of the planner is to maximize the cumulative reward (maybe up to a time horizon or a discounted sum of future rewards). To do so, the planner may rely on a policy, $\pi: S \to A$, which maps a state into an action”, the maximizing of the cumulative reward being optimizing the neural network performance.  This is all done during the training stage for a neural network that uses reinforcement learning as discussed above).
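
For illustration only, the "discounted sum of future rewards" referred to in the quoted passage of Shalev-Shwartz [0296] can be sketched as follows. The function name and the discount factor value are assumptions; the reference does not specify a particular discount factor.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float = 0.9) -> float:
    """Sum of rewards r_0 + gamma*r_1 + gamma^2*r_2 + ...

    gamma is a hypothetical discount factor in [0, 1]; values closer to 0
    weight immediate rewards more heavily than future rewards.
    """
    total = 0.0
    # Accumulate from the last reward backwards so each earlier step
    # multiplies everything after it by gamma exactly once.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


example = discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25
```

With gamma set to 1.0 the same function returns the undiscounted cumulative reward over the listed steps.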

Regarding claims 9 and 19, the rejections of claims 1, 8, 12, and 18 are incorporated and Shalev-Shwartz further discloses performing reward evolution based on collision penalization to transition from the discounted rewards to the full reward ([0181-0182]; “It is usually assumed that at time t, there is a reward function $r_t$ which measures the instantaneous quality of being at state $s_t$ and taking action $a_t$. However, taking the action $a_t$ at time t affects the environment and therefore affects the value of the future states. As a result, when deciding on what action to take, not only should the current reward be taken into account, but future rewards should also be considered. In some instances the system should take a certain action, even though it is associated with a reward lower than another available option, when the system determines that in the future a greater reward may be realized if the lower reward option is taken now . . . Instead of restricting the time horizon to T, the future rewards may be discounted”, which discloses the reward evolution or optimization; and [0296]; “RL may be performed in a sequence of consecutive rounds. At round t, the planner (a.k.a. the agent or driving policy module 803) may observe a state, $s_t \in S$, which represents the agent as well as the environment. It then should decide on an action $a_t \in A$. After performing the action, the agent receives an immediate reward, $r_t \in \mathbb{R}$, and is moved to a new state, $s_{t+1}$. . . . The goal of the planner is to maximize the cumulative reward (maybe up to a time horizon or a discounted sum of future rewards). To do so, the planner may rely on a policy, $\pi: S \to A$, which maps a state into an action”, which discloses the reward evolution or optimization).

Regarding claim 10, the rejection of claim 1 is incorporated and Shalev-Shwartz further discloses wherein the set of constraints comprise safety constraints ([0200]; “The hard constraints may be employed to further increase the safety of the system”; and [0209]; “Nevertheless, as a redundant safety measure, hard constraints may be applied to the output of driving policy module 803 even where driving policy module 803 has been trained to account for predetermined hard constraints”).




Claims 5 and 16 are rejected under 35 U.S.C. § 103 as being obvious over Shalev-Shwartz in view of Englard and further in view of Wolf et al. (Wolf et al., “Learning How to Drive in a Real World Simulation with Deep Q-Networks”, June 14, 2017, 2017 IEEE Intelligent Vehicles Symposium (IV), pp. 244-250, hereinafter “Wolf”).


Regarding claims 5 and 16, the rejections of claims 1, 4, 12, and 15 are incorporated but Shalev-Shwartz fails to explicitly disclose wherein the proximity of the subject device to the target comprises a first and a second reward basis, the first reward basis comprising providing a reward when a distance from the subject device to the target is less than a first threshold distance, and the second reward basis comprising providing a bonus reward when the distance from the subject device to the target device is less than a second threshold distance, wherein the first threshold distance is greater than the second threshold distance.
Wolf discloses wherein the proximity of the subject device to the target comprises a first and a second reward basis, (Page 247, Column 1; “We construct two different reward functions: a naive distance-based version and an extension adding action-based rewards”, which discloses a first and second reward basis)
the first reward basis comprising providing a reward when a distance from the subject device to the target is less than a first threshold distance, and (Page 247, Column 1; “Distance-based reward: A straightforward choice is to calculate the reward based on the distance d of the vehicle’s geometric center to the center of the right-hand lane”, which discloses a first reward basis that provides a reward when a distance (from a vehicle to the center of the right hand lane) from the subject or vehicle to the target or center of right hand lane is less than a first threshold distance; and Page 247, Column 2; “within the distance dS to the center line of the lane”, which discloses the threshold distance)
the second reward basis comprising providing a bonus reward when the distance from the subject device to the target device is less than a second threshold distance, (Page 247, Column 2; “To improve the driving experience on straight lane sections and reduce wiggling, we add an additional reward, if the agent chooses to drive straight on those sections while being within the distance dS to the center line of the lane and the current φ is below the threshold φS (defined as S). In our experiments we set dS = 0.16 and φS = 0.02”, the additional reward being, under a broadest reasonable interpretation of the claim language in view of no explicit definition in the specification as to what constitutes a “distance”, a bonus reward when the distance from the subject device or vehicle is less than a second threshold angular distance (φS = 0.02))
wherein the first threshold distance is greater than the second threshold distance (Page 247, Column 2; “dS = 0.16 and φS = 0.02”, where dS = 0.16 (first threshold distance) is greater than φS = 0.02 (second threshold distance)).
Shalev-Shwartz, Englard, and Wolf are analogous art because all are concerned with intelligent vehicle navigation.  Before the effective filing date of the claimed invention, it would have been obvious to one skilled in intelligent vehicle navigation to combine the first and second rewards and distance thresholds of Wolf with the method of Shalev-Shwartz and Englard to yield the predictable result of wherein the proximity of the subject device to the target comprises a first and a second reward basis, the first reward basis comprising providing a reward when a distance from the subject device to the target is less than a first threshold distance, and the second reward basis comprising providing a bonus reward when the distance from the subject device to the target device is less than a second threshold distance, wherein the first threshold distance is greater than the second threshold distance. The motivation for doing so would be to improve the overall driving behavior of the vehicle agent (Wolf; Abstract).
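
For illustration only, the two-threshold proximity reward recited in claims 5 and 16 can be sketched as follows. This is a reading of the claim language alone, not the reward function of Wolf or of the application; the function name, reward magnitudes, and threshold values are all assumptions.

```python
def proximity_reward(distance: float, first_threshold: float,
                     second_threshold: float, base: float = 1.0,
                     bonus: float = 0.5) -> float:
    """Reward from proximity to the target under two nested thresholds.

    base and bonus are hypothetical magnitudes chosen for this sketch.
    """
    if first_threshold <= second_threshold:
        # The claim requires the first threshold to be the larger one.
        raise ValueError("first_threshold must exceed second_threshold")
    reward = 0.0
    if distance < first_threshold:
        reward += base   # first reward basis
    if distance < second_threshold:
        reward += bonus  # second reward basis: bonus on top of the base
    return reward


outside = proximity_reward(3.0, first_threshold=2.0, second_threshold=0.5)
inside_first = proximity_reward(1.0, first_threshold=2.0, second_threshold=0.5)
inside_both = proximity_reward(0.2, first_threshold=2.0, second_threshold=0.5)
```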

Claim 7 is rejected under 35 U.S.C. § 103 as being obvious over Shalev-Shwartz in view of Englard and Achiam et al. (Achiam et al., “Constrained Policy Optimization”, 2017, Proceedings of the 34th International Conference on Machine Learning, pp. 1-10, hereinafter “Achiam”).

Regarding claim 7, the rejection of claim 1 is incorporated but Shalev-Shwartz fails to explicitly disclose wherein the subject device is a processor-based robot.
Achiam discloses wherein the subject device is a processor-based robot (Page 2, Column 1; “In our experiments, we show that CPO can train neural network policies with thousands of parameters on high dimensional simulated robot locomotion tasks to maximize rewards while successfully enforcing constraints”, which discloses that the subject device is a processor-based robot; and Page 6, Column 2; “We consider two tasks, and train multiple different agents (robots) for each task”).
Shalev-Shwartz, Englard, and Achiam are analogous art because all are concerned with intelligent vehicle navigation.  Before the effective filing date of the claimed invention, it would have been obvious to one skilled in intelligent vehicle navigation to combine the robot of Achiam with the method of Shalev-Shwartz and Englard to yield the predictable result of wherein the subject device is a processor-based robot. The motivation for doing so would be to improve robot locomotion tasks where the agent must satisfy constraints motivated by safety (Achiam; Abstract).


Claim 11 is rejected under 35 U.S.C. § 103 as being obvious over Shalev-Shwartz in view of Englard and Ozaki et al. (US 20180056520 A1, hereinafter “Ozaki”).

Regarding claim 11, the rejection of claim 1 is incorporated and Shalev-Shwartz further discloses provide a negative reward when a collision occurs between the subject device and target ([0221]; “Normally, in an accident free trajectory, R(s) would reward successful, smooth, takeovers and penalize staying in a lane without completing the takeover--hence the range [-1, 1]. If a sequence, S, represents an accident, the reward, -r, should provide a sufficiently high penalty to discourage such an occurrence”, which discloses a negative reward when a collision occurs).
Shalev-Shwartz fails to explicitly disclose interrupting the action performed by the subject device.
Ozaki discloses interrupting the action performed by the subject device . . . when a collision occurs between the subject device and target ([0016]; “The robot control unit may stop the robot when the tactile sensor has detected a slight collision. The machine learning device may be set in such a way as to stop performing further learning of a motion that has been learned by a certain point in time”).
Shalev-Shwartz, Englard, and Ozaki are analogous art because all are concerned with intelligent object navigation.  Before the effective filing date of the claimed invention, it would have been obvious to one skilled in intelligent object navigation to combine the action interrupt of Ozaki with the method of Shalev-Shwartz and Englard to yield the predictable result of interrupting the action performed by the subject device . . . when a collision occurs between the subject device and target. The motivation for doing so would be to stop performing further learning of a motion that has been learned by a certain point in time (Ozaki; [0016]).


Response to Arguments

Applicant’s arguments and amendments, filed on 12/13/2021, with respect to the 35 USC § 103 rejection of independent claims 1, 12, and 20 have been considered but are moot because the arguments do not apply to any of the references being used in the current rejection to reject independent claims 1, 12, and 20. Shalev-Shwartz and Englard are now being used to render claims 1, 12, and 20 obvious under 35 USC § 103.

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Brent Hoover whose telephone number is (303)297-4403.  The examiner can normally be reached on Monday - Friday 9-5 MST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Abdullah Kawsar, can be reached on 571-270-3169.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
 
/BRENT JOHNSTON HOOVER/Examiner, Art Unit 2127