DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Status of the Claims
This action is in response to the applicant’s amendment of July 11, 2022. 
Claims 1-2, 4-7, and 9-13 are pending and have been considered as follows.

Response to Arguments
Applicant’s arguments/amendments with respect to the rejections of claims under 35 USC §112(b) and 35 USC §112(d) have been fully considered and are persuasive.  Therefore, the rejections of claims under 35 USC §112(b) and 35 USC §112(d) have been withdrawn.
Applicant’s arguments/amendments with respect to the rejection of claims under 35 USC §101 have been fully considered and are persuasive.  Therefore, the rejection of claims under 35 USC §101 has been withdrawn.

EXAMINER’S AMENDMENT
An examiner’s amendment to the record appears below. Should the changes and/or additions be unacceptable to applicant, an amendment may be filed as provided by 37 CFR 1.312. To ensure consideration of such an amendment, it MUST be submitted no later than the payment of the issue fee.
Authorization for this examiner’s amendment was given in an interview with Greg Turocy and Sunil Colaco on July 21, 2022.
The application has been amended as follows: 
Please amend the claims as follows: 
1. (Currently Amended) A learning device comprising: 
a processor configured to: 
generate first information based on a traveling target and information representing a surrounding object, the first information being a target trajectory along which a vehicle will automatically travel in the future, the target trajectory including a plurality of trajectory points; 
derive a plurality of individual rewards obtained by evaluating each of a plurality of pieces of information, wherein the plurality of pieces of information represent feedback information obtained from a simulator or an actual environment by inputting second information based on the plurality of trajectory points to the simulator or the actual environment, and derive a reward of the plurality of individual rewards for performing [[the]]an action of the vehicle; 
perform reinforcement learning that optimizes the reward of the plurality of individual rewards based on the action of the vehicle; and 
derive the individual reward based on an application of a plurality of reward functions of which distribution shapes of the individual rewards for a relation to a target value are different from each other to at least a part of the plurality of pieces of information to be evaluated, 
wherein the plurality of reward functions include a reward function that returns a predetermined value in a case in which an input value matches the target value and returns a smaller value as an absolute value of a difference between the input value and the target value increases, however, a degree to which the individual reward for a difference between the input value on a side where the input value exceeds the target value is reduced and the target value is greater than a degree to which the individual reward for a difference between the input value on a side where the input value is less than the target value and the target value is reduced.

9. (Currently Amended) A learning device comprising: 
a processor configured to: 
generate first information based on a traveling target and information representing a surrounding object, the first information being a target trajectory along which a vehicle will automatically travel in the future, the target trajectory including a plurality of trajectory points; 
derive a plurality of individual rewards obtained by evaluating each of a plurality of pieces of information, wherein the plurality of pieces of information represent feedback information obtained from a simulator or an actual environment by inputting second information based on the plurality of trajectory points to the simulator or the actual environment, and derive a reward of the plurality of individual rewards for performing [[the]]an action of the vehicle; 
perform reinforcement learning that optimizes the reward of the plurality of individual rewards based on the action of the vehicle; and 
derive the individual reward based on an application of a plurality of reward functions of which distribution shapes of the individual rewards for a relation to a target value are different from each other to at least a part of the plurality of pieces of information to be evaluated, 
wherein the plurality of reward functions include a reward function that returns a predetermined value in a case in which an input value is equal to or greater than the target value and returns a smaller value as an absolute value of a difference between the input value and the target value increases in a case in which the input value is less than the target value.

10. (Currently Amended) A learning device comprising: 
a processor configured to: 
generate first information based on a traveling target and information representing a surrounding object, the first information being a target trajectory along which a vehicle will automatically travel in the future, the target trajectory including a plurality of trajectory points; 
derive a plurality of individual rewards obtained by evaluating each of a plurality of pieces of information, wherein the plurality of pieces of information represent feedback information obtained from a simulator or an actual environment by inputting second information based on the plurality of trajectory points to the simulator or the actual environment, and derive a reward of the plurality of individual rewards for performing [[the]]an action of the vehicle; 
perform reinforcement learning that optimizes the reward of the plurality of individual rewards based on the action of the vehicle; and 
derive the individual reward based on an application of a plurality of reward functions of which distribution shapes of the individual rewards for a relation to a target value are different from each other to at least a part of the plurality of pieces of information to be evaluated, 
wherein the plurality of reward functions include a reward function that returns a predetermined value in a case in which an input value is equal to or less than the target value and returns a smaller value as an absolute value of a difference between the input value and the target value increases in a case in which the input value is greater than the target value.

11. (Currently Amended) A learning device comprising: 
a processor configured to: 
generate first information based on a traveling target and information representing a surrounding object, the first information being a target trajectory along which a vehicle will automatically travel in the future, the target trajectory including a plurality of trajectory points; 
derive a plurality of individual rewards obtained by evaluating each of a plurality of pieces of information, wherein the plurality of pieces of information represent feedback information obtained from a simulator or an actual environment by inputting second information based on the plurality of trajectory points to the simulator or the actual environment, and derive a reward of the plurality of individual rewards for performing [[the]]an action of the vehicle; 
perform reinforcement learning that optimizes the reward of the plurality of individual rewards based on the action of the vehicle; and 
derive the individual reward based on an application of a plurality of reward functions of which distribution shapes of the individual rewards for a relation to a target value are different from each other to at least a part of the plurality of pieces of information to be evaluated, 
wherein the plurality of reward functions include a reward function that returns an example of a predetermined value in a case in which an input value is within a target range and returns a smaller value as an absolute value of a difference between the input value and an upper limit or a lower limit of the target range increases.

12. (Currently Amended) The learning device according to claim 1, 
wherein the plurality of reward functions include the reward function that returns a larger value as the input value approaches any of two or more target values.
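For illustration only, the multi-target shape recited in claim 12 can be sketched as follows. The claim does not specify a functional form; the Gaussian-like decay around the nearest target, the function name multi_target_reward, and the parameters peak and sigma are assumptions made for this sketch and are not taken from the application.

```python
import math


def multi_target_reward(value, targets, peak=1.0, sigma=1.0):
    """Return a larger reward as `value` approaches any one of `targets`.

    Hypothetical sketch: the Gaussian-like decay and the parameters
    `peak` and `sigma` are assumptions, not part of the claim language.
    """
    if len(targets) < 2:
        raise ValueError("expected two or more target values")
    # Only the distance to the closest target matters, so the reward
    # has one peak per target value.
    nearest = min(abs(value - t) for t in targets)
    return peak * math.exp(-(nearest ** 2) / (2.0 * sigma ** 2))
```

With targets of 0 and 5, an input of 4.5 scores higher than an input of 2.5 because it lies closer to one of the two targets.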

Allowable Subject Matter
Claims 1-2, 4-7, and 9-13 are pending and allowed.
The following is an examiner’s statement of reasons for allowance: 
The closest prior art of Redding et al. (US 2018/0089563 A1) teaches a behavior planner for a vehicle that generates a plurality of conditional action sequences of the vehicle using a tree search algorithm and heuristics obtained from one or more machine learning models. Each sequence corresponds to a sequence of anticipated states of the vehicle. At least some of the action sequences are provided to a motion selector of the vehicle. The motion selector generates motion-control directives based on the received conditional action sequences and on data received from one or more sensors of the vehicle, and transmits the directives to control subsystems of the vehicle.
Further, KIMURA (US 2021/0018882 A1) teaches an information processing device including an action value calculation unit configured to calculate an action value that determines behavior of an operation unit, wherein the action value calculation unit dynamically calculates, based on an acquired purpose change factor and a plurality of first action values learned based on rewards different from each other, a second action value to be input to the operation unit. KIMURA also teaches an information processing device including a feedback unit configured to determine, based on an operation result of an operation unit that performs dynamic behavior based on a plurality of action values learned based on rewards different from each other, excess and insufficiency related to the action values, and to control information notification related to the excess and insufficiency.
Furthermore, Yao et al. (US 2019/0101917 A1) teaches a method, device and system of prediction of a state of an object in the environment using an action model of a neural network. In accordance with one aspect, a control system for an object comprises a processor, a plurality of sensors coupled to the processor for sensing a current state of the object and an environment in which the object is located, and a first neural network coupled to the processor. A plurality of predicted subsequent states of the object in the environment is obtained using an action model, a current state of the object in the environment and a plurality of actions. The action model maps a plurality of states of the object in the environment and a plurality of actions performed by the object for each state to predicted subsequent states of the object in the environment. An action that maximizes a value of a target is determined. The target is based at least on a reward for each of the predicted subsequent states. The determined action is performed. 
With regard to independent claims 1 and 9-11, Redding et al. (US 2018/0089563 A1), KIMURA (US 2021/0018882 A1), and Yao et al. (US 2019/0101917 A1), taken either individually or in combination with each other or other prior art of record, fail to teach or render obvious, in the context of the remaining limitations of the claim(s): 

(With regard to claim 1 and a learning device)

derive the individual reward based on an application of a plurality of reward functions of which distribution shapes of the individual rewards for a relation to a target value are different from each other to at least a part of the plurality of pieces of information to be evaluated, 
wherein the plurality of reward functions include a reward function that returns a predetermined value in a case in which an input value matches the target value and returns a smaller value as an absolute value of a difference between the input value and the target value increases, however, a degree to which the individual reward for a difference between the input value on a side where the input value exceeds the target value is reduced and the target value is greater than a degree to which the individual reward for a difference between the input value on a side where the input value is less than the target value and the target value is reduced. (emphasis added)
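For illustration only, the asymmetric shape recited in the limitation above can be sketched as follows. The claim does not specify a functional form; the Gaussian-like decay, the function name asymmetric_peak_reward, and the parameters peak, sigma_below, and sigma_above are assumptions made for this sketch.

```python
import math


def asymmetric_peak_reward(value, target, peak=1.0,
                           sigma_below=2.0, sigma_above=1.0):
    """Return `peak` at the target and a smaller value away from it.

    Hypothetical sketch: the reward falls off faster when the input
    exceeds the target than when it is below the target, because the
    assumed width `sigma_above` is smaller than `sigma_below`.
    """
    if sigma_above >= sigma_below:
        raise ValueError("sigma_above must be smaller than sigma_below")
    # Pick the decay width for the side of the target the input is on.
    sigma = sigma_above if value > target else sigma_below
    return peak * math.exp(-((value - target) ** 2) / (2.0 * sigma ** 2))
```

For the same absolute difference from the target, an input above the target receives a lower reward than an input below it, which is the asymmetry the limitation describes.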

(With regard to claim 9 and a learning device)

derive the individual reward based on an application of a plurality of reward functions of which distribution shapes of the individual rewards for a relation to a target value are different from each other to at least a part of the plurality of pieces of information to be evaluated, 
wherein the plurality of reward functions include a reward function that returns a predetermined value in a case in which an input value is equal to or greater than the target value and returns a smaller value as an absolute value of a difference between the input value and the target value increases in a case in which the input value is less than the target value. (emphasis added)

(With regard to claim 10 and a learning device)

derive the individual reward based on an application of a plurality of reward functions of which distribution shapes of the individual rewards for a relation to a target value are different from each other to at least a part of the plurality of pieces of information to be evaluated, 
wherein the plurality of reward functions include a reward function that returns a predetermined value in a case in which an input value is equal to or less than the target value and returns a smaller value as an absolute value of a difference between the input value and the target value increases in a case in which the input value is greater than the target value. (emphasis added)
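For illustration only, the one-sided shapes recited for claims 9 and 10 can be sketched as a pair of mirrored functions. Neither claim specifies a functional form; the exponential decay, the function names at_least_reward and at_most_reward, and the parameters peak and rate are assumptions made for this sketch.

```python
import math


def at_least_reward(value, target, peak=1.0, rate=0.5):
    """Claim 9 shape (sketch): full reward at or above the target.

    Below the target, the reward shrinks as the shortfall grows.
    The exponential decay and `rate` are assumptions.
    """
    if value >= target:
        return peak
    return peak * math.exp(-rate * (target - value))


def at_most_reward(value, target, peak=1.0, rate=0.5):
    """Claim 10 shape (sketch): full reward at or below the target.

    Above the target, the reward shrinks as the excess grows.
    The exponential decay and `rate` are assumptions.
    """
    if value <= target:
        return peak
    return peak * math.exp(-rate * (value - target))
```

The two functions differ only in which side of the target value is left unpenalized.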

(With regard to claim 11 and a learning device)

derive the individual reward based on an application of a plurality of reward functions of which distribution shapes of the individual rewards for a relation to a target value are different from each other to at least a part of the plurality of pieces of information to be evaluated, 
wherein the plurality of reward functions include a reward function that returns an example of a predetermined value in a case in which an input value is within a target range and returns a smaller value as an absolute value of a difference between the input value and an upper limit or a lower limit of the target range increases. (emphasis added)
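For illustration only, the target-range shape recited in the limitation above can be sketched as follows. The claim does not specify a functional form; the exponential decay outside the range, the function name in_range_reward, and the parameters peak and rate are assumptions made for this sketch.

```python
import math


def in_range_reward(value, lower, upper, peak=1.0, rate=0.5):
    """Return `peak` inside [lower, upper] and a smaller value outside.

    Hypothetical sketch: outside the range, the reward decays with the
    distance to the nearer limit. The decay form is an assumption.
    """
    if lower > upper:
        raise ValueError("lower must not exceed upper")
    if lower <= value <= upper:
        return peak
    # Distance to the lower limit when below the range, otherwise to
    # the upper limit.
    distance = lower - value if value < lower else value - upper
    return peak * math.exp(-rate * distance)
```

Any input within the target range receives the same predetermined value, and the reward drops only once the input leaves the range.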


Any comments considered necessary by applicant must be submitted no later than the payment of the issue fee and, to avoid processing delays, should preferably accompany the issue fee.  Such submissions should be clearly labeled “Comments on Statement of Reasons for Allowance.”

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to KYLE S. PARK whose telephone number is (571)272-3151. The examiner can normally be reached Mon-Thurs 8:00AM-5:00PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Anne M ANTONUCCI, can be reached at (313)446-6519. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/K.S.P./Examiner, Art Unit 3666      

/ANNE MARIE ANTONUCCI/Supervisory Patent Examiner, Art Unit 3666