DETAILED ACTION
This Non-Final Office Action is in response to claims filed 2/20/2020.
Claims 1-20 are pending.
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 3/26/2020 has been considered by the examiner.
Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 3, 4, and 10-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to a judicial exception (i.e., a law of nature, a natural phenomenon, or an abstract idea) without significantly more.
On January 7, 2019, the USPTO released new examination guidelines for determining whether a claim is directed to non-statutory subject matter. According to the guidelines, a claim is directed to non-statutory subject matter if: (a) it does not fall within one of the four statutory categories of invention; or (b) it meets a three-prong test, under which the claim (1) recites a judicial exception, e.g., an abstract idea, (2) does not integrate the judicial exception into a practical application, and (3) does not recite additional elements that provide significantly more than the recited judicial exception.
Claims 3 and 20 are directed towards a method and a system, respectively. Therefore, they fall within one of the four statutory categories of invention. However, under the three-prong test, the claims are directed to a judicial exception without significantly more, as explained below.

	With regard to the first prong, does the claim recite a judicial exception, the guidelines provide three groupings of subject matter that are considered abstract ideas: 
Mathematical concepts – mathematical relationships, mathematical formulas or equations, mathematical calculations;
Certain methods of organizing human activity – fundamental economic principles or practices (including hedging, insurance, mitigating risk); commercial or legal interactions (including agreements in the form of contracts; legal obligations; advertising, marketing or sales activities or behaviors; business relations); managing personal behavior or relationships or interactions between people (including social activities, teaching, and following rules or instructions); and
Mental processes – concepts performed in the human mind (including an observation, evaluation, judgment, opinion).
Applicant’s claims 3 and 20 are directed toward the abstract idea of updating current parameter values based on a difference between a first estimated return and a second estimated return, which comprises mental processes. Specifically, the limitation of receiving data indicative of one or more experience tuples each comprising a first observation characterizing a first state of an environment, a first action in dependence on the first observation, a reward associated with the performance of the first action, and a second observation characterizing a second state of the environment following the performance of the first action, as drafted, is a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components. That is, other than reciting “by a control agent,” nothing in the claim element precludes the step from practically being performed in the mind. For example, but for the “by a control agent” language, “receiving” in the context of this claim encompasses the user manually interpreting or visually experiencing the state of an environment, an action, a reward, and a following action.
Similarly, the limitations of processing the first observation to determine a first estimated return for the first action following the first observation and processing the second observation to determine a set of candidate estimated returns, where a greatest of the set of candidate estimated returns is determined, as drafted, are processes that, under their broadest reasonable interpretation, cover performance of the limitations in the mind but for the recitation of generic computer components. For example, but for the “using a value estimator” and “using a target value estimator” language, “processing” in the context of this claim encompasses the user thinking of values for the actions, so as to mentally evaluate the greatest value.
The limitation of determining a terminal reward associated with a triggering of a failure condition in the second state of the environment, as drafted, is a process that, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components. For example, “determining” in the context of this claim encompasses the user thinking of a value when visually or physically detecting a condition that may be interpreted as a failure. 
The limitations of determining a second estimated return for the first action following the first observation and updating the current parameter values in dependence upon a difference between the first estimated return and the second estimated return, as drafted, are processes that, under their broadest reasonable interpretation, cover performance of the limitations in the mind but for the recitation of generic computer components. For example, but for the “by an adversarial stopping agent” language, “determining” and “updating” in the context of this claim encompass the user thinking about a value and comparing it to a previous value for updating another value.
If a claim limitation, under its broadest reasonable interpretation, covers performance of the limitation in the mind but for the recitation of generic computer components, then it falls within the “Mental Processes” grouping of abstract ideas. Accordingly, the claim recites an abstract idea.

With regard to the second prong, whether the abstract idea is integrated into a practical application, the guidelines provide the following exemplary considerations that are indicative that an additional element (or combination of elements) may have integrated the judicial exception into a practical application:
an additional element reflects an improvement in the functioning of a computer, or an improvement to other technology or technical field;
an additional element that applies or uses a judicial exception to effect a particular treatment or prophylaxis for a disease or medical condition; 
an additional element implements a judicial exception with, or uses a judicial exception in conjunction with, a particular machine or manufacture that is integral to the claim;
an additional element effects a transformation or reduction of a particular article to a different state or thing; and
an additional element applies or uses the judicial exception in some other meaningful way beyond generally linking the use of the judicial exception to a particular technological environment, such that the claim as a whole is more than a drafting effort designed to monopolize the exception.
It is clear that Applicant’s claims do not comprise any of the above additional elements that, individually or in combination, have integrated the judicial exception into a practical application. There is no improvement in the functioning of a computer, nor are the limitations implemented in a particular machine or manufacture. Although the claim recites the use of agents and estimators, these components are merely recited as generic computer components. There is no transformation or reduction of a particular article to a different state or thing. There are no additional elements that apply or use the judicial exception in some other meaningful way beyond generally linking the use of the judicial exception to a particular technological environment.
While the guidelines further state that the exemplary considerations are not an exhaustive list and that there may be other examples of integrating the exception into a practical application, the guidelines also list examples in which a judicial exception has not been integrated into a practical application:
an additional element merely recites the words “apply it” (or an equivalent) with the judicial exception, or merely includes instructions to implement an abstract idea on a computer, or merely uses a computer as a tool to perform an abstract idea; 
an additional element adds insignificant extra-solution activity to the judicial exception; and 
an additional element does no more than generally link the use of a judicial exception to a particular technological environment or field of use.
Since the abstract idea in Applicant’s claims 3 and 20 is implemented on a computer and there are no further limitations or structural elements that go beyond the computer, it can clearly be seen that the abstract idea of updating current parameter values is merely implemented on a computer. Thus, there is no integration of the abstract idea into a practical application.

With regard to the third prong, whether the claims recite additional elements that provide significantly more than the recited judicial exception, the guidelines specify that the pre-guideline procedure is still in effect. Specifically, examiners should continue to consider whether an additional element or combination of elements:
adds a specific limitation or combination of limitations that are not well-understood, routine, conventional activity in the field, which is indicative that an inventive concept may be present; or  
simply appends well-understood, routine, conventional activities previously known to the industry, specified at a high level of generality, to the judicial exception, which is indicative that an inventive concept may not be present.
Applicant’s claims do not recite additional elements that provide significantly more than the recited judicial exception. Examiner takes official notice that the use of one or more computers to implement mental processes is a well-understood, routine and conventional activity.
Thus, since claims 3 and 20 (a) are directed toward an abstract idea, (b) do not integrate the abstract idea into a practical application, and (c) do not comprise significantly more than the recited abstract idea, they are directed toward non-statutory subject matter.
	Claims 4 and 10-19 do not comprise any further limitations which cause the abstract idea to be integrated into a practical application or recite significantly more than the abstract idea. Therefore, claims 4 and 10-19 are also rejected under 35 USC 101.
Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 3, 5-7, 9, 14, 16, and 18-20 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Zhou et al. (“Deep Reinforcement Learning Based Intelligent Decision Making for Two-player Sequential Game with Uncertain Irrational Player,” 2019, IEEE), hereinafter Zhou.
Claim 3
Zhou discloses the claimed computer-implemented method comprising: 
receiving data indicative of one or more experience tuples each comprising a first observation characterizing a first state of an environment (i.e. co-player’s current state s-), a first action (i.e. action a) performed by a control agent (i.e. robot) in dependence on the first observation, a reward (i.e. reward r) associated with the performance of the first action, and a second observation characterizing a second state of the environment following the performance of the first action (i.e. co-player’s next state s-’) (see at least pages 10-11, sections II(C) and III(A), regarding the conventions with respect to the “experience tuples,” where the actions and rewards of the robot are associated with the co-player/second robot’s state),
for each of the one or more experience tuples: 
processing the first observation, using a value estimator with current parameter values, to determine a first estimated return for the first action following the first observation (see page 11, Algorithm 1, with respect to Qr (si, a1,i, a2,i, μi; θr), where θr represents the parameter of the optimal Q value estimator, as described in section III(A));
processing the second observation, using a target value estimator with an identical architecture to the value estimator, to determine a set of candidate estimated returns, each of the set of candidate estimated returns corresponding to a respective one of a set of candidate second actions following the second observation (see page 11, Algorithm 1, with respect to Qr (s-’, a1’, a2’, μi; θr) being determined for every s-i, s-i’, ri, a1,i, a2,i, μi);
determining a greatest of the set of candidate estimated returns (see page 11, Algorithm 1, with emphasis on lines 18-19, where the maximum or maxmin of Qr (s-’, a1’, a2’, μi; θr) is determined depending on the competitive flag, the maxmin being the highest value that the player can get without knowing the actions of the other player);
determining a terminal reward associated with a triggering of a failure condition in the second state of the environment (see page 11, section III(A), regarding the determination of the cooperation index μ and competitive flag ξ, where a competitive behavior is a result of a mechanical failure or sensor error, as described in definition 1 on page 12);
determining, using the determined terminal reward and the greatest candidate estimated return, a second estimated return for the first action following the first observation (see page 11, Algorithm 1, with respect to ri + V’(s-’, μi; θr), where the cooperation index μ and competitive flag ξ are used to determine V’(s-’, μi; θr) and V’(s-’, μi; θr) is associated with the maximum Qr (s-’, a1, a2, μi; θr), as further described in section III(A)), accounting for an intervention by an adversarial stopping agent arranged to trigger the failure condition when predetermined stopping criteria are satisfied (see definition 1 on page 12, regarding that competitive behavior occurs due to mechanical failure or sensor error, such that cooperation index μ and competitive flag ξ are provided to modify V’(s-’, μi; θr) in the case of a failure, as described on page 11, section III(A) and shown in Algorithm 1); and
updating the current parameter values of the value estimator in dependence upon a difference between the first estimated return and the second estimated return (see page 11, Algorithm 1, with emphasis on lines 19-20, regarding updating θr of the Q value based on the loss function, where the loss function is defined by ri + V’(s-’, μi; θr) - Qr (si, a1,i, a2,i, μi; θr), as additionally described in section III(A)),
wherein, after being sequentially updated in accordance with each of the one or more experience tuples, the current parameter values of the value estimator are trained parameter values (see page 11, Algorithm 1, depicting the sequential update of θr of the optimal Q value estimator indicated by the while and for loops).
The “first estimated return” and “second estimated return” are interpreted in light of equation (3) provided in the specification filed 2/20/2020.
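By way of illustration only, the sequence of limitations mapped above can be sketched in Python as follows. The sketch is not taken from Zhou or from the specification. The function names, the discount factor gamma, the learning rate, the table-based estimator, and in particular the use of min() to model the intervention of the adversarial stopping agent are all assumptions made for readability, and equation (3) of the specification may differ.

```python
# Illustrative sketch only. Names, gamma, learning_rate, and the min() rule
# are assumptions; they are not taken from the claims, the specification, or Zhou.

def second_estimated_return(reward, candidate_returns, terminal_reward, gamma=0.5):
    """Compute a target value for one experience tuple."""
    greatest = max(candidate_returns)  # greatest candidate estimated return
    # Assumed adversary: it triggers the failure condition whenever the
    # terminal reward is lower than the value of continuing.
    return reward + gamma * min(terminal_reward, greatest)


def update_parameter(q_table, key, target, learning_rate=0.5):
    """Move the first estimated return toward the target by a fraction of the difference."""
    first = q_table.get(key, 0.0)  # first estimated return
    q_table[key] = first + learning_rate * (target - first)
    return q_table[key]


q = {("s0", "a0"): 1.0}
target = second_estimated_return(
    reward=1.0, candidate_returns=[2.0, 4.0], terminal_reward=3.0
)
new_value = update_parameter(q, ("s0", "a0"), target)
```

In this hypothetical example the terminal reward (3.0) is lower than the greatest candidate return (4.0), so the assumed stopping agent intervenes and the target is built from the terminal reward.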
Claim 5
Zhou further discloses that the environment is a physical environment and for each of the one or more experience tuples, the first and second observations are made using one or more sensors (see page 14, section C, with respect to Figures 5 and 6, regarding the robots and their OptiTrack motion capture systems).
Claim 6
Zhou further discloses that for each of the one or more experience tuples, the first action is performed using one or more actuators (see page 14, sections B and C, regarding the control of the robots to travel).
Claim 7
Zhou further discloses that 
the control agent is arranged to control an autonomous vehicle; 
the second observation characterizing a second state of the environment is indicative of a current location of the autonomous vehicle; 
the failure condition corresponds to a mechanical failure of a physical component of the autonomous vehicle; and 
the terminal reward associated with the triggering of the failure condition in the second state of the environment depends on the indicated current location of the autonomous vehicle, as described in the rejection of claim 3, where the disclosure of Zhou pertains to cooperative autonomous robots (see abstract) that travel on a planned path and detect the positions of one another, as described in section IV(A) on page 13. It is clear that a failure condition discussed in the rejection of claim 3 would adjust the cooperation index and competitive flag associated with the experimental setup of Zhou.
Claim 9
Zhou further discloses that the control agent is arranged to determine a route for the autonomous vehicle (see page 13, section IV(A), regarding the path planning used by the robots).
Claim 14
Zhou further discloses that the value estimator and the target value estimator are identical, as described in the rejection of claim 3.  
Claim 16
Zhou further discloses that the value estimator comprises a deep neural network with a given architecture; and the target value estimator comprises a deep neural network with the same architecture as the value estimator (see section III(A) on page 11, regarding the Q value as associated with deep neural network Qr). 
Claim 18  
Zhou further discloses
receiving data indicative of a third observation characterizing a third state of the environment; 
processing the third observation, using the value estimator with the trained parameter values, to determine a candidate estimated return for the third observation and each of a set of candidate third actions; and
determining a best action as the candidate third action determined to have the greatest candidate estimated return (see page 11, Algorithm 1, with respect to selecting action a1 associated with the maximum Qr (s-, a1’, a2’; θr), which is repeated in the while loop). It is clear that a second iteration of the while loop of Algorithm 1 would provide for a “third observation” with a “candidate third action.”
Claim 19
Zhou further discloses generating further data indicative of a further experience tuple for further training of the value estimator, wherein generating the further experience tuple comprises: 
selecting a third action to be performed by the control agent in dependence on the third observation; and 
receiving data indicative of a reward associated with the performance of the third action and a fourth observation characterizing a fourth state of the environment following the performance of the third action, 
wherein selecting the third action comprises selecting randomly from the set of candidate third actions with a predetermined probability between zero and one, and otherwise selecting the determined best action (see page 11, Algorithm 1, with respect to selecting action a1 associated with the maximum Qr (s-, a1’, a2’; θr), which is repeated in the while loop). It is clear that a third iteration of the while loop of Algorithm 1 would provide for a “third action” and a “fourth observation.”
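For reference, the action selection recited in claim 19 (a random choice with a predetermined probability, otherwise the best action) corresponds to what is commonly called epsilon-greedy selection. The following Python sketch is a generic illustration under that assumption. The function name, the dictionary representation of the candidate estimated returns, and the rng parameter are hypothetical and are not drawn from Zhou or from the specification.

```python
import random

# Generic epsilon-greedy sketch. All names here are hypothetical.

def select_action(candidate_returns, epsilon, rng=random):
    """Pick an action from a mapping of candidate action -> estimated return."""
    actions = list(candidate_returns)
    if rng.random() < epsilon:
        # With probability epsilon, select randomly from the candidate actions.
        return rng.choice(actions)
    # Otherwise select the action with the greatest candidate estimated return.
    return max(actions, key=candidate_returns.get)


best = select_action({"left": 0.1, "right": 0.9}, epsilon=0.0)
```

With epsilon set to 0.0 the random branch is never taken, so the call above always yields the best action.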
Claim 20
Zhou discloses the claimed data processing system, as described in the rejection of claim 3.
Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1 and 8 are rejected under 35 U.S.C. 103 as being unpatentable over Zhou in view of Official Notice.
Claim 1
Zhou discloses the computer-implemented method, as described in the rejection of claim 3. Zhou further discloses the autonomous vehicle as a ground vehicle (see Figures 5 and 6) and does not disclose embodiments in which the autonomous vehicle is a UAV. However, modifying the autonomous vehicle of Zhou to instead be a UAV would be capable of instant and unquestionable demonstration, given that the experiment conducted in Zhou is to shield the autonomous vehicle from strong winds (see page 13, section IV(A)), which would also be applicable to UAVs, since it is well known in the art that wind affects the flight of UAVs.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the autonomous vehicle of Zhou to be a UAV, in light of Official Notice, with the predictable result of providing a known alternative vehicle that would also benefit from being shielded from strong winds.
Claim 8
Zhou discloses the autonomous vehicle as a ground vehicle (see Figures 5 and 6) and does not disclose embodiments in which the autonomous vehicle is a UAV. However, modifying the autonomous vehicle of Zhou to instead be a UAV would be capable of instant and unquestionable demonstration, given that the experiment conducted in Zhou is to shield the autonomous vehicle from strong winds (see page 13, section IV(A)), which would also be applicable to UAVs, since it is well known in the art that wind affects the flight of UAVs.
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the autonomous vehicle of Zhou to be a UAV, in light of Official Notice, with the predictable result of providing a known alternative vehicle that would also benefit from being shielded from strong winds.
Claim 17 is rejected under 35 U.S.C. 103 as being unpatentable over Zhou in view of Sharma (“A Survey of Reinforcement Learning Techniques,” March 2019, ResearchGate), hereinafter Sharma.
Claim 17
Zhou discloses the value estimator and target value estimator as Deep Q-networks (see page 11, section III(A)) and does not specifically disclose that the value estimator comprises a linear combination of predetermined basis functions and the target value estimator comprises a linear combination of the same predetermined basis functions as the value estimator. However, the Deep Q-learning taught by Zhou is a known type of reinforcement learning and may be reasonably modified to use other types of reinforcement learning, such as Least-Squares Policy Iteration. 
Specifically, Sharma discloses a summary of the different types of reinforcement learning techniques, including Deep Q-learning (see page 10, with respect to section 5 Q-learning), where LSPI is similar to Q-learning (see first paragraph of section 6 on page 11) and comprises a linear combination of predetermined basis functions (see third paragraph on page 12). Therefore, it would be reasonable to modify the value estimator and target value estimator of Zhou to be LSPI that comprise a linear combination of predetermined basis functions.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified the value estimator and target value estimator of Zhou, so as to comprise a linear combination of predetermined basis functions, in light of Sharma, with the predictable result of providing transparent and easy to implement algorithms (third paragraph on page 12 of Sharma).
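For clarity, a value estimator formed as a linear combination of predetermined basis functions, with a target value estimator sharing the same basis functions, can be sketched in Python as follows. This is a generic illustration only. The three basis functions, the weights, and the function names are invented for the example and are not taken from Zhou, Sharma, or the specification.

```python
# Hypothetical sketch: basis functions and weights are invented for illustration.

def linear_estimator(basis_functions, weights):
    """Build Q(obs, act) = sum over k of weights[k] * basis_functions[k](obs, act)."""
    def estimate(observation, action):
        return sum(
            w * phi(observation, action)
            for w, phi in zip(weights, basis_functions)
        )
    return estimate


# Predetermined basis functions shared by both estimators.
basis = [
    lambda obs, act: 1.0,        # constant term
    lambda obs, act: obs,        # depends on the observation only
    lambda obs, act: obs * act,  # interaction between observation and action
]

# Same basis functions, separately held parameter values.
value_estimator = linear_estimator(basis, weights=[0.5, 1.0, 2.0])
target_value_estimator = linear_estimator(basis, weights=[0.0, 1.0, 1.0])

first = value_estimator(2.0, 3.0)
candidate = target_value_estimator(2.0, 3.0)
```

Only the weights differ between the two estimators in this sketch, which is the sense in which they share the same predetermined basis functions.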
Allowable Subject Matter
Claim 2 is objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
With respect to claim 2, the closest prior art of record, Zhou, taken alone or in combination, does not teach that the claimed terminal reward is determined in dependence on location data indicating the location of the UAV with respect to a predetermined map when the failure condition is triggered, in light of the overall claim. No reasonable combination of prior art can be made to teach this claimed feature, in light of the overall claim.
Claims 4, 10-13, and 15 are objected to, but would be allowable if rewritten or amended to overcome the rejection(s) under 35 U.S.C. 101 set forth in this Office action.
With respect to claim 4, the closest prior art of record, Zhou, taken alone or in combination, does not teach that the claimed predetermined criteria for triggering the failure condition include the determined terminal reward being lower than the second estimated return for the first observation and the first action, in light of the overall claim. Specifically, Zhou discloses the predetermined criteria for triggering the failure condition as associated with cooperation index and competitive flag and does not specifically compare the second estimated return for the first observation and the first action to the determined terminal reward, so as to trigger a failure. No reasonable combination of prior art can be made to teach this claimed feature, in light of the overall claim.
With respect to claim 10, the closest prior art of record, Zhou, taken alone or in combination, does not teach that 
the environment is a physical environment; 
the control agent is arranged to control a physical entity in the physical environment, the physical entity having a plurality of physical components; 
the failure condition corresponds to a failure of one of the physical components, resulting in a reduced set of actions being available to the control agent; and 
the terminal reward associated with the triggering of the failure condition in the second state comprises an estimated return for the second observation taking into account the reduced set of actions available to the control agent, in light of the overall claim. 
No reasonable combination of prior art can be made to teach this claimed feature, in light of the overall claim.
With respect to claim 15, the closest prior art of record, Zhou, taken alone or in combination, does not teach updating parameter values of the target value estimator to match the current parameter values of the value estimator after a predetermined number of updates of the current parameter values of the value estimator, in light of the overall claim. No reasonable combination of prior art can be made to teach this claimed feature, in light of the overall claim.
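For illustration of the feature of claim 15 discussed above, the following Python sketch copies the parameter values of the value estimator to the target value estimator after a predetermined number of updates. It is a simplified assumption-based example and is not drawn from Zhou or from the specification. The single scalar parameter, the additive updates, and the names train_with_periodic_sync and sync_every are all hypothetical.

```python
# Simplified sketch: one scalar parameter stands in for the full parameter set.
# All names are hypothetical.

def train_with_periodic_sync(updates, sync_every):
    """Apply updates to the value estimator and sync the target every sync_every updates."""
    value_params = 0.0
    target_params = 0.0
    history = []
    for count, delta in enumerate(updates, start=1):
        value_params += delta  # update of the current parameter values
        if count % sync_every == 0:
            # Target parameters are set to match the current parameter values.
            target_params = value_params
        history.append((value_params, target_params))
    return history


history = train_with_periodic_sync([1.0, 1.0, 1.0, 1.0, 1.0], sync_every=2)
```

Between synchronisation points the target parameters stay fixed while the value estimator continues to change.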
Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Sara Lewandroski whose telephone number is (571)270-7766. The examiner can normally be reached Monday-Friday, 9 am-5 pm ET.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Elaine Gort, can be reached at (571)272-6781. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.




/SARA J LEWANDROSKI/Examiner, Art Unit 3661

/RUSSELL FREJD/Primary Examiner, Art Unit 3661