DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority
Acknowledgment is made of applicant’s claim for foreign priority under 35 U.S.C. 119(a)-(d). The certified copy has been filed in parent Application No. GR20170100448, filed on 10/04/2017.
Information Disclosure Statement
The information disclosure statements (IDS) submitted on 04/03/2020 and 03/05/2021 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.

Claim Rejections - 35 USC § 102
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 19-21, 25-26, 28, 30-31, and 34-38 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Dabney et al., WO 2018/189404 A1(“Dabney”).
Regarding claim 19, Dabney teaches a machine learning system comprising a first subsystem and a second subsystem remote from the first subsystem(Dabney,  pg. 7, para. 0033, see also fig. 1, “FIG. 1 is a block diagram of an example reinforcement learning system 100. The reinforcement learning system 100 is an example of a system implemented as computer programs on one or more computers in one or more locations….”  ), the first subsystem comprising:
 a decision-making subsystem comprising one or more agents each arranged to receive state information indicative of a current state of an environment and to generate an action signal dependent on the received state information(Dabney, pg. 9, para. 0044, see also fig. 1, “The system 100 uses the distributional Q network 112 in selecting actions 102 to be performed by the agent 104 in response to observations 108 at each time step. In particular, at each time step, for each action from the set of actions that can be performed by the agent, the system 100 provides the action - current observation pair as an input to the distributional Q network 112. The distributional Q network 112 processes each action - current observation pair to generate outputs that define respective probability distributions 114 over the set of possible Q returns for each action - current observation pair.”)1
 and a policy associated with that agent, the action signal being configured to cause a change in a state of the environment(Dabney, pgs. 9-10, para. 0047, see also fig. 1, “The system 100 selects an action 102 to be performed by the agent 104 at the time step based on the measures of central tendency 122 corresponding to the actions. In some implementations, the system 100 selects an action having a highest corresponding measure of central tendency 122 from amongst all the actions in the set of actions that can be performed by the agent 104. In some implementations, the system 100 selects an action in accordance with an exploration strategy. For example, the system 100 may use an e-greedy exploration strategy. In this example, the system 100 may select an action having a highest corresponding measure of central tendency with probability 1 —e, and select an action randomly with probability e, where e is a number between 0 and 1.”), 
each agent further arranged to generate experience data dependent on the received state information and information conveyed by the action signal(Dabney, pg. 14, para. 0062, see also fig. 1, “The system obtains an experience tuple (302). The experience tuple includes data indicating: (i) a current training observation, (ii) a current action performed by the agent in response to the training observation, (iii) a current reward received in response to the agent performing the action, and (iv) a next training observation characterizing a state that the environment transitioned into as a result of the agent performing the action. The experience tuple may be an online experience tuple or an offline experience tuple. An online experience tuple refers to an experience tuple where the action included in the experience tuple was selected based on outputs generated by the distributional Q network in accordance with current values of distributional Q network parameters (e.g., as described with reference to FIG. 2). An offline experience tuple refers to an experience tuple where the action included in the experience tuple was selected based on any appropriate action selection policy (e.g., a random action selection policy).”); 
a first network interface configured to send experience data to the second subsystem and to receive policy data from the second subsystem, and the second subsystem comprising: a second network interface configured to receive experience data from the first subsystem and send policy data to the first subsystem;  and a computer-implemented policy learner configured to process said received experience data to generate said policy data, dependent on the experience data, for updating one or more policies associated with the one or more agents, wherein the decision-making subsystem is configured to update the policies associated with the one or more agents in accordance with policy data received from the second subsystem(Dabney, pg. 10, para. 0049, see also fig. 1, “The training engine 124 trains the distributional Q network 112 based on training data including a set of multiple experience tuples 126. Each experience tuple includes data indicating: (i) a training observation, (ii) an action performed by the agent in response to the training observation, (iii) a reward received in response to the agent performing the action, and (iv) a next training observation characterizing a state that the environment transitioned into as a result of the agent performing the action. The set of experience tuples 126 may include online experience tuples, offline experience tuples, or both. An online experience tuple refers to an experience tuple where the action included in the experience tuple was selected based on outputs generated by the distributional Q network 112 in accordance with current values of distributional Q network parameters. An offline experience tuple refers to an experience tuple where the action included in the experience tuple was selected based on any appropriate action selection policy (e.g., a random action selection policy).”).2  
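For illustration only, the e-greedy action selection and the experience tuple quoted from Dabney above can be sketched in Python. This is a minimal sketch under stated assumptions and is not Dabney's implementation: the mean is assumed as the measure of central tendency, each probability distribution is assumed to be a list of probabilities over a shared list of possible Q returns, and the names `ExperienceTuple`, `expected_return`, and `select_action` are hypothetical.

```python
import random
from typing import NamedTuple, Sequence


class ExperienceTuple(NamedTuple):
    """The four items listed in the quoted passage (field names are hypothetical)."""
    observation: object       # (i) training observation
    action: int               # (ii) action performed in response
    reward: float             # (iii) reward received
    next_observation: object  # (iv) observation of the resulting state


def expected_return(probs: Sequence[float], returns: Sequence[float]) -> float:
    """Mean of a categorical distribution over possible Q returns (assumed measure)."""
    return sum(p * z for p, z in zip(probs, returns))


def select_action(dists: Sequence[Sequence[float]],
                  returns: Sequence[float],
                  epsilon: float,
                  rng: random.Random) -> int:
    """Pick a random action with probability epsilon, else the highest-mean action."""
    if rng.random() < epsilon:
        return rng.randrange(len(dists))
    means = [expected_return(p, returns) for p in dists]
    return max(range(len(means)), key=lambda a: means[a])
```

With `epsilon` set to 0 the function always returns the action with the highest mean return; with `epsilon` set to 1 it always explores at random.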
Regarding claim 20, Dabney teaches the system of claim 19 wherein the sending of state information and action signals between the environment and the one or more agents is decoupled from the sending of experience data and policy data between the first subsystem and the second subsystem(Dabney, fig. 1: the exchange in which the action (102) is sent to the agent (104) in the environment (106) and state information is returned through the observation (108) is decoupled from the exchange in which the experience tuples (126) carry experience data to the second subsystem, i.e., the training engine (124), and policy data is sent to the first subsystem, i.e., the distributional Q neural network (112)).3  
Regarding claim 21, Dabney teaches the system of claim 19 wherein: the first subsystem and the second subsystem are configured to communicate with one another via an application programming interface, API(Dabney, pg. 23, para. 0093, “The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.”); and the experience data sent from the first subsystem to the second subsystem has a format specified by the API(Dabney, pg. 14, para. 0062, “The system obtains an experience tuple (302). The experience tuple includes data indicating: (i) a current training observation, (ii) a current action performed by the agent in response to the training observation, (iii) a current reward received in response to the agent performing the action, and (iv) a next training observation characterizing a state that the environment transitioned into as a result of the agent performing the action.”).   
Regarding claim 25, Dabney teaches the system of claim 19 wherein at least one of the first subsystem and the second subsystem is implemented as a distributed computing system(Dabney, pg. 21, para. 0084, “[C]an be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a data communication network.”).  
Regarding claim 26, Dabney teaches the system of claim 19 further comprising a probabilistic model arranged to generate probabilistic data relating to future states of the environment, wherein the one or more agents is arranged to generate the action signal in dependence on the probabilistic data(Dabney, pg. 9, paras. 0044-0045, “In particular, at each time step, for each action from the set of actions that can be performed by the agent, the system 100 provides the action - current observation pair as an input to the distributional Q network 112. The distributional Q network 112 processes each action - current observation pair to generate outputs that define respective probability distributions 114 over the set of possible Q returns for each action - current observation pair. For example, the outputs of the distributional Q network 112 may define: (i) a first probability distribution 116 over the set of possible Q returns for a first action – current observation pair, (ii) a second probability distribution 118 over the set of possible Q returns for a second action - current observation pair, and (iii) a third probability distribution 120 over the set of possible Q returns for a third action - current observation pair.”).4 
Regarding claim 28, Dabney teaches the system of claim 26 further comprising a model learner configured to process model input data to generate the probabilistic model(Dabney, pg. 9, paras. 0044-0045, “In particular, at each time step, for each action from the set of actions that can be performed by the agent, the system 100 provides the action - current observation pair as an input to the distributional Q network 112. The distributional Q network 112 processes each action - current observation pair to generate outputs that define respective probability distributions 114 over the set of possible Q returns for each action - current observation pair. For example, the outputs of the distributional Q network 112 may define: (i) a first probability distribution 116 over the set of possible Q returns for a first action – current observation pair, (ii) a second probability distribution 118 over the set of possible Q returns for a second action - current observation pair, and (iii) a third probability distribution 120 over the set of possible Q returns for a third action - current observation pair.”).  
Regarding claim 30, Dabney teaches the system of claim 28 wherein the model learner is further configured to process the experience data generated by the one or more agents to update the probabilistic model(Dabney, pgs. 10-11, para. 0050, see also fig. 3, “[A]t each training iteration, the training engine 124 may obtain and process an experience tuple to determine, for each possible Q return, a numerical value referred to in this specification as a projected sample update for the Q return… [t]he loss function may, for example, encourage the distributional Q network 112 to generate an output (i.e., in response to processing the action - training observation pair) that defines a probability distribution where the probability value for each Q return is similar to the projected sample update for the Q return.”).  
Regarding claim 31, Dabney teaches the system of claim 28, wherein the model learner is incorporated within the second subsystem(Dabney, pg. 10, para. 0049, see also fig. 1, “The training engine 124 trains the distributional Q network 112 based on training data including a set of multiple experience tuples 126.”).  
Regarding claim 34, Dabney teaches the system of claim 26 wherein: the system is configured to generate simulation data using the probabilistic model, the simulation data comprising simulated states of the environment(Dabney, pg. 12, para. 0054, “The system receives a current observation characterizing a current state of the environment (202). The current observation may be generated by or derived from sensors of the agent. For example, the current observation may be captured by a camera of the agent. As another example, the current observation may be derived from data captured from a laser sensor of the agent. As another example, the current observation may be a hyperspectral image captured by a hyperspectral sensor of the agent.”); and the one or more agents are configured to generate experience data based on interactions between the one or more agents and the simulated states of the environment(Dabney, pg. 12, para. 0055, “For each action from a set of actions that can be performed by the agent, the system determines a corresponding probability distribution over a set of possible Q returns (204). More specifically, for each action, the system provides the action - current observation pair as an input to a distributional Q neural network.” & see also Dabney, pg. 14, para. 0062, “An online experience tuple refers to an experience tuple where the action included in the experience tuple was selected based on outputs generated by the distributional Q network in accordance with current values of distributional Q network parameters.”).5  
Regarding claim 35, Dabney teaches the system of claim 19, wherein the environment is a model of a physical system(Dabney, pgs. 5-6, para. 0026, “[T]he system as described in this specification may be used to select actions to be performed by a robotic agent interacting with a real-world environment. In these cases, the system as described in this specification may enable the robotic agent to achieve acceptable performance more quickly, to perform actions which more effectively accomplish tasks, and to more readily adapt to previously unseen environments, than if the actions to be performed by the robotic agent were selected by a conventional system. For example, the agent may be a robotic agent that performs tasks such as moving objects between locations (e.g., in a shipping warehouse), assembling components (e.g., electronic components in a manufacturing environment), or navigating between locations (e.g., as an autonomous or semi-autonomous vehicle).”).
Regarding claim 36, Dabney teaches the system of claim 28, wherein: the environment is a model of a physical system(Dabney, pgs. 5-6, para. 0026, “[T]he system as described in this specification may be used to select actions to be performed by a robotic agent interacting with a real-world environment.”); and the model input data comprises measurements from one or more sensors in the physical system(Dabney, pg. 12, para. 0054, “The system receives a current observation characterizing a current state of the environment (202). The current observation may be generated by or derived from sensors of the agent.”).  
Regarding claim 37, Dabney teaches the system of claim 35, wherein the one or more agents are associated with physical entities in the physical system(Dabney, pgs. 5-6, para. 0026, “[T]he system as described in this specification may be used to select actions to be performed by a robotic agent interacting with a real-world environment.”), and the second subsystem is configured to send signals to the physical entities corresponding to the action signals generated by the agents(Dabney, pg. 10, para. 0048, “The system 100 includes a training engine 124 that is configured to train the distributional Q network 112 over multiple training iterations using reinforcement learning techniques. The training engine 124 trains the distributional Q network 112 by iteratively (i.e., at each training iteration) adjusting the current values of the distributional Q network parameters [i.e. send signals to the physical entities corresponding to the action signals generated by the agents]. By training the distributional Q network 112, the training engine 124 may, for example, cause the distributional Q network 112 to generate outputs that result in the selection of actions 102 to be performed by the agent 104….”).6  
Regarding claim 38, Dabney teaches the system of claim 37, wherein the second subsystem is configured to send control signals to the physical entities corresponding to the action signals generated by the agents(Dabney, pg. 10, para. 0048, “The system 100 includes a training engine 124 that is configured to train the distributional Q network 112 over multiple training iterations using reinforcement learning techniques. The training engine 124 trains the distributional Q network 112 by iteratively (i.e., at each training iteration) adjusting the current values of the distributional Q network parameters [i.e. send control signals to the physical entities corresponding to the action signals generated by the agents]. By training the distributional Q network 112, the training engine 124 may, for example, cause the distributional Q network 112 to generate outputs that result in the selection of actions 102 to be performed by the agent 104….”).  

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.
The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 22-24 are rejected under 35 U.S.C. 103 as being unpatentable over Dabney et al., WO 2018/189404 A1(“Dabney”) in view of Nair, Arun, et al. "Massively parallel methods for deep reinforcement learning." arXiv preprint arXiv:1507.04296 (2015)(“Nair”).
Regarding claim 22, Dabney teaches the system of claim 19, but does not teach wherein the decision-making subsystem comprises a plurality of agents. 
However, Nair teaches: wherein the decision-making subsystem comprises a plurality of agents(Nair, pgs. 3-4, right-col, see also fig. 2, “Any reinforcement learning agent must ultimately select actions $a_t$ to apply in its environment. We refer to this process as acting. The Gorila architecture contains $N_{act}$ different actor processes, applied to $N_{act}$ corresponding instantiations of the same environment. Each actor i generates its own trajectories of experience…within the environment, and as a result each actor may visit different parts of the state space.”). 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Dabney with the teachings of Nair. The motivation to do so would be to have multiple agents interact in a given environment to reduce overall training time(Nair, pg. 2, left-column, “We applied our distributed framework for RL, known as Gorila (General Reinforcement Learning Architecture), to create a massively distributed version of the DQN algorithm… Gorila DQN…trained much faster than the non-distributed version in terms of wall-time, reaching the performance of single GPU DQN roughly ten times faster.”).
Regarding claim 23, Dabney in view of Nair teaches the system of claim 22, wherein the decision-making subsystem comprises a co-ordinator configured to: receive the state information from the plurality of agents; determine a set of actions for the plurality of agents in dependence on the received state information; and send instructions to each of the plurality of agents to perform the determined actions, and wherein each of the plurality of agents is arranged to receive the instructions from the co-ordinator and to generate the action signal based on the received instructions(Nair, pgs. 4-5, right-column, see also fig. 2(Learner, Q Network, Parameter Server, and Bundled Mode), “The simplest overall instantiation of Gorila, which we consider in our subsequent experiments, is the bundled mode in which there is a one-to-one correspondence between actors, replay memory, and learners ($N_{act} = N_{learn}$). Each bundle has an actor generating experience, a local replay memory to store that experience, and a learner that updates parameters based on samples of experience from the local replay memory. The only communication between bundles is via parameters: the learners communicate their gradients to the parameter server; and the Q-networks in the actors and learners are periodically synchronized to the parameter server.”).  
Regarding claim 24, Dabney in view of Nair teaches the system of claim 23, wherein the co-ordinator is configured to determine a set of actions for the plurality of agents in order to avoid a predetermined set of states of the environment(Nair, pgs. 4-5, right-column, see also fig. 2(Learner, Q Network, Parameter Server, and Bundled Mode), “The simplest overall instantiation of Gorila, which we consider in our subsequent experiments, is the bundled mode in which there is a one-to-one correspondence between actors, replay memory, and learners ($N_{act} = N_{learn}$). Each bundle has an actor generating experience, a local replay memory to store that experience, and a learner that updates parameters based on samples of experience from the local replay memory. The only communication between bundles is via parameters: the learners communicate their gradients to the parameter server; and the Q-networks in the actors and learners are periodically synchronized to the parameter server.”).
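For illustration only, the bundled mode quoted from Nair above (one actor, one local replay memory, and one learner per bundle, with bundles communicating only through a parameter server) can be sketched in Python. This is a minimal sketch under stated assumptions and is not the Gorila code: parameters are assumed to be a flat list of floats, the server is assumed to apply a plain gradient-descent step, and the names `ParameterServer`, `Bundle`, and `grad_fn` are hypothetical.

```python
from typing import Callable, List, Sequence


class ParameterServer:
    """Holds the shared parameters and applies gradients sent by learners."""

    def __init__(self, params: Sequence[float], lr: float) -> None:
        self.params: List[float] = list(params)
        self.lr = lr

    def apply_gradient(self, grad: Sequence[float]) -> None:
        # Assumed update rule: one plain gradient-descent step per received gradient.
        self.params = [p - self.lr * g for p, g in zip(self.params, grad)]

    def snapshot(self) -> List[float]:
        return list(self.params)


class Bundle:
    """One actor, one local replay memory, and one learner."""

    def __init__(self, server: ParameterServer) -> None:
        self.server = server
        self.local_params = server.snapshot()
        self.replay: list = []

    def act(self, experience: object) -> None:
        # The actor stores the experience it generates in the local replay memory.
        self.replay.append(experience)

    def learn(self, grad_fn: Callable[[List[float], list], Sequence[float]]) -> None:
        # The learner computes a gradient locally and sends only that to the server.
        self.server.apply_gradient(grad_fn(self.local_params, self.replay))

    def sync(self) -> None:
        # Periodic synchronization of the local network with the parameter server.
        self.local_params = self.server.snapshot()
```

In this sketch a bundle keeps using its stale local parameters until `sync` is called, which mirrors the periodic synchronization described in the quoted passage.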

Claim 27 is rejected under 35 U.S.C. 103 as being unpatentable over Dabney et al., WO 2018/189404 A1(“Dabney”) in view of Mahmud, Tahmida, et al. "A poisson process model for activity forecasting." 2016 IEEE International Conference on Image Processing (ICIP). IEEE, 2016(“Mahmud”).
Regarding claim 27, Dabney teaches the system of claim 26, but does not teach: wherein: the environment comprises a domain having a temporal dimension and the probabilistic model comprises a distribution of a stochastic intensity function, wherein an integral of the stochastic intensity function over a sub-region of the domain corresponds to a rate parameter of a Poisson distribution for a predicted number of events occurring in the sub-region.
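For illustration only, the relationship recited in this limitation (the integral of an intensity function over a sub-region giving the rate parameter of a Poisson distribution for the number of events in that sub-region) can be sketched in Python. This is a minimal sketch under stated assumptions and is not taken from the application or from either reference: it treats a single fixed realization of the intensity function on a one-dimensional time interval, uses the trapezoidal rule for the integral, and the names `poisson_rate` and `poisson_pmf` are hypothetical.

```python
import math
from typing import Callable


def poisson_rate(intensity: Callable[[float], float],
                 t1: float, t2: float, steps: int = 1000) -> float:
    """Approximate the integral of intensity(t) over [t1, t2] (trapezoidal rule)."""
    h = (t2 - t1) / steps
    total = 0.5 * (intensity(t1) + intensity(t2))
    for i in range(1, steps):
        total += intensity(t1 + i * h)
    return total * h


def poisson_pmf(k: int, rate: float) -> float:
    """Probability of exactly k events under a Poisson distribution with this rate."""
    return math.exp(-rate) * rate ** k / math.factorial(k)
```

For example, an assumed intensity of 2t events per unit time integrates to a rate parameter of 9 over the interval [0, 3], and `poisson_pmf(k, 9.0)` then gives the probability of k events in that interval.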
However, Mahmud teaches wherein: the environment comprises a domain having a temporal dimension and the probabilistic model comprises a distribution of a stochastic intensity function, wherein an integral of the stochastic intensity function over a sub-region of the domain corresponds to a rate parameter of a Poisson distribution for a predicted number of events occurring in the sub-region(Mahmud, pg. 3340-3341, left-col, see also fig. 1, “If a video is observed upto time t with K occurrences of activities, $A = \{a_i\}_{i=1}^{K}$, the $K^{th}$ activity, $a_k$ occurred at time $t_K$ and the occurrence time of the next unobserved activity, $a_{k+1}$, is $t_k + T_K$, then we want to predict this inter-activity time $T_K$ as shown in Fig. 1[i.e. the environment comprises a domain having a temporal dimension]… [f]or an IPP with rate function $\lambda(t)$ the expected number of events in the time interval [$t_1, t_2$] is
                            
                                
                                    N
                                
                                
                                    
                                        
                                            t
                                        
                                        
                                            1
                                        
                                    
                                    ,
                                     
                                    
                                        
                                            t
                                        
                                        
                                            2
                                        
                                    
                                
                            
                            =
                            
                                
                                    ∫
                                    
                                        
                                            
                                                t
                                            
                                            
                                                1
                                            
                                        
                                    
                                    
                                        
                                            
                                                t
                                            
                                            
                                                2
                                            
                                        
                                    
                                
                                
                                    λ
                                    
                                        
                                            t
                                        
                                    
                                    d
                                    t
                                
                            
                        
                    … [i]n LGCP,                         
                            λ
                            
                                
                                    t
                                
                            
                        
                     is assumed to be stochastic. We assume the occurrences of all the activities as a single Poisson process and associate a specific intentisy function                         
                            λ
                            
                                
                                    t
                                
                            
                        
                     =                         
                            e
                            x
                            p
                            ⁡
                            (
                            f
                            
                                
                                    t
                                
                            
                            )
                        
                    … [s]o, finally the approximate density of                         
                            
                                
                                    T
                                
                                
                                    K
                                
                            
                        
                     becomes [equation 5][i.e. the probabilistic model comprises a distribution of a stochastic intensity function, wherein an integral of the stochastic intensity function over a sub-region of the domain corresponds to a rate parameter of a Poisson distribution for a predicted number of events occurring in the sub-region].” ).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Dabney with the teachings of Mahmud. The motivation to do so would be to model waiting-to-occur events in a reinforcement learning environment using a Poisson distribution (Mahmud, pgs. 3339-3340, “We leverage upon a Poisson process for modeling the inter-activity time. Because of the bursty nature of activities in most of the video datasets, their interarrival times generally follow an exponential distribution. It is therefore justified to consider them as a part of a Poisson process since the distribution of inter-arrival time for a Poisson.”).
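For illustration only, the relationship quoted from Mahmud above (the integral of the intensity over an interval gives the Poisson rate parameter for the number of events in that interval, with the intensity given by the exponential of a latent function) can be sketched numerically. This sketch is not taken from Mahmud or from the claims. The latent function `latent_f`, the interval [0, 2], and the trapezoid-rule integration are all assumptions chosen for the example.

```python
import math


def expected_events(rate_fn, t1, t2, steps=1000):
    """Trapezoid-rule estimate of the integral of rate_fn over [t1, t2]."""
    h = (t2 - t1) / steps
    total = 0.5 * (rate_fn(t1) + rate_fn(t2))
    for i in range(1, steps):
        total += rate_fn(t1 + i * h)
    return total * h


def poisson_pmf(k, mu):
    """Probability of exactly k events when the count is Poisson with mean mu."""
    return math.exp(-mu) * mu ** k / math.factorial(k)


def latent_f(t):
    """Hypothetical latent function (assumption for illustration only)."""
    return 0.5 * t


def intensity(t):
    """Intensity lambda(t) = exp(f(t)), which is positive for any real f(t)."""
    return math.exp(latent_f(t))


# Rate parameter for the sub-region [0, 2]: the integral of the intensity.
mu = expected_events(intensity, 0.0, 2.0)

# Probability that no event occurs in [0, 2] under the Poisson model.
p_zero = poisson_pmf(0, mu)
```

In the log Gaussian Cox process described in the quotation, the latent function would itself be random, so `mu` would be a random quantity rather than the single fixed value computed here.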

Claim 29 is rejected under 35 U.S.C. 103 as being unpatentable over Dabney et al., WO 2018/189404 A1 (“Dabney”) in view of Hensman, James, et al. "Variational Fourier features for Gaussian processes." arXiv preprint arXiv:1611.06740 (2016) (“Hensman”).
Regarding claim 29, Dabney teaches the system of claim 27, further comprising a model learner configured to process model input data to generate the probabilistic model, wherein: the model input data comprises data indicative of events occurring in past states of the environment(Dabney, pg. 19, para. 0079, “For example, the system may update the projected sample update for the Q return having an index (i.e., in the range {0, ... ,N — 1}) that matches the floor of the remapped sample update[i.e. the model input data comprises data indicative of events occurring in past states of the environment]. In this example, the system can update the projected sample update for the Q return having an index that matches the floor of the remapped sample update….”). 
But Dabney does not teach: and processing the model input data to generate the probabilistic model comprises applying a Bayesian inference scheme to the model input data, wherein applying the Bayesian inference scheme comprises: generating a variational Gaussian process corresponding to a distribution of a latent function, the variational Gaussian process being dependent on a prior Gaussian process and a plurality of randomly-distributed inducing variables, the inducing variables having a variational distribution and expressible in terms of a plurality of Fourier components; determining, using the data indicative of events occurring in past states of the environment, a set of parameters for the variational distribution, wherein determining the set of parameters comprises iteratively updating a set of intermediate parameters to determine an optimal value of an objective function, the objective function being dependent on the inducing variables and expressible in terms of the plurality of Fourier components; and determining, from the variational Gaussian process and the determined set of parameters, the distribution of the stochastic intensity function, wherein the distribution of the stochastic intensity function corresponds to a distribution of a square of the latent function.
However, Hensman teaches: and processing the model input data to generate the probabilistic model comprises applying a Bayesian inference scheme to the model input data, wherein applying the Bayesian inference scheme comprises: generating a variational Gaussian process corresponding to a distribution of a latent function, the variational Gaussian process being dependent on a prior Gaussian process and a plurality of randomly-distributed inducing variables, the inducing variables having a variational distribution (Hensman, pgs. 6-7, “The variational approximation to Gaussian processes provides a more elegant, flexible and extensible solution in that the posterior distribution of the original model is approximated, rather than the model itself [i.e. applying a Bayesian inference scheme to the model input data, wherein applying the Bayesian inference scheme comprises: generating a variational Gaussian process corresponding to a distribution of a latent function]…we introduce a set of pseudo inputs (or `inducing points') Z…[w]e collect the values of the function at Z into a vector u…and write the process conditioned on these values as [equation 22][i.e. the variational Gaussian process being dependent on a prior Gaussian process and a plurality of randomly-distributed inducing variables, the inducing variables having a variational distribution]…”) and expressible in terms of a plurality of Fourier components; determining, using the data indicative of events occurring in past states of the environment, a set of parameters for the variational distribution, wherein determining the set of parameters comprises iteratively updating a set of intermediate parameters to determine an optimal value of an objective function, the objective function being dependent on the inducing variables and expressible in terms of the plurality of Fourier components (Hensman, pg. 13, “To obtain a Fourier feature that has finite variance, we can simply truncate the integral, using the limits a, b [resulting in equation 43]… [f]or $x \in [a, b]$, the covariance between such inducing variables and the GP at x is [the resulting equation 45]… [t]hese inter-domain inducing variables have two desirable properties: they have finite variance and their associated covariance matrix $K_{uu}$ can be written as a diagonal matrix plus some rank one matrices…[f]urthermore, the feature vector $K_u$, which is made by evaluating $\mathrm{cov}(u_m, f(x))$ is almost sinusoidal, aside from some rescaling and edge effects.”); and determining, from the variational Gaussian process and the determined set of parameters, the distribution of the stochastic intensity function, wherein the distribution of the stochastic intensity function corresponds to a distribution of a square of the latent function (Hensman, pg. 16, “[U]sing the fact that the functions from $\phi$ are very regular it is possible to extend the operators $P_{\phi_m} : h \mapsto \langle \phi_m, h \rangle_{H}$ to square integrable functions using integration by parts… [i]t is now possible to apply these operators to the Gaussian process in order to construct the inducing variables [as seen in equation 55] and covariance of the inducing variable [as seen in equation 56].”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Dabney with the teachings of Hensman. The motivation to do so would be to model waiting-to-occur events in a reinforcement learning environment using Gaussian noise (Hensman, pgs. 24-27, “The data set consists of flight arrival and departure times for every commercial flight in the USA for the year 2008. Each record is complemented with details on the flight and the aircraft. We predict the delay of the aircraft at landing (in minutes), y… for this estimation problem, we use a Gaussian process regression model and assume the observations to be corrupted by independent Gaussian noise.”).

Claims 32-33 are rejected under 35 U.S.C. 103 as being unpatentable over Dabney et al., WO 2018/189404 A1(“Dabney”) in view of Mnih, Volodymyr, et al. "Playing atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013)(“Mnih”).
Regarding claim 32, Dabney teaches the system of claim 28, but does not teach: further comprising a model input subsystem for pre-processing the model input data in preparation for processing by the model learner, wherein pre-processing the model input data comprises at least one of: cleaning the model input data; transforming the model input data; and validating the model input data.
However, Mnih teaches: further comprising a model input subsystem for pre-processing the model input data in preparation for processing by the model learner, wherein pre-processing the model input data comprises at least one of: cleaning the model input data; transforming the model input data; and validating the model input data (Mnih, pgs. 5-6, “[W]e apply a basic preprocessing step aimed at reducing the input dimensionality. The raw frames are preprocessed by first converting their RGB representation to gray-scale and down-sampling it to a 110 x 84 image. The final input representation is obtained by cropping an 84 x 84 region of the image that roughly captures the playing area…accurately evaluating the progress of an agent during training can be challenging. Since our evaluation metric…is the total reward the agent collects in an episode or game averaged over a number of games, we periodically compute it during training.”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the teachings of Dabney with the teachings of Mnih. The motivation to do so would be to decrease the computation required for image-based reinforcement learning (Mnih, pg. 5, “Working directly with raw Atari frames, which are 210 x 160 pixel images with a 128 color palette, can be computationally demanding, so we apply a basic preprocessing step aimed at reducing the input dimensionality.”).
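For illustration only, the transformation steps quoted from Mnih above (gray-scale conversion, down-sampling a 210 x 160 frame to 110 x 84, and cropping an 84 x 84 region) can be sketched as follows. This is not Mnih's implementation. The luminance weights, the nearest-neighbour resize, and the crop offset `crop_top` are assumptions, since the quoted passage does not specify them.

```python
def to_grayscale(frame):
    """Convert rows of (r, g, b) tuples to rows of luminance values.

    The weights are the common BT.601 luma coefficients (an assumption).
    """
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in frame]


def downsample(image, out_h, out_w):
    """Nearest-neighbour resize of a 2D list to out_h rows by out_w columns."""
    in_h, in_w = len(image), len(image[0])
    return [[image[y * in_h // out_h][x * in_w // out_w]
             for x in range(out_w)]
            for y in range(out_h)]


def crop(image, top, height, width):
    """Keep `height` rows starting at row `top`, and the first `width` columns."""
    return [row[:width] for row in image[top:top + height]]


def preprocess(frame, crop_top=18):
    """Gray-scale, down-sample to 110 x 84, then crop to 84 x 84.

    crop_top is a hypothetical offset; the quoted text only says the crop
    "roughly captures the playing area".
    """
    gray = to_grayscale(frame)
    small = downsample(gray, 110, 84)
    return crop(small, crop_top, 84, 84)


# Usage: a synthetic all-white 210 x 160 RGB frame (rows x columns).
frame = [[(255, 255, 255)] * 160 for _ in range(210)]
out = preprocess(frame)
```

The result has 84 x 84 = 7,056 values per frame, compared with 210 x 160 x 3 = 100,800 values in the raw RGB frame, which is the reduction in input dimensionality that the quoted passage relies on.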
Regarding claim 33, Dabney in view of Mnih teaches the system of claim 32, wherein the model input subsystem is configured to validate the model input data by checking whether the model input data includes one or more expected fields (Mnih, pg. 5, “The final cropping stage is only required because we use the GPU implementation of 2D convolutions…expects square inputs.”).7
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
US 10,346,741 B2 (details an asynchronous deep reinforcement learning  environment with multiple-agents interacting in an environment using a distributed computing system)
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Adam Clark Standke whose telephone number is (571)270-1806. The examiner can normally be reached 10AM-7PM M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michael J Huntley, can be reached on (303) 297-4307. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



Adam Clark Standke
Assistant Examiner
Art Unit 2129
/MICHAEL J HUNTLEY/Supervisory Patent Examiner, Art Unit 2129

        1 According to the broadest reasonable interpretation (BRI), the use of alternative language amounts to the claim requiring one or more elements but not all.
        2 According to the broadest reasonable interpretation (BRI), the use of alternative language amounts to the claim requiring one or more elements but not all.
        3 According to the broadest reasonable interpretation (BRI), the use of alternative language amounts to the claim requiring one or more elements but not all.
        4 According to the broadest reasonable interpretation (BRI), the use of alternative language amounts to the claim requiring one or more elements but not all.
        5 According to the broadest reasonable interpretation (BRI), the use of alternative language amounts to the claim requiring one or more elements but not all.
        6 According to the broadest reasonable interpretation (BRI), the use of alternative language amounts to the claim requiring one or more elements but not all.
        7 According to the broadest reasonable interpretation (BRI), the use of alternative language amounts to the claim requiring one or more elements but not all.