DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Arguments
Applicant’s arguments regarding claim 1 with respect to Fan et al., Ohta et al. and Brunstetter et al. have been fully considered but are not persuasive. Examiner’s response is set forth below.
Applicant’s argument that, in Fan et al., training of the machine learning model can occur off-line as part of the initial development of the system has been considered, but the examiner respectfully traverses the argument. The statement that training “can occur off-line” suggests that the training can be done both online and offline. Furthermore, [0058] states “Once deployed, the machine learning model 153 can be continually trained 510 or updated. For example, the training module uses data captured in the field to further train the machine learning model 153”. That is, the machine learning model is trained with real-time data.
Applicant’s arguments that Ohta et al. does not mention the word “training”, and regarding the definition of temperature, are not relevant because Ohta et al. is relied upon to teach time/space interpolation of the environmental data, not training of the machine learning model, which is taught by Fan et al.
Applicant’s argument that Fan et al., Ohta et al. and Brunstetter et al. do not teach claim 1 as a whole, specifically calculating a reward from the environment state, the next state and the action, and updating a parameter of the exploration model based on the environment state, the next state, the action and the reward, has been fully considered but is moot in view of the newly cited reference Hafner et al. in combination with the prior art of record, Fan et al. and Ohta et al. Hafner et al. explicitly teaches that an ensemble of machine learning models and neural networks is used to train the models, to calculate the reward based on the current state of the environment, the action taken based on that state, and the state of the environment after the action is taken, and to update the model parameters based on the reward and the environment states, as taught in [0008] and [0041]. Therefore it would have been obvious before the effective filing date of the claimed invention to a person of ordinary skill in the art to modify the action optimization device as taught by the combination of Fan et al. and Ohta et al. to include a separately trained exploration model (machine learning model), wherein a reward is calculated based on the environment state, the next environment state and the action taken based on the environment state, and the exploration model (machine learning model) is updated based on the reward, the environment state, the next environment state and the action taken based on the environment state, as taught by Hafner et al., in order to divide the burden of computation and of determining a course of action among an ensemble of machine learning models and thereby improve overall system computation and implementation accuracy.
Claim Objections
Claim 1 is objected to because of the following informalities:  
Regarding claim 1, in line 4 the term “the device” is missing the words “action optimization” before “device”. Suggested correction: in line 4, replace “the device” with “the action optimization device”.
Appropriate correction is required.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.

Claims 1-2, 4-7 and 9 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Regarding claim 1, in line 33, the limitation “a first step in which an environment state s3 into the exploration model....” is unclear. It cannot be determined from the claimed limitation how the environment state s3 gets into the exploration model. For examination purposes, the examiner interprets the limitation as “....environment state s3 is inputted into the exploration model...”
	Claims 2, 4-7 and 9 depend on claim 1 and inherit every limitation of claim 1; they are therefore rejected under 35 U.S.C. 112(b) for the reasons discussed above.
 
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 2, 4 and 6-9 are rejected under 35 U.S.C. 103 as being unpatentable over Fan et al. (US 20190187635 A1) in view of Ohta et al. (US 20200134891 A1) and Hafner et al. (US 20210201156 A1).
The teachings of Fan et al. and Ohta et al. as disclosed in the previous office action are hereby incorporated by reference to the extent applicable to the amended claims. 
Regarding claim 1, Fan et al. teaches, an action optimization device for optimizing an action for air conditioning in a target space (system having device to control an environmental system for a man-made structure, [0015]), comprising a processor and a memory connected to the processor (processor receiving instructions and data from memory [0085]), the device comprising:
an environmental data acquisition unit configured to acquire environmental data related to a state of the environment in the target space including a people flow, a temperature, and a humidity and to store the acquired environmental data in an environmental data storage unit (environmental data captured by sensors such as temperature, humidity and occupancy related to the target space within the man-made structure, [0018] and [0033]);
	an environment reproduction model training unit (machine learning model [0041]) configured to train an environment reproduction model, based on the environmental data, such that, when a state of an environment s1 and an action a1 for controlling the environment are input, a correct answer value of an environmental state s1 after the action a1 is output (the machine learning model stored on the memory is trained in a supervised manner with known environmental states and actions and the environmental states observed after some time interval, i.e., training set data, [0041], [0055], [0058] and [0085]);
an environment reproduction unit configured to predict a second environment state corresponding to the first environment state and the first action by using the trained environment reproduction model (trained machine learning model predicts the environment in the room for some time in the future based on current state and feedback (action), [0039] and [0071]-[0073]); and
an output unit configured to output a result of the exploration (deciding which action to take from a state by employing exploration, [0082]).
Fan et al. does not teach in detail performing time/space interpolation on the acquired environmental data according to a preset algorithm; training an exploration model which explores for an action to be taken for the second environmental state and outputting a result of the exploration; or inputting the first environment state, the second environment state, and the first action to a preset reward function, calculating a reward based on the inputted states and action, and updating the exploration model based on the reward. However, Fan et al. explicitly teaches acquiring data which are used by the machine learning model to predict future environmental states in [0039] and [0071]-[0073], and the controller explores different courses of action to be taken based on the predicted state in [0073]. In Fan et al., one model performs the combined function of predicting states and determining a course of action based on the predicted state, instead of two models in which one predicts the state and the other determines a course of action based on the state predicted by the first. However, Fan et al. mentions in [0039] that an ensemble of machine learning models can be used to perform the combined functions.

Ohta et al. teaches, an environmental data interpolation unit configured to perform time/space interpolation on the acquired environmental data according to a preset algorithm (" ... so as to calculate continuous values in terms of time or space, and perform interpolation of environmental data values of the air-conditioned space of discrete values acquired by the environmental data value acquisition unit, so as to calculate continuous values [fn. 1] in terms of time or space ... ", [0052] and Fig.6).
Therefore it would have been obvious before the effective filing date of the claimed invention to a person of ordinary skill in the art to apply the teachings of Fan et al., namely training the environment reproduction model based on acquired environmental data and using the trained environment reproduction model to predict the state of the target space based on the current environmental state and course of action, such that time/space interpolation is performed on the acquired environmental data as taught by Ohta et al., in order to improve data accuracy for the environment reproduction model.
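
For illustration only, the kind of time interpolation discussed above (continuous values calculated from discrete acquired values according to a preset algorithm) can be sketched as follows. This is a minimal sketch that assumes linear interpolation as the preset algorithm; the function name `interpolate_in_time` and the sample readings are hypothetical and are not taken from Ohta et al., Fan et al. or the claims.

```python
from bisect import bisect_right


def interpolate_in_time(samples, t):
    """Linearly interpolate a value at time t from discrete (time, value) samples.

    samples: list of (time, value) pairs sorted by time. The values here are
    hypothetical sensor readings, not data from any cited reference.
    """
    times = [time for time, _ in samples]
    # Outside the sampled range, hold the nearest endpoint value.
    if t <= times[0]:
        return samples[0][1]
    if t >= times[-1]:
        return samples[-1][1]
    # Locate the two samples that bracket t and interpolate between them.
    i = bisect_right(times, t)
    (t0, v0), (t1, v1) = samples[i - 1], samples[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


# Hypothetical temperature readings taken every 10 minutes.
readings = [(0.0, 20.0), (10.0, 22.0), (20.0, 21.0)]
value_at_5 = interpolate_in_time(readings, 5.0)
```

The same bracketing-and-weighting idea could be applied along a spatial coordinate instead of time; other preset algorithms (for example spline interpolation) would serve equally well under this assumption.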

Fan et al. and Ohta et al., neither individually nor in combination, teach in detail training an exploration model which explores for an action to be taken for the second environmental state and outputting a result of the exploration, or inputting the first environment state, the second environment state, and the first action to a preset reward function, calculating a reward based on the inputted states and action, and updating the exploration model based on the reward. However, Fan et al. mentions in [0039] that an ensemble of machine learning models can be used to perform the combined functions.

Hafner et al. teaches an exploration model training unit configured to train an exploration model such that an action a2 to be taken next is output when an environmental state s2 output from the environment reproduction model is input, and store the trained exploration model in the memory (an ensemble of Q network models, transition models and reward models is used; the output from the transition model is input to the Q network model to generate an output (next action to be taken) for training the Q network model, [0008]-[0009]);
wherein the environment reproduction unit is further configured to input the first environment state, the second environment state, and the first action to a preset reward function and to output a reward value (“...includes (i) an input observation, (ii) an action performed by the agent in response to the input observation, and (iii) a next observation characterizing a state that the environment transitioned into as a result of the agent performing the action in response to the observation and to process the reward input to generate a predicted reward received by the agent in response to performing the action”, [0008]),
the exploration model training unit performs training of the exploration model by performing the following steps:
a first step in which an environment state s3 is inputted into the exploration model to acquire an action a3 to be taken next (the Q network model determines what action to take based on environment state output from the transition model, [0049], [0051] and [0054]);
a second step in which a next state s3' of the environment state s3 and a reward r are acquired from the environment reproduction unit, the reward r being calculated from the environment state s3, the next state s3', and the action a3 (“...includes (i) an input observation, (ii) an action performed by the agent in response to the input observation, and (iii) a next observation characterizing a state that the environment transitioned into as a result of the agent performing the action in response to the observation and to process the reward input to generate a predicted reward received by the agent in response to performing the action” [0008], [0051] and [0054]); and
a third step in which a parameter of the exploration model is updated using the environment state s3, the next state s3', the reward r, and the action a3 (updating the neural network of the Q network model based on the interactions of the agents (including transition model and rewards model) with the environment, [0041] and [0054]).
Therefore it would have been obvious before the effective filing date of the claimed invention to a person of ordinary skill in the art to modify the action optimization device as taught by the combination of Fan et al. and Ohta et al. to include a separately trained exploration model (Q network model working with transition models and reward models), wherein a reward is calculated based on the environment state, the next environment state and the action taken based on the environment state, and the exploration model (Q network model working with transition models and reward models) is updated based on the reward, the environment state, the next environment state and the action taken based on the environment state, as taught by Hafner et al., in order to divide the burden of computation and of determining a course of action among an ensemble of machine learning models and thereby improve overall system computation and implementation accuracy.
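
For illustration only, the three training steps recited above (obtain an action for a state, obtain the next state and a reward calculated from the state, next state and action, then update a model parameter from all four) can be sketched as follows. This is a minimal sketch that substitutes tabular Q-learning for the Q network ensemble of Hafner et al.; the temperature range, the action set, `environment_model`, `reward_fn` and all numeric settings are hypothetical assumptions and do not represent the claimed device or any cited reference.

```python
import random

ACTIONS = [-1, 0, 1]  # hypothetical actions: cool by 1, hold, heat by 1
TARGET = 22           # hypothetical comfort temperature


def environment_model(state, action):
    """Stand-in for a trained environment reproduction model: returns the next state."""
    return max(15, min(30, state + action))


def reward_fn(state, action, next_state):
    """Hypothetical reward computed from the state, the action and the next state."""
    progress = abs(state - TARGET) - abs(next_state - TARGET)
    return progress - 0.1 * abs(action)  # small penalty for running the equipment


def greedy_action(q, state):
    """Action with the highest learned value for the given state."""
    return max(ACTIONS, key=lambda a: q[(state, a)])


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(15, 31) for a in ACTIONS}
    for _ in range(episodes):
        s = rng.randint(15, 30)
        for _ in range(20):
            # Step 1: input state s to the exploration model to obtain action a.
            a = rng.choice(ACTIONS) if rng.random() < epsilon else greedy_action(q, s)
            # Step 2: obtain next state s' and reward r computed from (s, a, s').
            s_next = environment_model(s, a)
            r = reward_fn(s, a, s_next)
            # Step 3: update the model parameter using s, s', r and a.
            best_next = max(q[(s_next, x)] for x in ACTIONS)
            q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
            s = s_next
    return q


q_table = train()
```

Under these assumptions, `greedy_action(q_table, 20)` selects heating and `greedy_action(q_table, 24)` selects cooling, since both move the state toward the hypothetical target.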
Fan et al. teach:
[0018] The control system 150 can receive various types of inputs, and from various
sources. This includes environmental data 131 captured by sensors 130 that monitor
the environment within the man-made structure. Examples include temperature,
humidity, pressure and air quality data. Air quality might include the concentration of
allergens or of particulates of a certain size. It might also include the detection of certain
substances: carbon monoxide, smoke, fragrances, negative ions, or other hazardous or
desirable substances. Environmental data 131 can also include lighting levels and
lighting color.


[0033] FIGS. 3A and 3B are a diagram illustrating a high-level flow for controlling an environmental system, according to an embodiment. FIG. 3B is a continuation of FIG. 3A. Whereas FIG. 1 illustrates control concepts in the form of a system block diagram, FIG. 3 organizes these concepts as a flow of data, actions and results. The input data 310 in FIG. 3A correspond to the inputs to the control system 150 in FIG. 1. The input data 310 includes sensor data 131 that characterizes the environment, data 136 for tracking occupants and other objects, data 137 from external sources, data 138 from occupants, operational data 112 from the environmental systems themselves, profile information 142 for the man-made structure and its occupants, and historical data 143. FIG. 3A lists examples of each of these categories, which were described previously with respect to FIG. 1.


[0041] The training module receives 511 a training set for training the machine learning
model in a supervised manner. Training sets typically are historical data sets of inputs
and corresponding responses. The training set samples the operation of the
environmental system, preferably under a wide range of different conditions. FIG. 3A
gives some examples of input data 310 that may be used for a training set. The
corresponding responses are observations after some time interval [fn. 2], such as the actual
temperature and humidity achieved, energy consumed and cost during the time interval,
occupant comfort feedback, etc.


[0055] In typical training 512, a training sample is presented as an input to the machine
learning model 153, which then predicts an output for a particular attribute. The
difference between the machine learning model's output and the known good output is
used by the training module to adjust the values of the parameters (e.g., features, weights, or biases) in the machine learning model 153. This is repeated for many different training samples to improve the performance of the machine learning model 153 until the deviation between prediction and actual response is sufficiently reduced.
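
For illustration only, the training loop described in [0055] (predict an output, compare it with the known good output, adjust the parameters, repeat over many samples) can be sketched as follows. This is a minimal sketch that assumes a one-variable linear model trained by stochastic gradient descent; the function name `train_linear_model`, the learning rate and the training samples are hypothetical and are not taken from Fan et al.

```python
def train_linear_model(samples, lr=0.01, epochs=2000):
    """Fit y = w * x + b by repeatedly adjusting w and b from the prediction error."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y_true in samples:
            y_pred = w * x + b           # model predicts an output for the sample
            error = y_pred - y_true      # difference from the known good output
            w -= lr * error * x          # adjust the weight to reduce the error
            b -= lr * error              # adjust the bias to reduce the error
    return w, b


# Hypothetical training set: (heater setting, temperature rise after the interval).
samples = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w, b = train_linear_model(samples)
```

With these hypothetical samples the fitted parameters approach w = 2 and b = 0, at which point the deviation between prediction and actual response is negligible.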


[0058] Training 510 of the machine learning model 153 can occur off-line, as part of the initial development and deployment of system 100. The trained model 153 is then deployed in the field. Once deployed, the machine learning model 153 can be continually trained 510 or updated. For example, the training module uses data captured in the field to further train the machine learning model 153. Because the training 510 is more computationally intensive, it may be cloud-based. [fn. 2]

[0085] Alternate embodiments are implemented in computer hardware, firmware,
software, and/or combinations thereof. Implementations can be implemented in a computer program product tangibly embodied in a machine-readable storage device for execution by a programmable processor; and method steps can be performed by a programmable processor executing a program of instructions to perform functions by operating on input data and generating output. Embodiments can be implemented advantageously in one or more computer programs that are executable on a programmable system including at least one programmable processor coupled to receive data and instructions from, and to transmit data and instructions to, a data storage system, at least one input device, and at least one output device .....


[0039] FIG. 5 is a flow diagram illustrating training and operation of a machine learning
model 153, according to an embodiment. The process includes two main phases:
training 510 the machine learning model 153 and inference (operation) 520 of the
machine learning model 153. These will be illustrated using an example where the
machine learning model learns to predict the environment in rooms (e.g., temperature,
humidity, lighting) and the energy consumption/cost based on historical data. The
following example will use the term "machine learning model" but it should be
understood that this is meant to also include an ensemble of machine learning models.


[0082] To decide which action to take from a state, the control system 150 may employ
techniques of exploitation and exploration. Exploitation refers to utilizing known
information. For example, a past sample shows that under certain conditions, a
particular action was taken, and good results were achieved. The control system may
choose to exploit this information, and repeat this action if current conditions are similar
to that of the past sample.


	Hafner et al. teach:



[0008] The system also maintains an ensemble of reward models, each reward model being
configured to receive a reward input that includes (i) an input observation, (ii) an action performed by the agent in response to the input observation, and (iii) a next observation characterizing a state that the environment transitioned into as a result of the agent performing the action in response to the observation and to process the reward input to generate a predicted reward received by the agent in response to performing the action.


[0009] The system then uses the ensembles of Q networks, transition models, and reward models to generate target Q values for transitions and then uses those target Q values to train the ensemble of Q networks. In particular, the system generates multiple different trajectories from a single transition and then interpolates between target Q values as of multiple different time steps within the multiple trajectories to determine the final target Q value for the transition.

[0049] As another particular example, the policy neural network 110 can generate action selection outputs 122 that define probability distributions over possible actions be performed by the agent and the training engine 116 can use the Q networks 160 to update the model parameters 118 of the policy neural network 110 using an actor-critic reinforcement learning technique, e.g., an asynchronous advantage actor-critic (A3C) reinforcement learning technique. In other words, the training engine 116 can train the Q networks jointly with the policy neural network 110 using an actor-critic technique.

[0051] Each transition model in the ensemble is configured to receive a transition input that includes (i) an input observation and (ii) an action performed by the agent in response to the input observation and to process the transition input to generate a predicted next observation characterizing a state that the environment transitioned into as a result of the agent performing the action [fn. 3] in response to the observation. In other words, the transition model is configured to predict the effect on the environment of the agent performing the input action when the environment is in the state characterized by the input observation.


[0054] The training engine 116 uses the ensembles of Q networks, transition models, and reward models to generate target Q values for transitions sampled from the transition buffer 114 and then uses those target Q values to train the ensemble of Q networks [fn. 4], which either collectively make up the policy neural network 110 or are used to repeatedly update the model parameters of the policy neural network 110, i.e., using one of the techniques described above.


	Regarding claim 2, the combination of Fan et al., Ohta et al. and Hafner et al. teaches the action optimization device according to claim 1. In addition, Hafner et al. teaches, the action exploration unit is configured to further explore for a second action to be taken for the second environmental state by using the trained exploration model (“At each of multiple time steps, the policy neural network 110 is configured to process an input that includes the current observation 120 characterizing the current state of the environment 104 in accordance with the model parameters 118 to generate an action selection output 122 (“action selection policy”),” [0033]). In addition, Fan et al. teaches, the environment reproduction unit is configured to further predict a third environment state corresponding to the second environment state and the second action using the trained environment reproduction model (“To simulate subsequent states [fn. 5], the control system 150 uses the trained machine learning model 153. When underlying conditions (e.g. weather) are changing, the machine learning model 153 can make predictions on what most likely will be observed as a result of actions taken. Based on these predictions, the control system 150 chooses a policy or action that most likely maximizes the metric of interest...”, [0081]).

Regarding claim 4, the combination of Fan et al., Ohta et al. and Hafner et al. teaches the action optimization device according to claim 1. In addition, Hafner et al. teaches, an environment prediction unit configured to perform future prediction by using a preset time-series analysis method based on the environmental data to generate environment prediction data (“At each time step [fn. 6], the state of the environment at the time step depends on the state of the environment at the previous time step and the action performed by the agent at the previous time step,” [0020]) and store the generated environment prediction data in the environment data storage unit as environment data (all the data obtained are stored in the computer readable media suitable for storing computer program instructions and data, [0116]).

	Regarding claim 6, the combination of Fan et al., Ohta et al. and Hafner et al. teaches the action optimization device according to claim 1. In addition, Fan et al. teaches, a policy data acquisition unit configured to acquire policy data specifying information to be used for at least one processing of training the environment reproduction model, training the exploration model (training the Q network model in view of Hafner et al. [0033] and [0041]), predicting the second environmental state, and exploring for the second action (the machine learning model works in conjunction with the policy engine to learn to determine a certain course of action based on the policy for the predicted state, [0078] and [0079]).
	Fan et al. teach:
[0079] Based on the current state 630, a policy engine 651 determines which policies
might be applicable to the current state. This might be done using a rules-based
approach, for example. The machine learning model 153 predicts the result of each
policy. The different results are evaluated and a course of action is selected 657 and
then carried out by the controller 659. A set of metrics is used to evaluate the policies.
For example, if the comfort zone is defined as being within a range of temperatures and
humidity, then a policy that results in actual temperatures outside the comfort zone for
too long when occupants are present is scored poorly. A policy that results in a high
volume of occupant complaints is scored poorly....

Regarding claim 7, the combination of Fan et al., Ohta et al. and Hafner et al. teaches the action optimization device according to claim 1. In addition, Fan et al. teaches, the action exploration unit is configured to: explore for, as the second first action, an action of a group unit for a control target group obtained by grouping a plurality of control targets based on a predetermined criterion in advance ("To decide which action to take from a state, the control system 150 may employ techniques of exploitation and exploration. Exploitation refers to utilizing known information. For example, a past sample shows that under certain conditions, a particular action was taken, and good results were achieved. The control system may choose to exploit this information, and repeat this action if current conditions are similar to that of the past sample [fn. 7]", [0082]), or a series of actions for one or more control targets for realizing a predetermined function (possible courses of action to be taken are evaluated for a predicted state, [0078]).

	Regarding claim 8, the combination of Fan et al., Ohta et al. and Hafner et al. teaches the claimed action optimization device. Therefore, together they teach the action optimization method implementing the functional steps of the action optimization device as discussed for claim 1 above.

	Regarding claim 9, the combination of Fan et al., Ohta et al. and Hafner et al. teaches the action optimization device of any one of claims 1 to 6. In addition, Fan et al. teaches, a non-transitory tangible computer-readable storage medium having stored thereon a program comprising instructions for causing a processor to execute so as to cause a computer (program stored in a storage device, where the storage device includes magnetic disks, hard disks, removable disks, etc., and the stored program, when executed by a processor, controls an environmental system for a man-made structure, [0015]).

Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over Fan et al. (US 20190187635 A1) in view of Ohta et al. (US 20200134891 A1) and Hafner et al. (US 20210201156 A1) and Kumaresan et al. (US 20220019678 A1).
Regarding claim 5, the combination of Fan et al., Ohta et al. and Hafner et al. teaches the action optimization device according to claim 1. In addition, Fan et al. teaches, the environment reproduction model training unit is configured to train the environment reproduction model by using the environmental data (based on environmental data, the machine learning model (environment reproduction model) is trained, [0041] and [0039]).
Fan et al., Ohta et al. and Hafner et al., neither individually nor in combination, teach performing data augmentation on the environmental data based on a random number. However, Fan et al. explicitly teaches in [0034] that the environmental data used for training undergoes pre-processing.
Kumaresan et al. teaches, perform data augmentation on the environmental data based on a random number (" .. It will also be appreciated that the irreversible transform of step 201 can include first augmenting the first data element by applying a unique salt value and subsequently generating a pseudo-random number with the augmented first data element as input seed [fn. 8], or applying a hash function to the augmented first data element, or any combination of these techniques", [0054]).
Therefore it would have been obvious before the effective filing date of the claimed invention to a person of ordinary skill in the art to modify the action optimization device using pre-processed environmental data for training as taught by the combination of Fan et al., Ohta et al. and Hafner et al., wherein the pre-processing involves data augmentation based on a random number as taught by Kumaresan et al., for improving deep learning robustness of the machine learning models.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Camilus et al. (US 20190378020 A1) teaches a building energy system for controlling environmental state in a target space using optimal control settings by implementing reinforcement learning.

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ANZUMAN SHARMIN whose telephone number is (571)272-7365. The examiner can normally be reached M and Th 7:30am - 3:30pm and Tue 8:00am-12:00pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, THOMAS LEE can be reached on (571)272-3667. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/ANZUMAN SHARMIN/Examiner, Art Unit 2115



/THOMAS C LEE/Supervisory Patent Examiner, Art Unit 2115

1. To calculate continuous values by interpolating acquired environmental data in space or time, a preset algorithm must be used. Calculation is not possible without an equation or algorithm.
2. Training the model with real-time data.
3. Action performed by the Q network model in view of [0049].
4. Output from transition model used as input to Q network model, see also [0059]-[0062].
5. Determining/predicting subsequent states in future.
6. Time series analysis.
7. Someone of ordinary skill in the art can train the machine learning model to apply the course of action taken for one space inside a man-made structure ([0015]) to another space/spaces inside the man-made structure having same or similar environmental dynamics to yield predictable results. MPEP § 2143(I)(D).
8. Augmented first data element is the pre-processed environmental data used for training in view of Fan et al.