Detailed Action
Notice of Pre-AIA or AIA Status
The present application, filed on or after Aug 28, 2019, is being examined under the first inventor to file provisions of the AIA.
Claims 1-6 are pending.
Claims 7 and 8 are not elected.

Election/Restrictions
Restriction to one of the following inventions is required under 35 U.S.C. 121:
I. Claims 1-6, drawn to learning methods, classified in CPC G06N3/08
II. Claim 7, drawn to explanation of inference steps, classified in CPC G06N5/045
III. Claim 8, drawn to machine learning, classified in CPC G06N20/00
The inventions are independent or distinct, each from the other because:
Inventions I, II and III are related as subcombinations disclosed as usable together in a single combination. The subcombinations are distinct if they do not overlap in scope and are not obvious variants, and if it is shown that at least one subcombination is separately usable. In the instant case, subcombination II has separate utility, such as providing an explanation path for a machine learning model, other than a neural network, that comprises a decision tree classifier. In the instant case, subcombination III has separate utility, such as validating prediction algorithms other than neural networks by partitioning validation data and inputting the partitions into the algorithm. See MPEP § 806.05(d).
The examiner has required restriction between subcombinations usable together. Where applicant elects a subcombination and claims thereto are subsequently found allowable, any claim(s) depending from or otherwise requiring all the limitations of the allowable subcombination will be examined for patentability in accordance with 37 CFR 1.104. See MPEP § 821.04(a). Applicant is advised that if any claim presented in a continuation or divisional application is anticipated by, or includes all the limitations of, a claim that is allowable in the present application, such claim may be subject to provisional statutory and/or nonstatutory double patenting rejections over the claims of the instant application.
Restriction for examination purposes as indicated is proper because all the inventions listed in this action are independent or distinct for the reasons given above and there would be a serious search and/or examination burden if restriction were not required because one or more of the following reasons apply:
a. While the groupings are classified together, the groupings have a separate status in the art because the inventions perform different functions in the art. Inventions related to learning methods generally provide for inputting training data into one or more neural networks and training the neural networks to output desired data, and are classified in CPC G06N3/08. Inventions related to explanation of inference steps generally provide ways to explain the result of an inference, and are classified in G06N5/045. Inventions related to machine learning generally provide evaluation methods for machine learning, and are classified in G06N20/00. These fields are separate areas of inventive effort despite being classified under the parent symbol/class G06N, “COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS”.
Applicant is advised that the reply to this requirement to be complete must include (i) an election of an invention to be examined even though the requirement may be traversed (37 CFR 1.143) and (ii) identification of the claims encompassing the elected invention.
The election of an invention may be made with or without traverse. To reserve a right to petition, the election must be made with traverse. If the reply does not distinctly and specifically point out supposed errors in the restriction requirement, the election shall be treated as an election without traverse. Traversal must be presented at the time of election in order to be considered timely. Failure to timely traverse the requirement will result in the loss of right to petition under 37 CFR 1.144. If claims are added after the election, applicant must indicate which of these claims are readable upon the elected invention.
Should applicant traverse on the ground that the inventions are not patentably distinct, applicant should submit evidence or identify such evidence now of record showing the inventions to be obvious variants or clearly admit on the record that this is the case. In either instance, if the examiner finds one of the inventions unpatentable over the prior art, the evidence or admission may be used in a rejection under 35 U.S.C. 103 or pre-AIA 35 U.S.C. 103(a) of the other invention.
During a telephone communication on July 11, 2022, applicant made an oral election without traverse of claims 1-6. Applicant is requested to cancel the non-elected claims in response to the next Office action.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1-6 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.

Regarding claim 1, 
2A Prong 1: The limitation of providing a recommended probability of each action from the plurality of actions, wherein a raw action probability output is adjusted to the recommended probability by incorporating domain knowledge of the process, is a mental process, as it merely recites using domain knowledge to predict a probability for each action.
2A Prong 2: This judicial exception is not integrated into a practical application. The limitation of obtaining training data of a process, the training data comprising information about a current process state, an action from a plurality of actions applied to the current process state, a next process state obtained by applying the action to the current process state, a reward based on a metric of the process, the reward depending on the current process state, the action, and the future process state, and a long-term reward comprising the reward and one or more future rewards is an insignificant extra-solution activity.
2B: The claim(s) does/do not include additional elements that are sufficient to amount to significantly more than the judicial exception. The limitation of obtaining training data of a process, the training data comprising information about a current process state, an action from a plurality of actions applied to the current process state, a next process state obtained by applying the action to the current process state, a reward based on a metric of the process, the reward depending on the current process state, the action, and the future process state, and a long-term reward comprising the reward and one or more future rewards is mere data gathering (MPEP 2106.05(g)), as it merely recites a process of obtaining different training data. The policy gradient algorithm and the neural network are a field of use and technological environment (MPEP 2106.05(h)). Training a neural network on the training data is also a field of use and technological environment (MPEP 2106.05(h)), as a training process is common in the field of machine learning.

Regarding claim 2, the limitation wherein the policy gradient incorporates an imaginary long-term reward of an augmented action to adjust the raw probability is a mental process, as it merely recites a process of using a value to adjust probability values. The policy gradient is a field of use and technological environment (MPEP 2106.05(h)), as it merely recites the name of the method.
This judicial exception is not integrated into a practical application. The claim(s) does/do not include additional elements that are sufficient to amount to significantly more than the judicial exception.

Regarding claim 3, the limitation of wherein the training data further comprises one or more constraints on each action, with each constraint in a form of an action mask is a field of use and technological environment (MPEP 2106.05(h)), as it merely recites the name of the method.
This judicial exception is not integrated into a practical application. The claim(s) does/do not include additional elements that are sufficient to amount to significantly more than the judicial exception.

Regarding claim 4, the limitation of incorporating an imaginary long-term reward of an augmented action and the one or more constraints to adjust the raw probability is a mental process, as it merely recites a process of using a value to adjust probability values. The policy gradient is a field of use and technological environment (MPEP 2106.05(h)), as it merely recites the name of the method.
This judicial exception is not integrated into a practical application. The claim(s) does/do not include additional elements that are sufficient to amount to significantly more than the judicial exception.

Regarding claim 5, the limitations wherein the process is a blast furnace process for production of molten steel, the metric is a chemical composition metric of the molten steel, and the recommended probability of each action relates to operation of a fuel injection rate of the blast furnace are a field of use and technological environment (MPEP 2106.05(h)).
This judicial exception is not integrated into a practical application. The claim(s) does/do not include additional elements that are sufficient to amount to significantly more than the judicial exception.

Regarding claim 6, the limitation of a method for incorporation of a constraint on one or more actions, the method comprising application of an action mask to a probability of each action output, is a mental process, as it merely recites using a constraint on actions and applying an operation to the probability of each action. A neural network, a module, and training of a reinforcement learning module are a field of use and technological environment (MPEP 2106.05(h)).
This judicial exception is not integrated into a practical application. The claim(s) does/do not include additional elements that are sufficient to amount to significantly more than the judicial exception.

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.


Claim 6 is rejected under 35 U.S.C. 102(a)(1) over Williams (Williams, 2017, “Hybrid Code Networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning”).

Regarding claim 6, Williams teaches a method for incorporation of a constraint on one or more actions in training of a reinforcement learning module, the method comprising application of an action mask to a probability of each action output by a neural network of the module (abstract, [Williams, page 666, right column, line 13-19; page 2, Figure 1] “The feature components from steps 1-5 are concatenated to form a feature vector (step 6). This vector is passed to an RNN, such as a long short-term memory (LSTM). The RNN computes a hidden state (vector), which is retained for the next timestep (step 8), and passed to a dense layer with a softmax activation, with output dimension equal to the number of distinct system action templates (step 9). Thus the output of step 9 is a distribution over action templates. Next, the action mask is applied as an element-wise multiplication, and the result is normalized back to a probability distribution (step 10) – this forces non-permitted actions to take on probability zero. From the resulting distribution (step 11), an action is selected (step 12)”; items 6, 7, 9, and 10 of Figure 1 show the process of applying the action mask to the output of the RNN and obtaining probability distributions. The paragraph also states that applying the action mask imposes a constraint on the actions (probability zero for non-permitted actions), and Figure 1 indicates that the shaded boxes, namely the RNN and Dense+Softmax, are trainable components. [Williams, page 667, left column, line 5-11] “APIs can act as sensors and return features relevant to the dialog, so these can be added to the feature vector in the next timestep (step 16). If the action is text, it is rendered to the user (step 17), and cycle then repeats. The action taken is provided as a feature to the RNN in the next timestep (step 18)”; the one or more actions are provided to the recurrent neural network).
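
For illustration only, the masking step quoted from Williams (element-wise multiplication followed by renormalization) can be sketched as follows. This is a minimal sketch and not code from Williams; the function name `apply_action_mask` and the example values are assumptions introduced here.

```python
from typing import List, Sequence


def apply_action_mask(probs: Sequence[float], mask: Sequence[int]) -> List[float]:
    """Zero out non-permitted actions, then renormalize to a distribution.

    probs: raw action probabilities (e.g. a softmax output).
    mask:  1 for a permitted action, 0 for a non-permitted action.
    """
    if len(probs) != len(mask):
        raise ValueError("probs and mask must have the same length")
    # Element-wise multiplication forces masked actions to probability zero.
    masked = [p * m for p, m in zip(probs, mask)]
    total = sum(masked)
    if total <= 0.0:
        raise ValueError("mask leaves no permitted action with nonzero probability")
    # Normalize back to a probability distribution.
    return [x / total for x in masked]


# Hypothetical example: three actions, the second one is not permitted.
raw_probs = [0.5, 0.3, 0.2]
action_mask = [1, 0, 1]
recommended = apply_action_mask(raw_probs, action_mask)
```

In this example the second action receives probability zero and the remaining probability mass is shared between the first and third actions in their original ratio.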

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

	Claims 1 and 3 are rejected under 35 U.S.C. 103 over Palanisamy (US 20200033869 A1) in view of Williams (Williams, 2017, “Hybrid Code Networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning”).

Regarding claim 1, Palanisamy teaches a method comprising: obtaining training data of a process, the training data comprising information about a current process state, an action from a plurality of actions applied to the current process state, a next process state obtained by applying the action to the current process state, a reward based on a metric of the process, the reward depending on the current process state, the action, and the future process state ([Palanisamy, 0023] “Each DRL algorithm is configured to process data relating to driving experiences using stochastic gradient updates to train a neutral network … output of the DRL algorithm comprises one or more of: estimated values of state/action/advantage as determined by a state/action/advantage value function; and a policy distribution. Each of the driving policy learner modules further comprises: a learning target module configured to process trajectory steps of a driver agent within a driving environment to compute desired learning targets that are desired to be achieved, where each trajectory step comprises: a state, an observation, an action, a reward, a next-state and a next-observation, and where each learning target represents a result of an action that is desired for a given driving experience”, DRL algorithm used the state, action, reward, next-state, next-observation, result of action to train a neural network, 
[Palanisamy, 0073] “The driver agents can collect driving experiences to create a knowledge base that is stored in an experience memory. The driving policy learner modules can process the collective driving experiences to extract driving policies (or rules) and/or bootstrap new learning paradigms. The driver agents can be trained via the driving policy learner modules in a parallel and distributed manner without having to rely on labelled data or external supervision”, the driver agent can be trained by the driving policy learner, which comprises the states, rewards, etc. [Palanisamy, 0104] “To explain further, in DRL, the agent uses a deep neural network to learn the long-term value of a state/action. The DRL based agent can also use a deep neural network to learn the mappings between state and actions”, teaches that the agent can be a neural network, and teaches obtaining the data. 
[Palanisamy, 0007] “… the data for each driving experience (that represents a particular driving environment at a particular time) comprises: a state of the particular driving environment observed by a corresponding driving environment processor; … a reward comprising: a signal that signifies how desirable an action performed by the driver agent is at a given time under particular environment conditions, wherein the reward is automatically computed based on road rules and driving principles extracted from human driving data or defined using other appropriate methods based on traffic and the road rules”, teaches the detail about the reward, observations, actions, state, next-state, and teaches the metric of reward); and 
a long-term reward comprising the reward and one or more future rewards ([Palanisamy, 0104] “By performing an action, the agent transitions from state to state. Executing an action in a specific state provides the agent with a reward (a numerical score). The neural network uses coefficients to approximate the function relating inputs to outputs, and learns to find the right coefficients, or weights, by iteratively adjusting those weights along gradients that promise less error. The goal of the agent is to maximize its total (future) reward. It does this by adding the maximum reward attainable from the future state to the reward in its current state”, the long term goal of the method is to maximize the reward); and 
Palanisamy teaches training a neural network on the training data to provide a recommended probability of each action from the plurality of actions ([Palanisamy, 0023] “Each DRL algorithm is configured to process data relating to driving experiences using stochastic gradient updates to train a neutral network … output of the DRL algorithm comprises one or more of: estimated values of state/action/advantage as determined by a state/action/advantage value function; and a policy distribution. Each of the driving policy learner modules further comprises: a learning target module configured to process trajectory steps of a driver agent within a driving environment to compute desired learning targets that are desired to be achieved, where each trajectory step comprises: a state, an observation, an action, a reward, a next-state and a next-observation, and where each learning target represents a result of an action that is desired for a given driving experience”, DRL algorithm used the state, action, reward, next-state, next-observation, result of action to train a neural network).
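
For illustration only, the training-data elements mapped above (state, action, reward, next state, and a long-term reward comprising the reward and one or more future rewards) can be sketched as a simple record plus a return computation. This is a minimal sketch under stated assumptions: the names `TrajectoryStep` and `long_term_rewards`, the discount factor `gamma`, and the example values are introduced here and do not appear in Palanisamy or in the claims.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class TrajectoryStep:
    state: Any        # current process state
    action: int       # action applied to the current state
    reward: float     # immediate reward for this step
    next_state: Any   # state reached after applying the action


def long_term_rewards(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Return G_t = r_t + gamma * G_{t+1} for every step t.

    gamma is an assumed discount factor; gamma = 1.0 gives a plain sum of
    the current reward and all future rewards.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical three-step trajectory.
steps = [
    TrajectoryStep("s0", 1, 1.0, "s1"),
    TrajectoryStep("s1", 0, 0.0, "s2"),
    TrajectoryStep("s2", 1, 2.0, "s3"),
]
returns = long_term_rewards([s.reward for s in steps], gamma=0.5)
```

With these values, the long-term reward at each step combines that step's reward with the discounted rewards that follow it.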
Palanisamy does not explicitly teach training on the training data to provide a recommended probability of each action from the plurality of actions, wherein a policy gradient algorithm adjusts a raw action probability output by the neural network to the recommended probability by incorporating domain knowledge of the process.
Williams teaches training a neural network on the training data to provide a recommended probability of each action from the plurality of actions ([Williams, page 666, right column, second paragraph; 6-10 of Figure 1] “The feature components from steps 1-5 are concatenated to form a feature vector (step 6). This vector is passed to an RNN, such as a long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) or gated recurrent unit (GRU) (Chung et al., 2014). The RNN computes a hidden state (vector), which is retained for the next timestep (step 8), and passed to a dense layer with a softmax activation, with output dimension equal to the number of distinct system action templates (step 9). Thus the output of step 9 is a distribution over action templates. Next, the action mask is applied as an element-wise multiplication, and the result is normalized back to a probability distribution (step 10) – this forces non-permitted actions to take on probability zero. From the resulting distribution (step 11), an action is selected (step 12). When RL is active, exploration is required, so in this case an action is sampled from the distribution; when RL is not active, the best action should be chosen, and so the action with the highest probability is always selected”), wherein a policy gradient algorithm adjusts a raw action probability output by the neural network to the recommended probability by incorporating domain knowledge of the process ([Williams, page 671, left column, last paragraph – right column, first paragraph] “For optimization, we selected a policy gradient approach (Williams, 1992), which has been successfully applied to dialog systems (Jurčíček et al., 2011), robotics (Kohl and Stone, 2004), and the board game Go (Silver et al., 2016). In policy gradient-based RL, a model is parameterized by w and outputs a distribution from which actions are sampled at each timestep. 
At the end of a trajectory – in our case, dialog – the return G for that trajectory is computed, and the gradients of the probabilities of the actions taken with respect to the model weights are computed. The weights are then adjusted by taking a gradient step proportional to the return:
$$w \leftarrow w + \alpha \left( \sum_{t} \nabla_{w} \log \pi(a_t \mid h_t; w) \right) (G - b)$$
where $\alpha$ is a learning rate; $a_t$ is the action taken at timestep $t$; $h_t$ is the dialog history at time $t$; $G$ is the return of the dialog; $\nabla_x F$ denotes the Jacobian of $F$ with respect to $x$; $b$ is a baseline described below; and $\pi(a_t \mid h_t; w)$ is the LSTM – i.e., a stochastic policy which outputs a distribution over $a$ given a dialog history $h$, parameterized by weights $w$. The baseline $b$ is an estimate of the average return of the current policy, estimated on the last 100 dialogs using weighted importance sampling.”, discloses the addition of previous knowledge history, [Williams, Abstract, line 6-10; Table 1] “We introduce Hybrid Code Networks (HCNs), which combine an RNN with domain-specific knowledge encoded as software and system action templates”, and [Williams, page 670, right column, second paragraph, line 1-5] “In these domains, we have a further source of knowledge: the rule-based dialog managers themselves can be used to generate example “sunnyday” dialogs, where the user provides purely expected inputs” also discloses the injection of domain knowledge and the source of the knowledge is a dialog).
	Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teachings of Palanisamy and Williams, to use Williams' technique of using the training data to provide a recommended probability of each action in implementing Palanisamy's method of training a reinforcement learning system. The suggestion and/or motivation to do so is to improve the accuracy of the model with limited data, as providing prior knowledge can amplify the limited data ([Williams, page 673, left column, the last paragraph, line 17-19] “and in resource poor settings, providing domain knowledge can amplify limited data”).
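
For illustration only, the policy gradient update quoted from Williams in the claim 1 analysis above, in which the weights move along the summed gradient of log pi(a_t | h_t; w) scaled by (G - b), can be sketched as follows. This is a minimal sketch under stated assumptions: Williams describes an LSTM policy, whereas this sketch substitutes a simple linear-softmax policy so that the gradient can be written by hand; the names `policy` and `policy_gradient_step` and all numeric values are introduced here.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def policy(W: List[List[float]], h: List[float]) -> List[float]:
    """pi(. | h; W): softmax over one linear score per action (assumed form)."""
    return softmax([sum(w * x for w, x in zip(row, h)) for row in W])


def policy_gradient_step(W, histories, actions, G, b, alpha):
    """Return W + alpha * (sum_t grad_W log pi(a_t | h_t; W)) * (G - b)."""
    grad = [[0.0] * len(W[0]) for _ in W]
    for h, a in zip(histories, actions):
        p = policy(W, h)
        for i in range(len(W)):
            # Derivative of log pi(a | h) with respect to the score of action i.
            coeff = (1.0 if i == a else 0.0) - p[i]
            for j in range(len(h)):
                grad[i][j] += coeff * h[j]
    return [
        [W[i][j] + alpha * grad[i][j] * (G - b) for j in range(len(W[0]))]
        for i in range(len(W))
    ]


# Hypothetical one-step trajectory: action 0 was taken and the return beat the baseline.
W0 = [[0.0, 0.0], [0.0, 0.0]]
histories = [[1.0, 0.0]]
actions = [0]
W1 = policy_gradient_step(W0, histories, actions, G=1.0, b=0.0, alpha=0.1)
p_before = policy(W0, histories[0])[0]
p_after = policy(W1, histories[0])[0]
```

Because the return exceeds the baseline in this example, the update raises the probability of the action that was taken; a return below the baseline would lower it.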

Regarding claim 3, Palanisamy in view of Williams teaches wherein the training data further comprises one or more constraints on each action, with each constraint in a form of an action mask ([Williams, page 666, right column, line 13-18] “Thus the output of step 9 is a distribution over action templates. Next, the action mask is applied as an element-wise multiplication, and the result is normalized back to a probability distribution (step 10) – this forces non-permitted actions to take on probability zero”, applying action mask applies constraint to the actions (probability zero for non-permitted actions) ).

	Claims 2 and 4 are rejected under 35 U.S.C. 103 over Palanisamy (US 20200033869 A1) in view of Williams (Williams, 2017, “Hybrid Code Networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning”), and further in view of Kantor (US 20200250279 A1).

Regarding claim 2, Palanisamy in view of Williams teaches the reinforcement learning method with policy gradient method ([Williams, page 671, left column, 6 Reinforcement learning illustration, second paragraph – right column, first paragraph] “For optimization, we selected a policy gradient approach (Williams, 1992), which has been successfully applied to dialog systems (Jurčíček et al., 2011), robotics (Kohl and Stone, 2004), and the board game Go (Silver et al., 2016). In policy gradient-based RL, a model is parameterized by w and outputs a distribution from which actions are sampled at each timestep. At the end of a trajectory – in our case, dialog – the return G for that trajectory is computed, and the gradients of the probabilities of the actions taken with respect to the model weights are computed”).
Palanisamy in view of Williams does not specifically teach wherein the policy gradient incorporates an imaginary long-term reward of an augmented action to adjust the raw probability.
Kantor teaches wherein the policy gradient incorporates an imaginary long-term reward of an augmented action to adjust the raw probability ([Kantor, 0042] “In particular, a decision making problem may be a Markov Decision process (MDP) with finite state and action spaces. In general, a finite MDP may be expressed as a tuple $(X, A, R, D, P, P_0)$ where $X = \{1, \ldots, n, x_{Ter}\}$ and $A = \{1, \ldots, m\}$ are the state and action spaces, respectively, and $x_{Ter}$ is a recurrent terminal state. For a state $x$ and an action $a$, $R(x, a)$ may be a bounded reward function, and $D_1(x, a), \ldots, D_n(x, a)$ the constraints cost function. $P(\cdot \mid x, a)$ may be the transition probability distribution, and $P_0(\cdot)$ to be the initial state distribution. A stationary policy $\mu(\cdot \mid x)$ for an MDP is a probability distribution over actions, conditioned on the current state. In policy gradient methods, such policies can be parameterized by a k-dimensional vector $\theta$, using this notation we can write the space of policies as $\mu(\cdot \mid x; \theta)$, $x \in X$, $\theta \in \mathbb{R}^k$”, teaches the policy gradient method using a reward function and constraint cost functions to adjust the stationary policy, which is a probability distribution over actions).
	Before the effective filing date of the invention, it would have been obvious to a person of ordinary skill in the art, having the teachings of Kantor, Palanisamy and Williams, to use Kantor's method of using constraints and rewards in a policy gradient model in implementing the method of training a reinforcement learning system of Williams and Palanisamy. The suggestion and/or motivation to do so is to improve the accuracy of the reinforcement model.

Regarding claim 4, Palanisamy in view of Williams teaches the reinforcement learning method with policy gradient method ([Williams, page 671, left column, 6 Reinforcement learning illustration, second paragraph – right column, first paragraph] “For optimization, we selected a policy gradient approach (Williams, 1992), which has been successfully applied to dialog systems (Jurčíček et al., 2011), robotics (Kohl and Stone, 2004), and the board game Go (Silver et al., 2016). In policy gradient-based RL, a model is parameterized by w and outputs a distribution from which actions are sampled at each timestep. At the end of a trajectory – in our case, dialog – the return G for that trajectory is computed, and the gradients of the probabilities of the actions taken with respect to the model weights are computed”).
Palanisamy in view of Williams does not specifically teach wherein the policy gradient incorporates an imaginary long-term reward of an augmented action and the one or more constraints to adjust the raw probability.
Kantor teaches wherein the policy gradient incorporates an imaginary long-term reward of an augmented action and the one or more constraints to adjust the raw probability ([Kantor, 0042] “In particular, a decision making problem may be a Markov Decision process (MDP) with finite state and action spaces. In general, a finite MDP may be expressed as a tuple $(X, A, R, D, P, P_0)$ where $X = \{1, \ldots, n, x_{Ter}\}$ and $A = \{1, \ldots, m\}$ are the state and action spaces, respectively, and $x_{Ter}$ is a recurrent terminal state. For a state $x$ and an action $a$, $R(x, a)$ may be a bounded reward function, and $D_1(x, a), \ldots, D_n(x, a)$ the constraints cost function. $P(\cdot \mid x, a)$ may be the transition probability distribution, and $P_0(\cdot)$ to be the initial state distribution. A stationary policy $\mu(\cdot \mid x)$ for an MDP is a probability distribution over actions, conditioned on the current state. In policy gradient methods, such policies can be parameterized by a k-dimensional vector $\theta$, using this notation we can write the space of policies as $\mu(\cdot \mid x; \theta)$, $x \in X$, $\theta \in \mathbb{R}^k$”, teaches the policy gradient method using a reward function and constraint cost functions to adjust the stationary policy, which is a probability distribution over actions).

Claim 5 is rejected under 35 U.S.C. 103 over Palanisamy (US 20200033869 A1) in view of Williams (Williams, 2017, “Hybrid Code Networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning”) and further in view of Jiang (CN 104899463 A).

Regarding claim 5, Palanisamy in view of Williams teaches predicting one or more recommended probabilities for each action relating to operation ([Williams, page 666, right column, second paragraph; 6-10 of Figure 1] “The feature components from steps 1-5 are concatenated to form a feature vector (step 6). This vector is passed to an RNN, such as a long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) or gated recurrent unit (GRU) (Chung et al., 2014). The RNN computes a hidden state (vector), which is retained for the next timestep (step 8), and passed to a dense layer with a softmax activation, with output dimension equal to the number of distinct system action templates (step 9). Thus the output of step 9 is a distribution over action templates. Next, the action mask is applied as an element-wise multiplication, and the result is normalized back to a probability distribution (step 10) – this forces non-permitted actions to take on probability zero. From the resulting distribution (step 11), an action is selected (step 12). When RL is active, exploration is required, so in this case an action is sampled from the distribution; when RL is not active, the best action should be chosen, and so the action with the highest probability is always selected”).
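For illustration only, steps 10 through 12 of the quoted Williams passage (masking, renormalization, and action selection) can be sketched as follows. This is a minimal sketch, not code from Williams; the names `apply_action_mask`, `select_action`, and `rl_active` are hypothetical, and the mask is assumed to hold 1 for permitted actions and 0 for non-permitted ones.

```python
import random

def apply_action_mask(probs, mask):
    """Element-wise multiply by the mask, then renormalize (step 10)."""
    masked = [p * m for p, m in zip(probs, mask)]
    total = sum(masked)
    if total == 0:
        raise ValueError("the mask removes every action")
    return [p / total for p in masked]

def select_action(probs, mask, rl_active, rng=random):
    """Pick an action from the masked distribution (steps 11 and 12)."""
    dist = apply_action_mask(probs, mask)
    if rl_active:
        # Exploration: sample from the distribution.
        return rng.choices(range(len(dist)), weights=dist, k=1)[0]
    # No exploration: take the most probable action.
    return max(range(len(dist)), key=dist.__getitem__)
```

Because a masked action has probability zero after renormalization, it is never chosen in either mode.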
Palanisamy in view of Williams does not specifically teach wherein the process is a blast furnace process for production of molten steel, the metric is a chemical composition metric of the molten steel, and each action relates to operation of a fuel injection rate of the blast furnace. 
Jiang teaches wherein the process is a blast furnace process for production of molten steel, the metric is a chemical composition metric of the molten steel, and each action relates to operation of a fuel injection rate of the blast furnace ([Jiang, page 4, line 33-43] “performing the correlation analysis, changes in correlation with the content of molten iron silicon which strong variable as the input variable of the model of the invention, molten iron silicon content as an output variable to all collected influence molten iron silicon content change of variable and molten iron silicon content between. Because the complex physical and chemical reaction in the blast furnace, which indirectly influence factor with many molten iron silicon content, comprising a distributing way at the upper part, material properties, control parameter of the part such as the wind quantity, wind temperature and so on. The invention when building the model, the influence of molten iron silicon content of strong change of variables as the model input variables, the hot metal silicon content as the model output variable. change the solid raw material silicon content in molten iron, comprising an iron ore, sintered ore and coke; the gaseous material to be heated, comprising air and some auxiliary fuel, and the variation of the wind quantity, wind temperature of lower part parameter has close relationship. Table 1 lists the 20 variables to be selected”; Jiang analyzes the correlation between the silicon content of molten iron and 20 candidate variables, including wind quantity, wind temperature, and auxiliary fuel).
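For illustration only, the kind of correlation-based input selection described in the quoted Jiang passage can be sketched as follows. This is a sketch under stated assumptions, not Jiang's disclosed procedure: it assumes Pearson correlation as the measure and a cutoff of 0.5, neither of which is stated in the quoted passage, and the names `pearson`, `select_inputs`, and `threshold` are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series.

    Assumes neither series is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def select_inputs(candidates, target, threshold=0.5):
    """Keep candidate variables strongly correlated with the target.

    candidates -- dict mapping a variable name to its measured series
    target     -- series of the output variable (e.g. silicon content)
    threshold  -- assumed cutoff on the absolute correlation
    """
    return [name for name, values in candidates.items()
            if abs(pearson(values, target)) >= threshold]
```

Variables that pass the cutoff would serve as model inputs, with the silicon content of the molten iron as the model output.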
	It would have been obvious to a person of ordinary skill in the art, before the effective filing date of the invention and having the teachings of Jiang, Palanisamy and Williams, to use the data from the blast furnace process for production of molten steel of Jiang to implement the method of training a reinforcement learning system of Williams and Palanisamy. The suggestion and/or motivation to do so is to optimize the molten steel production system by using a reinforcement learning system.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure.
Regarding Reinforcement Learning.
US 20200257968 A1
US 10424302 B2
US 20190286979 A1
US 20200033869 A1
Any inquiry concerning this communication or earlier communications from the examiner
should be directed to JUN KWON whose telephone number is (571)272-2072. The examiner can
normally be reached on 7:30 AM - 5:30 PM. If attempts to reach the examiner by telephone are
unsuccessful, the examiner’s supervisor, Abdullah Al Kawsar, can be reached on (571)270-3169. The fax
phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application
Information Retrieval (PAIR) system. Status information for published applications may be obtained
from either Private PAIR or Public PAIR. Status information for unpublished applications is available
through Private PAIR only. For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic
Business Center (EBC) at 866-217-9197 (toll-free).
/JUN KWON/
Patent Examiner, Art Unit 2127
/ABDULLAH AL KAWSAR/
Supervisory Patent Examiner, Art Unit 2127