Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
This is the initial Office action issued in response to patent application 16/542,328, filed on 08/16/2019. Claims 1-20, as originally filed, are currently pending and have been considered below. Claims 1 and 13 are independent claims.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Regarding claim 1,
Claim 1 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 1 is directed to a method, which is directed to a process, one of the statutory categories. See MPEP 2106.03.
Step 2A Prong One Analysis: Each of the following limitation(s):     
… select at least one symptom inquiry action;  
… select at least one medical test action from candidate test actions according to the initial symptom and the at least one symptom answer;  
… select a result prediction action from candidate prediction actions according to the initial symptom, the at least one symptom answer and the at least one test result
 
as drafted, under its broadest reasonable interpretation, covers mental processes corresponding to an evaluation or judgement.
Step 2A Prong Two Analysis: This judicial exception is not integrated into a practical application. In particular, the claim only recites additional elements that are mere instructions to implement an abstract idea on a computer, or merely uses a computer or other machinery as a tool to perform an abstract idea. See MPEP 2106.05(f). The additional element of “utilizing a neural network model to”, as drafted, recites a generic computer component at a high level of generality (i.e., as a generic computer component performing a generic computer function) such that it amounts to no more than mere instructions to apply the exception using a generic computer component or other machinery. Further, the limitations of “receiving an initial symptom”, “receiving at least one symptom answer in response to the at least one symptom inquiry action”, and “receiving at least one test result of the at least one medical test action” can be considered mere data gathering. See MPEP 2106.05(g). Accordingly, the additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element of using generic computer components to perform the abstract idea amounts to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. Further, the insignificant extra-solution activities of “receiving an initial symptom”, “receiving at least one symptom answer in response to the at least one symptom inquiry action”, and “receiving at least one test result of the at least one medical test action” are considered well-understood, routine, and conventional in view of MPEP 2106.05(d)(II): “The courts have recognized the following computer functions as well-understood, routine, and conventional functions when they are claimed in a merely generic manner (e.g., at a high level of generality) or as insignificant extra-solution activity... i. Receiving or transmitting data over a network, e.g., using the Internet to gather data”. Therefore, these additional elements do not amount to significantly more. The claim is not patent eligible.
Regarding claim 13,
Claim 13 is rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more.
Step 1 Analysis: Claim 13 is directed to a system, which is directed to a machine, one of the statutory categories. See MPEP 2106.03.
Step 2A Prong One Analysis: Each of the following limitation(s):       
… select at least one symptom inquiry action according to the initial symptom;  
… select at least one medical test action from candidate test actions according to the initial symptom and the at least one symptom answer,  
… select a result prediction action from candidate prediction actions according to the initial symptom, the at least one symptom answer and the at least one test result
 
as drafted, under its broadest reasonable interpretation, covers mental processes corresponding to an evaluation or judgement.
Step 2A Prong Two Analysis: This judicial exception is not integrated into a practical application. In particular, the claim only recites additional elements that are mere instructions to implement an abstract idea on a computer, or merely uses a computer or other machinery as a tool to perform an abstract idea. See MPEP 2106.05(f). The additional elements of “a decision agent interacting with the interaction system”, “a neural network model, utilized by the decision agent to”, and “wherein the neural network model is utilized by the decision agent to”, as drafted, recite generic computer components at a high level of generality (i.e., as a generic computer component performing a generic computer function) such that they amount to no more than mere instructions to apply the exception using a generic computer component or other machinery. Further, the limitations of “an interaction system, configured for receiving an initial symptom”, “wherein the interaction system is configured to receive at least one symptom answer in response to the at least one symptom inquiry action”, and “wherein the interaction system is configured to receive at least one test result of the at least one medical test action” can be considered mere data gathering. See MPEP 2106.05(g). Accordingly, the additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea. The claim is directed to an abstract idea.
Step 2B Analysis: The claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to integration of the abstract idea into a practical application, the additional element of using generic computer components to perform the abstract idea amounts to no more than mere instructions to apply the exception using a generic computer component. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. Further, the insignificant extra-solution activities of “an interaction system, configured for receiving an initial symptom”, “wherein the interaction system is configured to receive at least one symptom answer in response to the at least one symptom inquiry action”, and “wherein the interaction system is configured to receive at least one test result of the at least one medical test action” are considered well-understood, routine, and conventional in view of MPEP 2106.05(d)(II): “The courts have recognized the following computer functions as well-understood, routine, and conventional functions when they are claimed in a merely generic manner (e.g., at a high level of generality) or as insignificant extra-solution activity... i. Receiving or transmitting data over a network, e.g., using the Internet to gather data”. Therefore, these additional elements do not amount to significantly more. The claim is not patent eligible.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Shousha et al. (US10468142B1) in view of Graepel et al. (US20180032864A1).
Regarding Claim 1,
Shousha et al. teaches a control method, suitable for a medical system, the control method comprising (Shousha et al., Col. 10 Lines 58-60, “AI-based systems and methods for corneal diagnosis, such as assisting in the diagnosis or treatment of corneal disease or conditions” teaches methods and systems for corneal diagnosis). 
receiving an initial symptom (Shousha et al., Col. 11 Lines 25-32, “the system 10 receives data input which is processed through an AI models 12 to generate an output prediction… The data input may include various data such as color and/or high resolution images of the cornea or anterior segment” teaches receiving various data of the cornea or anterior segment as input data (corresponds to initial symptom)). 
Shousha et al. does not appear to explicitly teach utilizing a neural network model to select at least one symptom inquiry action; receiving at least one symptom answer in response to the at least one symptom inquiry action; utilizing the neural network model to select at least one medical test action from candidate test actions according to the initial symptom and the at least one symptom answer; receiving at least one test result of the at least one medical test action; and utilizing the neural network model to select a result prediction action from candidate prediction actions according to the initial symptom, the at least one symptom answer and the at least one test result.
However, Graepel et al., teaches utilizing a neural network model to select at least one symptom inquiry action (Graepel et al., FIG. 2 and Para. [0054], “FIG. 2 is a flow diagram of an example process 200 for training a collection of neural networks for use in selecting actions” teaches utilizing neural networks to select actions (corresponds to at least one symptom inquiry action)).
receiving at least one symptom answer in response to the at least one symptom inquiry action (Graepel et al., Para. [0015], “the agent interacts with the environment in order to complete one or more objectives and the reinforcement learning system selects actions in order to maximize the objectives, as represented by numeric rewards received by the reinforcement learning system in response to actions performed by the agent” teaches one or more objectives (corresponds to at least one symptom answer) received for completion, in response to the selected actions).
utilizing the neural network model to select at least one medical test action from candidate test actions according to the initial symptom and the at least one symptom answer (Graepel et al., Para. [0014], “the reinforcement learning system receives data characterizing the current state of the environment and selects an action to be performed by the agent from a set of actions in response to the received data. Once the action has been selected by the reinforcement learning system, the agent performs the action to interact with the environment” teaches selecting an action (corresponds to selecting at least one medical test action) from a set of actions (corresponds to candidate test actions) according to the received data (corresponds to the initial symptom) and the performance objective (corresponds to the at least one symptom answer)).
receiving at least one test result of the at least one medical test action (Graepel et al., Para. [0018], “the actions in the set of actions are possible medical treatments for the patient and the objectives can include one or more of maintaining a current health of the patient, improving the current health of the patient, minimizing medical expenses for the patient, and so on” teaches the objectives (corresponds to the at least one test result) of the possible medical treatments (corresponds to the at least one medical test action) for the patient).
utilizing the neural network model to select a result prediction action from candidate prediction actions according to the initial symptom, the at least one symptom answer and the at least one test result (Graepel et al., Para. [0066], “the system trains the RL policy neural network to adjust the values of the parameters of the RL policy neural network to generate action probabilities that represent, for each action, the likelihood that for each action, a predicted likelihood that a long-term reward that will be received will be maximized if the action is performed by the agent in response to the observation instead of any other action in the set of possible action” teaches utilizing a neural network to select an action probability prediction (corresponds to a result prediction action) from a plurality of actions according to the observation (corresponds to the initial symptom), the objectives, and performance objective).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to apply a reinforcement learning system for a medical course of action, as taught by Graepel et al., to the control method, suitable for a medical system, of Shousha et al. The motivation would have been to maximize the objectives (Graepel et al., Para. [0015], “the agent interacts with the environment in order to complete one or more objectives and the reinforcement learning system selects actions in order to maximize the objectives, as represented by numeric rewards received by the reinforcement learning system in response to actions performed by the agent”).
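For illustration only, the order of steps recited in claim 1 can be sketched as follows. The sketch is not taken from the application, from Shousha et al., or from Graepel et al.; `score_fn` is a hypothetical stand-in for the claimed neural network model, `answer_fn` is a hypothetical source of symptom answers and test results, and every name below is an assumption made for readability.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical stand-in for the claimed neural network model: it maps the
# information gathered so far and one candidate action to a score.
ScoreFn = Callable[[Dict[str, str], str], float]
# Hypothetical source of symptom answers and test results.
AnswerFn = Callable[[str], str]


def select(score_fn: ScoreFn, state: Dict[str, str], candidates: Sequence[str]) -> str:
    """Return the highest-scoring candidate action for the current state."""
    return max(candidates, key=lambda action: score_fn(state, action))


def run_episode(
    initial_symptom: str,
    inquiries: Sequence[str],
    tests: Sequence[str],
    predictions: Sequence[str],
    score_fn: ScoreFn,
    answer_fn: AnswerFn,
) -> List[str]:
    """Follow the claimed order: inquiry, answer, test, result, prediction."""
    state = {"initial_symptom": initial_symptom}

    # Select a symptom inquiry action and receive the symptom answer.
    inquiry = select(score_fn, state, inquiries)
    state[inquiry] = answer_fn(inquiry)

    # Select a medical test action from the candidate test actions,
    # based on the initial symptom and the symptom answer.
    test = select(score_fn, state, tests)
    state[test] = answer_fn(test)

    # Select a result prediction action from the candidate predictions,
    # based on the initial symptom, the answer, and the test result.
    prediction = select(score_fn, state, predictions)
    return [inquiry, test, prediction]
```

The sketch selects exactly one inquiry and one test for brevity; the claim language ("at least one") permits more than one of each.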
Regarding Claim 2,
The Shousha et al. in view of Graepel et al. combination of claim 1 teaches the control method as claimed in claim 1, wherein the control method comprises:
The combination, as described in the rejection of claim 1, further teaches obtaining training data comprising a plurality of medical records, each one of the medical records comprising a diagnosed disease and a plurality of medical test results performed for diagnosing the diagnosed disease (Graepel et al., Para. [0056], “The labeled training data for the SL policy neural network includes multiple training observations and, for each training observation, an action label that identifies an action that was performed in response to the training observation” teaches the training data including training observations. Para. [0026], “when the environment 104 is a patient diagnosis environment, the observations may be data from an electronic medical record of a current patient” teaches the observations comprising the medical record of patients. Para. [0018], “the environment may be a patient diagnosis environment such that each state is a respective patient state of a patient, i.e., as reflected by health data characterizing the health of the patient” teaches the environment being patient diagnosis environment with state of a patient (corresponds to a diagnosed disease and a plurality of medical test results performed for diagnosing the diagnosed disease)).
utilizing the neural network model to select the at least one medical test action from the candidate test actions and to select the result prediction action from the candidate prediction actions according to the training data (Graepel et al., Para. [0014], “the reinforcement learning system receives data characterizing the current state of the environment and selects an action to be performed by the agent from a set of actions in response to the received data. Once the action has been selected by the reinforcement learning system, the agent performs the action to interact with the environment” teaches selecting an action (corresponds to selecting at least one medical test action) from a set of actions (corresponds to candidate test actions) according to the received data (corresponds to training data). Para. [0066], “the system trains the RL policy neural network to adjust the values of the parameters of the RL policy neural network to generate action probabilities that represent, for each action, the likelihood that for each action, a predicted likelihood that a long-term reward that will be received will be maximized if the action is performed by the agent in response to the observation instead of any other action in the set of possible action” teaches utilizing a neural network to select an action probability prediction (corresponds to a result prediction action) from a plurality of actions according to the observation (corresponds to training data)).
providing a test cost penalty according to the at least one medical test action (Graepel et al., Para. [0076], “The system then trains the value neural network on the training observations using supervised learning to determine trained values of the parameters of the value neural network from initial values of the parameters of the neural network. For example, the system can train the value neural network using asynchronous gradient descent to minimize the mean squared error between the value scores and the actual long-term reward received” teaches the mean squared error (corresponds to a test cost penalty) according to the scores of the training observations (corresponds to the at least one medical test action)).
providing a test abnormality reward according to the medical test results in the training data corresponding to the at least one medical test action (Graepel et al., Para. [0067], “the system completes an episode of interaction of the agent while the actions were being selected using the RL policy neural network and then generates a long-term reward for the episode. The system generates the long-term reward based on the outcome of the episode, i.e., on whether the objectives were completed during the episode. For example, the system can set the reward to one value if the objectives were completed and to another, lower value if the objectives were not completed” teaches a long-term reward (corresponds to a test abnormality reward) according to whether the objectives were completed (corresponds to the medical test results in the training data), based on the actions (corresponds to the at least one medical test action)).
providing a prediction reward according to a comparison between the result prediction action and the diagnosed disease in the medical records (Graepel et al., Para. [0071], “the system trains the value neural network to generate a value score for a given state of the environment that represents the predicted long-term reward resulting from the environment being in the state by adjusting the values of the parameters of the value neural network” teaches the predicted long-term reward (corresponds to a prediction reward) based on the value score for a given state of the environment (corresponds to a comparison between the result prediction action and the diagnosed disease in the medical records)). 
training the neural network model to maximize cumulative rewards in reference with the test abnormality reward, the prediction reward and the test cost penalty (Graepel et al., Para. [0066], “the system trains the RL policy neural network to adjust the values of the parameters of the RL policy neural network to generate action probabilities that represent, for each action, the likelihood that for each action, a predicted likelihood that a long-term reward that will be received will be maximized if the action is performed by the agent in response to the observation instead of any other action in the set of possible actions. Generally, the long-term reward is a numeric value that is dependent on the degree to which the one or more objectives are completed during interaction of the agent with the environment” teaches training the neural network to maximize the long-term reward received (corresponds to maximize cumulative rewards)).
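For illustration only, the three reward terms recited in claim 2 can be sketched as a single cumulative reward. The additive form, the numeric bonuses, the per-test cost table, and the record layout are all assumptions introduced here; neither the claim nor Graepel et al. specifies them.

```python
from typing import Dict, Sequence


def cumulative_reward(
    selected_tests: Sequence[str],
    predicted_disease: str,
    record: Dict,
    test_costs: Dict[str, float],
    abnormal_bonus: float = 0.5,  # assumed value, for illustration
    correct_bonus: float = 1.0,   # assumed value, for illustration
) -> float:
    """Combine the three claimed terms into one number (assumed additive)."""
    # Test cost penalty: depends only on which tests were selected.
    cost_penalty = -sum(test_costs[test] for test in selected_tests)

    # Test abnormality reward: granted for each selected test whose result
    # in the training record is abnormal.
    abnormal_count = sum(
        1 for test in selected_tests
        if record["test_results"].get(test) == "abnormal"
    )
    abnormality_reward = abnormal_bonus * abnormal_count

    # Prediction reward: compares the prediction with the diagnosed disease.
    matches = predicted_disease == record["diagnosed_disease"]
    prediction_reward = correct_bonus if matches else 0.0

    return cost_penalty + abnormality_reward + prediction_reward
```

Under these assumptions, a training procedure that maximizes this quantity is rewarded for ordering informative tests and predicting correctly, and penalized for ordering costly tests.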
Regarding Claim 3,
The Shousha et al. in view of Graepel et al. combination of claim 2 teaches the control method as claimed in claim 2,  
The combination, as described in the rejection of claim 2, further teaches wherein the each one of the medical records further comprises a plurality of diagnosed symptoms related to the diagnosed disease (Graepel et al., Para. [0026], “when the environment 104 is a patient diagnosis environment, the observations may be data from an electronic medical record of a current patient” teaches the observations comprising the medical record of patients. Para. [0018], “the environment may be a patient diagnosis environment such that each state is a respective patient state of a patient, i.e., as reflected by health data characterizing the health of the patient” teaches the environment being patient diagnosis environment with state of a patient (corresponds to a plurality of diagnosed symptoms related to the diagnosed disease)).
the neural network model is further utilized for selecting a plurality of symptom inquiry actions before the medical test action and the result prediction action (Graepel et al., FIG. 3 and Para. [0079], “prior to selecting the action to be performed by the agent in response to the current observation, the system searches or continues to search the state tree until an action is to be selected (step 306). That is, in some implementations, the system is allotted a certain time period after receiving the observation to select an action. In these implementations, the system continues performing searches as described below with reference to FIG. 4, starting from the current node in the state tree until the allotted time period elapses. The system can then update the state tree and the edge data based on the searches before selecting an action in response to the current observation. In some of these implementations, the system searches or continues searching only if the edge data indicates that the action to be selected may be modified as a result of the additional searching” teaches selecting actions based on the observation. A search is performed until an action is to be selected (corresponds to selecting a plurality of symptom inquiry actions before), and an action is then selected using the edge data (corresponds to the medical test action and the result prediction action)).
Regarding Claim 4,
The Shousha et al. in view of Graepel et al. combination of claim 3 teaches the control method as claimed in claim 3, further comprising:
The combination, as described in the rejection of claim 3, further teaches determining a first input state comprising symptom inquiry answers of the symptom inquiry actions, wherein the symptom inquiry answers are determined according to the diagnosed symptoms in the medical record of the training data (Graepel et al., Para. [0050], “the action selection subsystem 120 repeatedly performs searches of the state tree to update the tree and edge data. Performing a search of the state tree to update the state tree” teaches repeatedly determining the state (corresponds to determining a first input state). Para. [0022], “the reinforcement learning system 100 receives observations, with each observation being data characterizing a respective state of the environment 104” teaches receiving observations data characterizing a respective state (corresponds to first input state comprising symptom inquiry answers of the symptom inquiry actions). Para. [0026], “when the environment 104 is a patient diagnosis environment, the observations may be data from an electronic medical record of a current patient” teaches the observations (corresponds to the symptom inquiry answers) are determined based on the patient’s medical record).
selecting the at least one medical test action according to the first input state (Graepel et al., Para. [0014], “the reinforcement learning system receives data characterizing the current state of the environment and selects an action to be performed by the agent from a set of actions in response to the received data. Once the action has been selected by the reinforcement learning system, the agent performs the action to interact with the environment” teaches selecting an action (corresponds to selecting at least one medical test action) from a set of actions according to the received data characterizing the current state (corresponds to the first input state)).
Regarding Claim 5,
The Shousha et al. in view of Graepel et al. combination of claim 4 teaches the control method as claimed in claim 4, further comprising: 
The combination, as described in the rejection of claim 4, further teaches determining a second input state comprising the symptom inquiry answers and at least one medical test answer corresponding to the at least one medical test action (Graepel et al., Para. [0050], “the action selection subsystem 120 repeatedly performs searches of the state tree to update the tree and edge data. Performing a search of the state tree to update the state tree” teaches repeatedly determining the state (corresponds to determining a second input state). Para. [0022], “the reinforcement learning system 100 receives observations, with each observation being data characterizing a respective state of the environment 104” teaches receiving observations data characterizing a respective state (corresponds to second input state comprising symptom inquiry answers of the symptom inquiry actions). Para. [0026], “when the environment 104 is a patient diagnosis environment, the observations may be data from an electronic medical record of a current patient” teaches the observations (corresponds to the symptom inquiry answers) are determined based on the patient’s medical record).
selecting the result prediction action according to the second input state (Graepel et al., Para. [0066], “the system trains the RL policy neural network to adjust the values of the parameters of the RL policy neural network to generate action probabilities that represent, for each action, the likelihood that for each action, a predicted likelihood that a long-term reward that will be received will be maximized if the action is performed by the agent in response to the observation instead of any other action in the set of possible action” teaches utilizing a neural network to select an action probability prediction (corresponds to a result prediction action) from a plurality of actions according to the observation (corresponds to the second input state)).
Regarding Claim 6,
The Shousha et al. in view of Graepel et al. combination of claim 4 teaches the control method as claimed in claim 4, 
The combination, as described in the rejection of claim 4, further teaches wherein a combination of medical test actions are selected from the candidate test actions simultaneously according to the first input state (Graepel et al., Para. [0014], “the reinforcement learning system receives data characterizing the current state of the environment and selects an action to be performed by the agent from a set of actions in response to the received data. Once the action has been selected by the reinforcement learning system, the agent performs the action to interact with the environment” teaches selecting an action (corresponds to selecting at least one medical test action) from a set of actions (corresponds to candidate test actions) according to the received data and the performance objective (corresponds to the first input state)).
Regarding Claim 7,
The Shousha et al. in view of Graepel et al. combination of claim 6 teaches the control method as claimed in claim 6, further comprising:
The combination, as described in the rejection of claim 6, further teaches generating probability values of the candidate test actions and complement probability values of the candidate test actions by the neural network model according to the first input state (Graepel et al., Para. [0066], “the system trains the RL policy neural network to adjust the values of the parameters of the RL policy neural network to generate action probabilities that represent, for each action, the likelihood that for each action… in response to the observation instead of any other action in the set of possible actions” teaches generating the values of the action probabilities (corresponds to probability values of the candidate test actions) in response to the observation (corresponds to the candidate test actions) of the neural network).
determining a plurality of weights of all combinations of the candidate test actions according to the probability values and the complement probability values (Graepel et al., Para. [0071], “the system trains the value neural network to generate a value score for a given state of the environment that represents the predicted long-term reward resulting from the environment being in the state by adjusting the values of the parameters of the value neural network” teaches determining the values of the parameters (corresponds to a plurality of weights of all combinations) of the environment observation (corresponds to the candidate test actions) according to value score (corresponds to the probability values and the complement probability values)). 
selecting the combination of medical test actions from the all combinations of the candidate test actions in reference with the weights (Graepel et al., Para. [0014], “the reinforcement learning system receives data characterizing the current state of the environment and selects an action to be performed by the agent from a set of actions in response to the received data. Once the action has been selected by the reinforcement learning system, the agent performs the action to interact with the environment” teaches selecting an action (corresponds to selecting at least one medical test action) from a set of actions (corresponds to candidate test actions) according to the received data and the performance objective. Para. [0053], “selects the action to be performed by the agent 102 using the current edge data for the edges that are outgoing from the node in the state tree that represents the state characterized by the observation” teaches selecting actions based on the edge data (corresponds to the weights)).
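For illustration only, one way the weights recited in claim 7 could be computed from probability values and complement probability values is sketched below. The product rule (multiply p for each included test and 1 - p for each excluded test) and the choice of the highest-weight combination are assumptions introduced here; they are not stated in the claim as quoted above or in Graepel et al., and all names are hypothetical.

```python
from itertools import product
from typing import Dict, Tuple


def combination_weights(probabilities: Dict[str, float]) -> Dict[Tuple[str, ...], float]:
    """Assign a weight to every subset of the candidate test actions."""
    tests = list(probabilities)
    weights: Dict[Tuple[str, ...], float] = {}
    # Each mask marks which candidate tests are included in one combination.
    for mask in product((False, True), repeat=len(tests)):
        weight = 1.0
        for test, included in zip(tests, mask):
            p = probabilities[test]
            # Probability value if included, complement value if excluded.
            weight *= p if included else (1.0 - p)
        chosen = tuple(t for t, included in zip(tests, mask) if included)
        weights[chosen] = weight
    return weights


def select_combination(probabilities: Dict[str, float]) -> Tuple[str, ...]:
    """Pick the combination with the largest weight (assumed selection rule)."""
    weights = combination_weights(probabilities)
    return max(weights, key=weights.get)
```

Under the assumed product rule the weights over all combinations sum to one, so they could equally be used as sampling probabilities instead of taking the maximum.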
Regarding Claim 8,
The Shousha et al. in view of Graepel et al. combination of claim 3 teaches the control method as claimed in claim 3,
The combination, as described in the rejection of claim 3, further teaches wherein the neural network model comprises a common neural network portion, a first branch neural network portion, a second branch neural network portion and a third branch neural network portion, wherein the first branch neural network portion, the second branch neural network portion and the third branch neural network portion are respectively connected to the common neural network portion (Graepel et al., FIG. 1 and Para. [0028], “the reinforcement learning system 100 selects actions using a collection of neural networks that includes at least one policy neural network, e.g., a supervised learning (SL) policy neural network 140, a reinforcement learning (RL) policy neural network 150, or both, a value neural network 160, and, optionally, a fast rollout neural network 130” teaches the reinforcement system comprising of multiple neural network portions that are connected).
wherein a first result state generated by the first branch neural network portion is utilized to select the symptom inquiry actions, a second result state generated by the second branch neural network portion is utilized to select the at least one medical test action, and a third result state generated by the third branch neural network portion is utilized to select the result prediction action (Graepel et al., Para. [0094], “the system receives rollout data characterizing the state and processes the rollout data using the fast rollout policy neural network that has been trained to receive the rollout data to generate a respective rollout action probability for each action in the set of possible actions. In some implementations, the system then selects the action having a highest rollout action probability as the action to be performed by the agent in response to the rollout data characterizing the state” teaches a rollout policy neural network is utilized to select actions that are performed by the agent (corresponds to the symptom inquiry actions), based on rollout data characterizing the state (corresponds to a first result state). Para. [0014], “the reinforcement learning system receives data characterizing the current state of the environment and selects an action to be performed by the agent from a set of actions in response to the received data. Once the action has been selected by the reinforcement learning system, the agent performs the action to interact with the environment” teaches selecting an action (corresponds to selecting at least one medical test action) from a set of actions according to the received data characterizing the current state of the environment (corresponds to the second result state). Para. [0066], “the system trains the RL policy neural network to adjust the values of the parameters of the RL policy neural network to generate action probabilities that represent, for each action, the likelihood that for each action, a predicted likelihood that a long-term reward that will be received will be maximized if the action is performed by the agent in response to the observation instead of any other action in the set of possible actions” teaches utilizing a neural network to select an action probability prediction (corresponds to a result prediction action) from a plurality of actions according to the observation (corresponds to the third result state)).
Regarding Claim 9,
The Shousha et al. in view of Graepel et al. combination of claim 8 teaches the control method as claimed in claim 8,
The combination, as described in the rejection of claim 8, further teaches wherein the first branch neural network portion and the third branch neural network portion adopt first activation functions, and the second branch neural network portion adopts a second activation function different from the first activation functions (Shousha et al., Col.  20 Lines 41-43, “the AI model 12 may employ deep learning models comprising one or more neural networks” teaches multiple neural network (corresponding to branch neural network portions). Col. 21 Lines 44-62, “these layers identified herein may include such layers that apply activation functions such as ReLU, TanH, or sigmoid …an output layer includes a classifier that applies an activation function such as regression/logical regression, linear, SVM, or SoftMax” teaches different activation functions).
Regarding Claim 10,
The Shousha et al. in view of Graepel et al. combination of claim 9 teaches the control method as claimed in claim 9,
The combination, as described in the rejection of claim 9, further teaches wherein the first activation function is a Softmax function (Shousha et al., Col. 21 Lines 60-62, “an output layer includes a classifier that applies an activation function such as… SoftMax” teaches the activation in a deep neural network being a softmax function).
the second activation function is a Sigmoid function (Shousha et al., Col. 21 Lines 42-46, “Layers to increase nonlinearity include activation functions and will be associated with layers such as convolutional layers or fully connected layers. Thus, these layers identified herein may include such layers that apply activation functions such as… sigmoid” teaches the activation in a deep neural network being a sigmoid function).
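For illustration only, and forming no part of the mapping above, the architecture recited in claims 8-10 can be read as a shared portion feeding three output branches, with Softmax on the first and third branches and Sigmoid on the second. The sketch below is a hypothetical reading of the claim language, not an implementation taken from Shousha et al. or Graepel et al. The function names, the weight matrices, and the ReLU applied in the common portion are all assumptions introduced here.

```python
import math

def softmax(xs):
    """Normalize scores into probabilities that sum to 1 (one choice among many)."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(xs):
    """Map each score to an independent probability in (0, 1)."""
    return [1.0 / (1.0 + math.exp(-x)) for x in xs]

def linear(weights, x):
    """Multiply a weight matrix (list of rows) by an input vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def forward(x, common_w, inquiry_w, test_w, predict_w):
    # Common portion; the ReLU here is an assumption, the claims do not specify it.
    hidden = [max(0.0, h) for h in linear(common_w, x)]
    return {
        "inquiry": softmax(linear(inquiry_w, hidden)),     # first branch, Softmax
        "test": sigmoid(linear(test_w, hidden)),           # second branch, Sigmoid
        "prediction": softmax(linear(predict_w, hidden)),  # third branch, Softmax
    }
```

Under this reading, the Softmax branches each yield a single distribution over mutually exclusive actions, while the Sigmoid branch yields one independent probability per candidate test, which would permit several tests to be selected together.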
Regarding Claim 11,
The Shousha et al. in view of Graepel et al. combination of claim 2 teaches the control method as claimed in claim 2, further comprising:
The combination, as described in the rejection of claim 2, further teaches providing a label-guided exploration probability (Graepel et al., Para. [0038], “When used by the system 100 in selecting actions, the neural network training subsystem 110 trains the fast rollout neural network 130 and the SL policy neural network 140 on labeled training data using supervised learning and trains the RL policy neural network 150 and the value neural network 160 based on interactions of the agent 102 with a simulated version of the environment 104” teaches training neural networks of the system with labeled training data (corresponds to a label-guided exploration probability)).
in response to that a random value matches the label-guided exploration probability, providing the diagnosed disease in the medical records to the neural network model as the result prediction action for guiding the neural network model (Graepel et al., Para. [0038], “the system trains the SL policy neural network to generate action probabilities that match the action labels for the labeled training data by adjusting the values of the parameters of the SL policy neural network from initial values of the parameters to trained values of the parameters. For example, the system can train the SL policy neural network using asynchronous stochastic gradient descent updates to maximize the log likelihood of the action identified by the action label for a given training observation” teaches an action probability (corresponds to random value) that matches the action labels for the labeled training data (corresponds to label-guided exploration probability) by providing the training observation and values of parameter (corresponds to the diagnosed disease in the medical records) as an action probability prediction (corresponds to a result prediction action) from a plurality of actions).
in response to the random value fails to match the label-guided exploration probability, selecting the result prediction action from candidate prediction actions according to the neural network model (Graepel et al., Para. [0038], “the system trains the SL policy neural network to generate action probabilities that match the action labels for the labeled training data by adjusting the values of the parameters of the SL policy neural network from initial values of the parameters to trained values of the parameters. For example, the system can train the SL policy neural network using asynchronous stochastic gradient descent updates to maximize the log likelihood of the action identified by the action label for a given training observation” teaches an action probability (corresponds to random value) that matches the action labels for the labeled training data (corresponds to label-guided exploration probability) by providing the training observation and values of parameters (corresponds to the diagnosed disease in the medical records). Para. [0066], “the system trains the RL policy neural network to adjust the values of the parameters of the RL policy neural network to generate action probabilities that represent, for each action, the likelihood that for each action, a predicted likelihood that a long-term reward that will be received will be maximized if the action is performed by the agent in response to the observation instead of any other action in the set of possible actions” teaches utilizing a neural network to select an action probability prediction (corresponds to a result prediction action) from a plurality of actions (corresponds to candidate prediction actions)).
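For illustration only, the two branches of claim 11 can be read as the following sketch. It is a hypothetical reading of the claim language and is not drawn from either reference. The function and parameter names are introduced here, and treating "a random value matches the label-guided exploration probability" as a uniform draw falling below that probability is an assumption.

```python
import random

def choose_prediction(model_choice, diagnosed_disease, guide_prob, rng=random):
    """Pick the result prediction action for one training step.

    With probability guide_prob the diagnosed disease from the medical
    record is supplied as the prediction (label-guided exploration);
    otherwise the neural network model's own choice is used.
    """
    if rng.random() < guide_prob:  # assumed meaning of "matches"
        return diagnosed_disease
    return model_choice
```

With `guide_prob` set to 1.0 the recorded diagnosis is always returned, and with 0.0 the model's choice is always returned, since `random()` draws from the half-open interval [0.0, 1.0).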
Regarding Claim 12,
The Shousha et al. in view of Graepel et al. combination of claim 1 teaches the control method as claimed in claim 1,
The combination, as described in the rejection of claim 1, further teaches wherein the result prediction action comprises at least one of a disease prediction action and a medical department recommendation action corresponding to the disease prediction action (Graepel et al., Para. [0029], “a policy neural network is a neural network that is configured to receive an observation and to process the observation in accordance with parameters of the policy neural network to generate a respective action probability for each action in the set of possible actions that can be performed by the agent to interact with the environment” teaches an action probability prediction (corresponds to the result prediction action) that comprises observation and interaction with the environment. Para. [0026], “when the environment 104 is a patient diagnosis environment, the observations may be data from an electronic medical record of a current patient” teaches the environment is a patient diagnosis environment (corresponds to at least one of a disease prediction action). Para. [0018], “the environment may be a patient diagnosis environment such that each state is a respective patient state of a patient, i.e., as reflected by health data characterizing the health of the patient, and the agent may be a computer system for suggesting treatment for the patient. In this example, the actions in the set of actions are possible medical treatments for the patient and the objectives can include one or more of maintaining a current health of the patient, improving the current health of the patient, minimizing medical expenses for the patient, and so on” teaches the actions in the set of actions are possible medical treatments for the patient (corresponds to a medical department recommendation action) that corresponds to the patient diagnosis environment (corresponds to the disease prediction action)).
Regarding Claim 13,
Shousha et al. teaches a medical system, comprising (Shousha et al., Col. 11 Lines 35-40, “the system 10 is configured to predict one or more diseases or conditions, if any, in the human cornea and/or predict the severity of disease or condition using an input image… input data may include patient data such as demographic data or medical data” teaches a medical system).
an interaction system, configured for receiving an initial symptom (Shousha et al., Col. 11 Lines 25-32, “the system 10 receives data input which is processed through an AI models 12 to generate an output prediction… The data input may include various data such as color and/or high resolution images of the cornea or anterior segment” teaches receiving various data of the cornea or anterior segment as input data (corresponds to initial symptom)).
Shousha et al. does not appear to explicitly teach a decision agent interacting with the interaction system; and a neural network model, utilized by the decision agent to select at least one symptom inquiry action according to the initial symptom; wherein the interaction system is configured to receive at least one symptom answer in response to the at least one symptom inquiry action, wherein the neural network model is utilized by the decision agent to select at least one medical test action from candidate test actions according to the initial symptom and the at least one symptom answer, wherein the interaction system is configured to receive at least one test result of the at least one medical test action, and wherein the neural network model is utilized by the decision agent to select a result prediction action from candidate prediction actions according to the initial symptom, the at least one symptom answer and the at least one test result.
However, Graepel et al. teaches a decision agent interacting with the interaction system (Graepel et al., Para. [0003], “Reinforcement learning agents interact with an environment by receiving an observation that characterizes the current state of the environment, and in response, performing an action” teaches reinforcement learning agents interacting with an environment (corresponds to the interaction system)).
a neural network model, utilized by the decision agent to select at least one symptom inquiry action according to the initial symptom (Graepel et al., FIG. 2 and Para. [0054], “FIG. 2 is a flow diagram of an example process 200 for training a collection of neural networks for use in selecting actions to be performed by an agent interacting with an environment” teaches utilizing neural networks to select actions (corresponds to at least one symptom inquiry action) according to the received observation (corresponds to the initial symptom)).
wherein the interaction system is configured to receive at least one symptom answer in response to the at least one symptom inquiry action (Graepel et al., Para. [0015], “the agent interacts with the environment in order to complete one or more objectives and the reinforcement learning system selects actions in order to maximize the objectives, as represented by numeric rewards received by the reinforcement learning system in response to actions performed by the agent” teaches one or more objectives (corresponds to at least one symptom answer) received for completion, in response to the selected actions).
wherein the neural network model is utilized by the decision agent to select at least one medical test action from candidate test actions according to the initial symptom and the at least one symptom answer (Graepel et al., Para. [0014], “the reinforcement learning system receives data characterizing the current state of the environment and selects an action to be performed by the agent from a set of actions in response to the received data. Once the action has been selected by the reinforcement learning system, the agent performs the action to interact with the environment” teaches selecting an action (corresponds to selecting at least one medical test action) from a set of actions (corresponds to candidate test actions) according to the received data (corresponds to the initial symptom) and the performance objective (corresponds to the at least one symptom answer)).
wherein the interaction system is configured to receive at least one test result of the at least one medical test action (Graepel et al., Para. [0018], “the actions in the set of actions are possible medical treatments for the patient and the objectives can include one or more of maintaining a current health of the patient, improving the current health of the patient, minimizing medical expenses for the patient, and so on” teaches the objectives (corresponds to the at least one test result) of the possible medical treatments (corresponds to the at least one medical test action) for the patient). 
wherein the neural network model is utilized by the decision agent to select a result prediction action from candidate prediction actions according to the initial symptom, the at least one symptom answer and the at least one test result (Graepel et al., Para. [0066], “the system trains the RL policy neural network to adjust the values of the parameters of the RL policy neural network to generate action probabilities that represent, for each action, the likelihood that for each action, a predicted likelihood that a long-term reward that will be received will be maximized if the action is performed by the agent in response to the observation instead of any other action in the set of possible actions” teaches utilizing a neural network to select an action probability prediction (corresponds to a result prediction action) from a plurality of actions according to the observation (corresponds to the initial symptom), the objectives, and performance objective).
Regarding Claim 14,
The Shousha et al. in view of Graepel et al. combination of claim 13 teaches the medical system as claimed in claim 13, wherein the medical system further comprises:
The combination, as described in the rejection of claim 13, further teaches a reinforcement learning agent interacting with the interaction system (Graepel et al., Para. [0003], “Reinforcement learning agents interact with an environment by receiving an observation that characterizes the current state of the environment, and in response, performing an action” teaches reinforcement learning agents interacting with an environment (corresponds to the interaction system)).
wherein the neural network model is trained by the reinforcement learning agent according to training data, the training data comprising a plurality of medical records, each one of the medical records comprising a diagnosed disease and a plurality of medical test results performed for diagnosing the diagnosed disease (Graepel et al., Para. [0056], “The labeled training data for the SL policy neural network includes multiple training observations and, for each training observation, an action label that identifies an action that was performed in response to the training observation” teaches the training data including training observations. Para. [0026], “when the environment 104 is a patient diagnosis environment, the observations may be data from an electronic medical record of a current patient” teaches the observations comprising the medical record of patients. Para. [0018], “the environment may be a patient diagnosis environment such that each state is a respective patient state of a patient, i.e., as reflected by health data characterizing the health of the patient” teaches the environment being patient diagnosis environment with state of a patient (corresponds to a diagnosed disease and a plurality of medical test results performed for diagnosing the diagnosed disease)).
wherein the neural network model is utilized by the reinforcement learning agent for selecting the at least one medical test action and to select the result prediction action (Graepel et al., Para. [0014], “the reinforcement learning system receives data characterizing the current state of the environment and selects an action to be performed by the agent from a set of actions in response to the received data. Once the action has been selected by the reinforcement learning system, the agent performs the action to interact with the environment” teaches selecting an action (corresponds to selecting at least one medical test action) from a set of actions (corresponds to candidate test actions) according to the received data (corresponds to training data). Para. [0066], “the system trains the RL policy neural network to adjust the values of the parameters of the RL policy neural network to generate action probabilities that represent, for each action, the likelihood that for each action, a predicted likelihood that a long-term reward that will be received will be maximized if the action is performed by the agent in response to the observation instead of any other action in the set of possible actions” teaches utilizing a neural network to select an action probability prediction (corresponds to a result prediction action) from a plurality of actions according to the observation (corresponds to training data)).
wherein the interaction system provides a test abnormality reward to the reinforcement learning agent according to the medical test results in the training data corresponding to the at least one medical test action (Graepel et al., Para. [0067], “the system completes an episode of interaction of the agent while the actions were being selected using the RL policy neural network and then generates a long-term reward for the episode. The system generates the long-term reward based on the outcome of the episode, i.e., on whether the objectives were completed during the episode. For example, the system can set the reward to one value if the objectives were completed and to another, lower value if the objectives were not completed” teaches a long-term reward (corresponds to a test abnormality reward) according to whether the objectives were completed (corresponds to the medical test results in the training data), based on the actions (corresponds to the at least one medical test action)).
wherein the interaction system provides a test cost penalty to the reinforcement learning agent according to the at least one medical test action (Graepel et al., Para. [0076], “The system then trains the value neural network on the training observations using supervised learning to determine trained values of the parameters of the value neural network from initial values of the parameters of the neural network. For example, the system can train the value neural network using asynchronous gradient descent to minimize the mean squared error between the value scores and the actual long-term reward received” teaches the mean squared error (corresponds to a test cost penalty) according to the scores of the training observations (corresponds to the at least one medical test action)).
wherein the interaction system provides a prediction reward to the reinforcement learning agent according to a comparison between the result prediction action and the diagnosed disease in the medical records (Graepel et al., Para. [0071], “the system trains the value neural network to generate a value score for a given state of the environment that represents the predicted long-term reward resulting from the environment being in the state by adjusting the values of the parameters of the value neural network” teaches the predicted long-term reward (corresponds to a prediction reward) based on the value score for a given state of the environment (corresponds to a comparison between the result prediction action and the diagnosed disease in the medical records)).  
wherein the neural network model is trained to maximize cumulative rewards in reference with the test abnormality reward, the prediction reward and the test cost penalty (Graepel et al., Para. [0066], “the system trains the RL policy neural network to adjust the values of the parameters of the RL policy neural network to generate action probabilities that represent, for each action, the likelihood that for each action, a predicted likelihood that a long-term reward that will be received will be maximized if the action is performed by the agent in response to the observation instead of any other action in the set of possible actions. Generally, the long-term reward is a numeric value that is dependent on the degree to which the one or more objectives are completed during interaction of the agent with the environment” teaches training the neural network to maximize the long-term reward received (corresponds to maximize cumulative rewards)).
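For illustration only, the reward terms recited in claim 14 can be combined as in the following sketch. It is a hypothetical reading of the claim language and is not drawn from either reference. The function name, the dictionary layout, the "abnormal" result label, and the bonus values are all assumptions introduced here; the claim itself does not specify magnitudes or how the terms are summed.

```python
def cumulative_reward(test_results, test_costs, predicted, diagnosed,
                      abnormal_bonus=1.0, correct_bonus=10.0):
    """Sum the three reward terms for one training episode.

    test_results maps each selected test to "abnormal" or "normal" as
    recorded in the medical record; test_costs maps each test to its cost.
    """
    # Test abnormality reward: one bonus per selected test whose result is abnormal.
    abnormality = sum(abnormal_bonus for r in test_results.values() if r == "abnormal")
    # Test cost penalty: total cost of the tests that were selected.
    cost = sum(test_costs[t] for t in test_results)
    # Prediction reward: granted only when the prediction equals the diagnosed disease.
    prediction = correct_bonus if predicted == diagnosed else 0.0
    return abnormality + prediction - cost
```

For example, with one abnormal test costing 0.5, one normal test costing 1.5, and a correct prediction, the sketch returns 1.0 + 10.0 - 2.0 = 9.0.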
Regarding Claim 15,
The Shousha et al. in view of Graepel et al. combination of claim 14 teaches the medical system as claimed in claim 14,
The combination, as described in the rejection of claim 14, further teaches wherein the each one of the medical records further comprises a plurality of diagnosed symptoms related to the diagnosed disease (Graepel et al., Para. [0026], “when the environment 104 is a patient diagnosis environment, the observations may be data from an electronic medical record of a current patient” teaches the observations comprising the medical record of patients. Para. [0018], “the environment may be a patient diagnosis environment such that each state is a respective patient state of a patient, i.e., as reflected by health data characterizing the health of the patient” teaches the environment being patient diagnosis environment with state of a patient (corresponds to a plurality of diagnosed symptoms related to the diagnosed disease)).
the neural network model is further utilized for selecting a plurality of symptom inquiry actions before the medical test action and the result prediction action (Graepel et al., FIG. 3 and Para. [0079], “prior to selecting the action to be performed by the agent in response to the current observation, the system searches or continues to search the state tree until an action is to be selected (step 306). That is, in some implementations, the system is allotted a certain time period after receiving the observation to select an action. In these implementations, the system continues performing searches as described below with reference to FIG. 4, starting from the current node in the state tree until the allotted time period elapses. The system can then update the state tree and the edge data based on the searches before selecting an action in response to the current observation. In some of these implementations, the system searches or continues searching only if the edge data indicates that the action to be selected may be modified as a result of the additional searching” teaches selecting actions based on the observation. A search is performed until an action is selected (corresponds to selecting a plurality of symptom inquiry actions before) and then select action using the edge data (corresponds to the medical test action and the result prediction action)).
wherein the interaction system determines a first input state comprising symptom inquiry answers of the symptom inquiry actions, wherein the symptom inquiry answers are determined according to the diagnosed symptoms in the medical record of the training data (Graepel et al., Para. [0050], “the action selection subsystem 120 repeatedly performs searches of the state tree to update the tree and edge data. Performing a search of the state tree to update the state tree” teaches repeatedly determining the state (corresponds to determining a first input state). Para. [0022], “the reinforcement learning system 100 receives observations, with each observation being data characterizing a respective state of the environment 104” teaches receiving observation data characterizing a respective state (corresponds to first input state comprising symptom inquiry answers of the symptom inquiry actions). Para. [0026], “when the environment 104 is a patient diagnosis environment, the observations may be data from an electronic medical record of a current patient” teaches the observations (corresponds to the symptom inquiry answers) are determined based on the patient’s medical record).
wherein the reinforcement learning agent selects the at least one medical test action according to the first input state (Graepel et al., Para. [0014], “the reinforcement learning system receives data characterizing the current state of the environment and selects an action to be performed by the agent from a set of actions in response to the received data. Once the action has been selected by the reinforcement learning system, the agent performs the action to interact with the environment” teaches selecting an action (corresponds to selecting at least one medical test action) from a set of actions according to the received data characterizing the current state (corresponds to the first input state)).
wherein the interaction system determines a second input state comprising the symptom inquiry answers and at least one medical test answer corresponding to the at least one medical test action (Graepel et al., Para. [0050], “the action selection subsystem 120 repeatedly performs searches of the state tree to update the tree and edge data. Performing a search of the state tree to update the state tree” teaches repeatedly determining the state (corresponds to determining a second input state). Para. [0022], “the reinforcement learning system 100 receives observations, with each observation being data characterizing a respective state of the environment 104” teaches receiving observations data characterizing a respective state (corresponds to second input state comprising symptom inquiry answers of the symptom inquiry actions). Para. [0026], “when the environment 104 is a patient diagnosis environment, the observations may be data from an electronic medical record of a current patient” teaches the observations (corresponds to the symptom inquiry answers) are determined based on the patient’s medical record). 
wherein the reinforcement learning agent selects the result prediction action according to the second input state (Graepel et al., Para. [0066], “the system trains the RL policy neural network to adjust the values of the parameters of the RL policy neural network to generate action probabilities that represent, for each action, the likelihood that for each action, a predicted likelihood that a long-term reward that will be received will be maximized if the action is performed by the agent in response to the observation instead of any other action in the set of possible actions” teaches utilizing a neural network to select an action probability prediction (corresponds to a result prediction action) from a plurality of actions according to the observation (corresponds to the second input state)).
Regarding Claim 16,
The Shousha et al. in view of Graepel et al. combination of claim 15 teaches the medical system as claimed in claim 15,
The combination, as described in the rejection of claim 15, further teaches wherein a combination of medical test actions are selected from the candidate test actions simultaneously according to the first input state (Graepel et al., Para. [0014], “the reinforcement learning system receives data characterizing the current state of the environment and selects an action to be performed by the agent from a set of actions in response to the received data. Once the action has been selected by the reinforcement learning system, the agent performs the action to interact with the environment” teaches selecting an action (corresponds to selecting at least one medical test action) from a set of actions (corresponds to candidate test actions) according to the received data and the performance objective (corresponds to the first input state)).
Regarding Claim 17,
The Shousha et al. in view of Graepel et al. combination of claim 16 teaches the medical system as claimed in claim 16, 
The combination, as described in the rejection of claim 16, further teaches wherein the reinforcement learning agent generates probability values of the candidate test actions and complement probability values of the candidate test actions by the neural network model according to the first input state (Graepel et al., Para. [0066], “the system trains the RL policy neural network to adjust the values of the parameters of the RL policy neural network to generate action probabilities that represent, for each action, the likelihood that for each action… in response to the observation instead of any other action in the set of possible actions” teaches generating the values of the action probabilities (corresponds to probability values of the candidate test actions) in response to the observation (corresponds to the candidate test actions) of the neural network).
wherein the reinforcement learning agent determines a plurality of weights of all combinations of the candidate test actions according to the probability values and the complement probability values (Graepel et al., Para. [0071], “the system trains the value neural network to generate a value score for a given state of the environment that represents the predicted long-term reward resulting from the environment being in the state by adjusting the values of the parameters of the value neural network” teaches determining the values of the parameters (corresponds to a plurality of weights of all combinations) of the environment observation (corresponds to the candidate test actions) according to value score (corresponds to the probability values and the complement probability values)).
wherein the reinforcement learning agent selects the combination of medical test actions from the all combinations of the candidate test actions in reference with the weights (Graepel et al., Para. [0014], “the reinforcement learning system receives data characterizing the current state of the environment and selects an action to be performed by the agent from a set of actions in response to the received data. Once the action has been selected by the reinforcement learning system, the agent performs the action to interact with the environment” teaches selecting an action (corresponds to selecting at least one medical test action) from a set of actions (corresponds to candidate test actions) according to the received data and the performance objective. Para. [0053], “selects the action to be performed by the agent 102 using the current edge data for the edges that are outgoing from the node in the state tree that represents the state characterized by the observation” teaches selecting actions based on the edge data (corresponds to the weights)).
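For illustration only, the selection recited in claim 17 can be read as the following sketch. It is a hypothetical reading of the claim language and is not drawn from either reference. The function names are introduced here, and two steps are assumptions: weighting each combination by the product of the probability value of every included test and the complement probability value of every excluded test, and selecting the combination with the largest weight. The claim states only that the weights are determined "according to" those values and that the selection is made "in reference with" the weights.

```python
from itertools import product

def combination_weights(probs):
    """Weight every subset of candidate tests.

    probs maps each candidate test to its probability value p; the
    complement probability value is 1 - p.
    """
    names = list(probs)
    weights = {}
    for mask in product([0, 1], repeat=len(names)):
        w = 1.0
        for name, chosen in zip(names, mask):
            # Included tests contribute p, excluded tests contribute 1 - p.
            w *= probs[name] if chosen else 1.0 - probs[name]
        combo = tuple(n for n, c in zip(names, mask) if c)
        weights[combo] = w
    return weights

def select_combination(probs):
    """Return the subset of tests with the largest weight (assumed selection rule)."""
    weights = combination_weights(probs)
    return max(weights, key=weights.get)
```

Under this reading, n candidate tests give 2^n combinations, including the empty combination, and the weights sum to 1, so they could equally be sampled from rather than maximized.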
Regarding Claim 18,
The Shousha et al. in view of Graepel et al. combination of claim 15 teaches the medical system as claimed in claim 15,
The combination, as described in the rejection of claim 15, further teaches wherein the neural network model comprises a common neural network portion, a first branch neural network portion, a second branch neural network portion and a third branch neural network portion, wherein the first branch neural network portion and the second branch neural network portion and the third branch neural network portion are respectively connected to the common neural network portion (Graepel et al., FIG. 1 and Para. [0028], “the reinforcement learning system 100 selects actions using a collection of neural networks that includes at least one policy neural network, e.g., a supervised learning (SL) policy neural network 140, a reinforcement learning (RL) policy neural network 150, or both, a value neural network 160, and, optionally, a fast rollout neural network 130” teaches the reinforcement learning system comprising multiple neural network portions that are connected).
wherein a first result state generated by the first branch neural network portion is utilized to select the symptom inquiry actions, a second result state generated by the second branch neural network portion is utilized to select the at least one medical test action, and a third result state generated by the third branch neural network portion is utilized to select the result prediction action (Graepel et al., Para. [0094], “the system receives rollout data characterizing the state and processes the rollout data using the fast rollout policy neural network that has been trained to receive the rollout data to generate a respective rollout action probability for each action in the set of possible actions. In some implementations, the system then selects the action having a highest rollout action probability as the action to be performed by the agent in response to the rollout data characterizing the state” teaches that a rollout policy neural network is utilized to select actions that are performed by the agent (corresponds to the symptom inquiry actions), based on rollout data characterizing the state (corresponds to a first result state). Para. [0014], “the reinforcement learning system receives data characterizing the current state of the environment and selects an action to be performed by the agent from a set of actions in response to the received data. Once the action has been selected by the reinforcement learning system, the agent performs the action to interact with the environment” teaches selecting an action (corresponds to selecting at least one medical test action) from a set of actions according to the received data characterizing the current state of the environment (corresponds to the second result state). Para. [0066], “the system trains the RL policy neural network to adjust the values of the parameters of the RL policy neural network to generate action probabilities that represent, for each action, the likelihood that for each action, a predicted likelihood that a long-term reward that will be received will be maximized if the action is performed by the agent in response to the observation instead of any other action in the set of possible action” teaches utilizing a neural network to select an action probability prediction (corresponds to a result prediction action) from a plurality of actions according to the observation (corresponds to the third result state)).
Regarding Claim 19,
The Shousha et al. in view of Graepel et al. combination of claim 18 teaches the medical system as claimed in claim 18,
The combination, as described in the rejection of claim 18, further teaches wherein the first branch neural network portion and the third branch neural network portion adopt first activation functions, and the second branch neural network portion adopts a second activation function different from the first activation functions (Shousha et al., Col. 20 Lines 41-43, “the AI model 12 may employ deep learning models comprising one or more neural networks” teaches multiple neural networks (corresponds to branch neural network portions). Col. 21 Lines 44-62, “these layers identified herein may include such layers that apply activation functions such as ReLU, TanH, or sigmoid …an output layer includes a classifier that applies an activation function such as regression/logical regression, linear, SVM, or SoftMax” teaches different activation functions).
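For illustration only, the arrangement recited in claims 18 and 19 (a common portion feeding three branch portions, with the second branch using a different activation function) can be sketched as follows. The sketch is not drawn from the application or from either reference. The choice of ReLU for the common portion, SoftMax for the first and third branches, and sigmoid for the second branch is an assumption, as are the names `forward`, `dense`, and the parameter keys.

```python
import math


def dense(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(v):
    return [max(0.0, a) for a in v]


def softmax(v):
    m = max(v)  # subtract the max for numerical stability
    e = [math.exp(a - m) for a in v]
    s = sum(e)
    return [a / s for a in e]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def forward(state, params):
    """Run the common portion once, then each branch on its output.

    params maps a (hypothetical) portion name to a (weights, bias) pair.
    """
    shared = relu(dense(state, *params["common"]))
    return {
        # first branch: scores over symptom inquiry actions (assumed SoftMax)
        "symptom": softmax(dense(shared, *params["symptom"])),
        # second branch: independent per-test values (assumed sigmoid)
        "test": sigmoid(dense(shared, *params["test"])),
        # third branch: scores over result prediction actions (assumed SoftMax)
        "prediction": softmax(dense(shared, *params["prediction"])),
    }
```

In this sketch the SoftMax branches yield a distribution over mutually exclusive actions, while the sigmoid branch yields one independent value per candidate test, which is one way the two activation functions could differ.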
Regarding Claim 20,
The Shousha et al. in view of Graepel et al. combination of claim 14 teaches the medical system as claimed in claim 14,
The combination, as described in the rejection of claim 14, further teaches wherein the interaction system provides a label-guided exploration probability (Graepel et al., Para. [0038], “When used by the system 100 in selecting actions, the neural network training subsystem 110 trains the fast rollout neural network 130 and the SL policy neural network 140 on labeled training data using supervised learning and trains the RL policy neural network 150 and the value neural network 160 based on interactions of the agent 102 with a simulated version of the environment 104” teaches training neural networks of the system with labeled training data (corresponds to a label-guided exploration probability)).
in response to that a random value matches the label-guided exploration probability, the interaction system provides the diagnosed disease in the medical records to the neural network model as the result prediction action for guiding the neural network model (Graepel et al., Para. [0038], “the system trains the SL policy neural network to generate action probabilities that match the action labels for the labeled training data by adjusting the values of the parameters of the SL policy neural network from initial values of the parameters to trained values of the parameters. For example, the system can train the SL policy neural network using asynchronous stochastic gradient descent updates to maximize the log likelihood of the action identified by the action label for a given training observation” teaches an action probability (corresponds to random value) that matches the action labels for the labeled training data (corresponds to label-guided exploration probability) by providing the training observation and values of parameter (corresponds to the diagnosed disease in the medical records) as an action probability prediction (corresponds to a result prediction action) from a plurality of actions). 
in response to the random value fails to match the label-guided exploration probability, the interaction system selects the result prediction action from candidate prediction actions according to the neural network model (Graepel et al., Para. [0038], “the system trains the SL policy neural network to generate action probabilities that match the action labels for the labeled training data by adjusting the values of the parameters of the SL policy neural network from initial values of the parameters to trained values of the parameters. For example, the system can train the SL policy neural network using asynchronous stochastic gradient descent updates to maximize the log likelihood of the action identified by the action label for a given training observation” teaches an action probability (corresponds to random value) that matches the action labels for the labeled training data (corresponds to label-guided exploration probability) by providing the training observation and values of parameter (corresponds to the diagnosed disease in the medical records). Para. [0066], “the system trains the RL policy neural network to adjust the values of the parameters of the RL policy neural network to generate action probabilities that represent, for each action, the likelihood that for each action, a predicted likelihood that a long-term reward that will be received will be maximized if the action is performed by the agent in response to the observation instead of any other action in the set of possible action” teaches utilizing a neural network to select an action probability prediction (corresponds to a result prediction action) from a plurality of actions (corresponds to candidate prediction actions)).
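For illustration only, the two branches recited in claim 20 can be sketched as follows. The sketch is not drawn from the application or from Graepel et al. It assumes that a random value "matches" the label-guided exploration probability when a uniform draw falls below that probability, and that the model otherwise selects its highest-scoring candidate. The function name `choose_prediction` and the disease names are hypothetical.

```python
import random


def choose_prediction(model_scores, diagnosed_disease, explore_prob, rng):
    """Pick the result prediction action for one training step.

    model_scores: candidate prediction action -> score from the model.
    diagnosed_disease: the label recorded in the medical record.
    explore_prob: label-guided exploration probability in [0, 1].
    """
    # Assumed meaning of "matches": uniform draw below the probability.
    if rng.random() < explore_prob:
        return diagnosed_disease  # guide the model with the recorded label
    # Otherwise defer to the model (assumed: highest-scoring candidate).
    return max(model_scores, key=model_scores.get)


scores = {"flu": 0.2, "cold": 0.7, "asthma": 0.1}
action = choose_prediction(scores, "flu", 0.3, random.Random(0))
```

With `explore_prob` set to 1.0 the recorded diagnosis is always returned, and with 0.0 the model's own selection is always returned.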

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Henry T Nguyen whose telephone number is (571)272-8860. The examiner can normally be reached Monday-Friday 8:00am-5:00pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Kamran Afshar, can be reached at (571) 272-7796. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/HENRY TRONG NGUYEN/
Examiner, Art Unit 2125
/BRIAN M SMITH/
Primary Examiner, Art Unit 2122