DETAILED ACTION
This is the first office action regarding application number 16/071,884, filed July 20, 2018.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

Drawings
The drawings are objected to because of the following informalities:
Figure 2, element 112 (leftmost): text should read “REWARD RELATION EXTRACTION MODULE”. Appropriate correction is required. 
Figure 14, element S1403: to be consistent with the specification (refer to paragraph [0159]), text box should indicate that “conditions” are compared, not “rewards”. Appropriate correction is required.
Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. The figure or figure number of an amended drawing should not be labeled as “amended.” If a drawing figure is to be canceled, the appropriate figure must be removed from the replacement sheet, and where necessary, the remaining figures must be renumbered and appropriate changes made to the brief description of the several views of the drawings for consistency.

Specification
The disclosure is objected to because of the following informalities:
Paragraph [0002]: the following phrase/term is unclear, and needs to be corrected: “One of objects for analyzing the relation among such information is for optimizing a system by using an evaluation formula indicating the relation among the information.” Appropriate correction is required.
Paragraph [0032]: the following typographical errors need to be corrected: “The server 100 performs the reinforcement learning, thereby learning a combination and the like of .” Appropriate correction is required.
Paragraphs [0072]-[0073]: Formula 1 in paragraph [0072] needs to be corrected in order to be consistent with the corresponding text in paragraph [0073], namely the following:
Formula 1 uses the symbol ‘a’ as the learning coefficient; in paragraph [0073] the symbol ‘α’ denotes the learning coefficient. Appropriate correction is required.
Formula 1 uses an uppercase symbol ‘Γ’ for the discount rate; in paragraph [0073] the lowercase symbol ‘γ’ denotes the discount rate. Appropriate correction is required.
Formula 1 contains the symbol ‘MAX’, but paragraph [0073] does not explain this symbol or its usage. Appropriate correction is required.
Formula 1 contains the symbol ‘RT’, but paragraph [0073] does not explain this symbol or its usage. Appropriate correction is required.
Formula 1 contains “…” at the end, suggesting that more details are to follow (i.e., this equation is not complete). The complete equation needs to be disclosed. If Formula 1 is complete then “…” needs to be removed. Appropriate correction is required.
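For reference only, the textbook one-step Q-learning update is sketched below using the notation of paragraph [0073] (lowercase α for the learning coefficient, lowercase γ for the discount rate). This is an assumption about the form Formula 1 may be intended to take, not a reproduction of it; the symbols s_t, a_t, and r_{t+1} are the conventional state, action, and reward symbols and are not taken from the application.

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]
```

In this conventional form, max denotes the maximum over the actions available in the next state and r_{t+1} is the reward received after taking action a_t. Whether ‘MAX’ and ‘RT’ in Formula 1 carry these meanings is for the applicant to state.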

[Image: media_image1.png (greyscale)]


Claim Rejections - 35 USC § 112




The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:

The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 5-6 and 11-12 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
Regarding Claim 5,
The claim limitation “calculates a contribution value indicating a magnitude of contribution to the objective variable value of the reward included in the first entry” contains the term “objective variable value of the reward” which has no support in the specification and thus renders the claim indefinite. Paragraph [0014] states: “The agents 150 are control targets. The agents 150 of the present example acquire values indicating a state of an environment and perform actions according to instructions of the server 100. An object is set to each agent 150. For example, speeding-up of specific work is set as an object. In the present example, an object is managed as an objective variable. In the following description, a value of the objective variable is also referred to as an objective variable value.”, and paragraph [0058] states: “The reward 504 is a reward given when the conditions defined by the variable name 502 and the value 503 are identical to each other.”, and refers to Figure 5. However, Figure 4 shows objective variable values associated with episodes, and Figure 5 shows that a separate reward is a value within a condition (entry), which does not contain an objective variable value. It is not clear what the term “objective variable value of the reward” in the claim is referencing, whether it is in reference to the actual reward value, or whether it is to indicate an implicit relationship with the condition (e.g., the reward function information entry), and hence this lack of clarity renders this claim as indefinite.
Claim 6 is a dependent claim of Claim 5 that inherits the same indefiniteness established above in Claim 5, and hence is also rejected as being indefinite by virtue of dependency.
Regarding Claim 6, 
The term "wherein when the contribution value is a value indicating that contribution of the reward included in the first entry to the objective variable value is small" in claim 6 is a relative term which renders the claim indefinite. The term "small" is not defined by the claim, the specification does not provide a standard for ascertaining the requisite degree, and one of ordinary skill in the art would not be reasonably apprised of the scope of the invention. Paragraph [0112] states: “In addition, the reward relation extraction module 112 may acquire a learning effect from the reinforcement learning module 111, calculate a contribution value for an objective variable, and increase the value of the confidence value 505 of a reward in which the contribution value is low. In this way, only a reward, which contributes to speeding-up of a reinforcement learning, can be controlled to be shared.”, and paragraph [0158] states: “In addition, the calculation method of the distance of the reward function information 123 based on an attribute may be dynamically changed on the basis of a learning result. For example, even in similar conditions, in the case of a reward in which contribution to an objective variable value is low, a method is considered to set a weight coefficient such that the distance becomes long.”, but the specification does not provide a measurement or standard in which to determine the degree of a contribution value (e.g., low or high, small or large) or even whether the term “small contribution value” is equivalent to the term “low contribution value” used in the specification.
While a person having ordinary skill in the relevant art would be able to apply a measurement to identify a low (or small) contribution value, there would be no way to determine whether their particular standard in which to identify a “low” or “small” contribution value would match the “small” contribution value as stated in the claim limitation, and hence this lack of clarity renders this claim as indefinite.
Furthermore, the term "the processor updates the confidence value included in the first entry to a value indicating that statistical confidence is low" in claim 6 is a relative term which renders the claim indefinite. The term "low" is not defined by the claim, the specification does not provide a standard for ascertaining the requisite degree, and one of ordinary skill in the art would not be reasonably apprised of the scope of the invention. Paragraph [0042] states: “The reward sharing extraction 114 registers the updated reward function information 123 in the reward function sharing information 124. Furthermore, the reward sharing extraction 114 determines whether there is a reward with a high statistical confidence with reference to the reward function information 123 used for the reinforcement learning for other agents 150. When there is the reward with a high statistical confidence, the reward sharing extraction 114 corrects the reward function information 123 by reflecting the reward in the reward function information 123.”, but the specification does not provide a measurement or standard in which to determine the degree of a confidence value (e.g., low or high). While a person having ordinary skill in the relevant art would be able to apply a measurement to identify a low confidence value, there would be no way to determine whether their particular standard in which to identify a “low confidence value” would match the “low confidence value” as stated in the claim limitation, and hence this lack of clarity renders this claim as indefinite.
Regarding Claim 11,
The claim limitation “calculates a contribution value indicating a magnitude of contribution to the objective variable value of the reward included in the first entry” contains the term “objective variable value of the reward” which has no support in the specification and thus renders the claim indefinite. Paragraph [0014] states: “The agents 150 are control targets. The agents 150 of the present example acquire values indicating a state of an environment and perform actions according to instructions of the server 100. An object is set to each agent 150. For example, speeding-up of specific work is set as an object. In the present example, an object is managed as an objective variable. In the following description, a value of the objective variable is also referred to as an objective variable value.”, and paragraph [0058] states: “The reward 504 is a reward given when the conditions defined by the variable name 502 and the value 503 are identical to each other.”, and refers to Figure 5. However, Figure 4 shows objective variable values associated with episodes, and Figure 5 shows that a separate reward is a value within a condition (entry), which does not contain an objective variable value. It is not clear what the term “objective variable value of the reward” in the claim is referencing, whether it is in reference to the actual reward value, or whether it is to indicate an implicit relationship with the condition (e.g., the reward function information entry), and hence this lack of clarity renders this claim as indefinite.
Claim 12 is a dependent claim of Claim 11 that inherits the same indefiniteness established above in Claim 11, and hence is also rejected as being indefinite by virtue of dependency.
Regarding Claim 12, 
The term "the fifth step includes a step in which, when the contribution value is a value indicating that contribution of the reward included in the first entry to the objective variable value is small" in claim 12 is a relative term which renders the claim indefinite. The term "small" is not defined by the claim, the specification does not provide a standard for ascertaining the requisite degree, and one of ordinary skill in the art would not be reasonably apprised of the scope of the invention. Paragraph [0112] states: “In addition, the reward relation extraction module 112 may acquire a learning effect from the reinforcement learning module 111, calculate a contribution value for an objective variable, and increase the value of the confidence value 505 of a reward in which the contribution value is low. In this way, only a reward, which contributes to speeding-up of a reinforcement learning, can be controlled to be shared.”, and paragraph [0158] states: “In addition, the calculation method of the distance of the reward function information 123 based on an attribute may be dynamically changed on the basis of a learning result. For example, even in similar conditions, in the case of a reward in which contribution to an objective variable value is low, a method is considered to set a weight coefficient such that the distance becomes long.”, but the specification does not provide a measurement or standard in which to determine the degree of a contribution value (e.g., low or high, small or large) or even whether the term “small contribution value” is equivalent to the term “low contribution value” used in the specification.
While a person having ordinary skill in the relevant art would be able to apply a measurement to identify a low (or small) contribution value, there would be no way to determine whether their particular standard in which to identify a “low” or “small” contribution value would match the “small” contribution value as stated in the claim limitation, and hence this lack of clarity renders this claim as indefinite.
Furthermore, the term "the processor updates the confidence value included in the first entry to a value indicating that statistical confidence is low" in claim 12 is a relative term which renders the claim indefinite. The term "low" is not defined by the claim, the specification does not provide a standard for ascertaining the requisite degree, and one of ordinary skill in the art would not be reasonably apprised of the scope of the invention. Paragraph [0042] states: “The reward sharing extraction 114 registers the updated reward function information 123 in the reward function sharing information 124. Furthermore, the reward sharing extraction 114 determines whether there is a reward with a high statistical confidence with reference to the reward function information 123 used for the reinforcement learning for other agents 150. When there is the reward with a high statistical confidence, the reward sharing extraction 114 corrects the reward function information 123 by reflecting the reward in the reward function information 123.”, but the specification does not provide a measurement or standard in which to determine the degree of a confidence value (e.g., low or high). While a person having ordinary skill in the relevant art would be able to apply a measurement to identify a low confidence value, there would be no way to determine whether their particular standard in which to identify a “low confidence value” would match the “low confidence value” as stated in the claim limitation, and hence this lack of clarity renders this claim as indefinite.
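The concern that different practitioners could apply different standards for a "small" contribution value can be made concrete with a short sketch. The function `is_small`, the contribution value 0.2, and the thresholds 0.1 and 0.3 are all hypothetical; the specification discloses no such values.

```python
def is_small(contribution_value: float, threshold: float) -> bool:
    """Classify a contribution value as "small" under a chosen threshold."""
    return contribution_value < threshold


contribution_value = 0.2  # hypothetical contribution value

# Two practitioners adopt different, equally plausible standards (both hypothetical).
practitioner_a = is_small(contribution_value, threshold=0.1)
practitioner_b = is_small(contribution_value, threshold=0.3)

# The same value is "small" under one standard and not under the other.
print(practitioner_a, practitioner_b)
```

Because the claim supplies no threshold, the same contribution value can fall inside or outside the limitation depending on the standard chosen.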

The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.

The following is a quotation of the first paragraph of pre-AIA 35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.

Claims 1 and 7 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA), first paragraph, as failing to comply with the written description requirement. The claim(s) contains subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA 35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention.
Regarding Claim 1, the claim recites the limitation “the processor compares the rewards of the first reward function information and the second reward function information with each other, specifies a reward, which is reflected in the first reward function information from rewards set in the second reward function information”. Paragraph [0007] states: “… the processor compares the rewards of the first reward function information and the second reward function information with each other, specifies a reward, which is reflected in the first reward function information from rewards set in the second reward function information, updates the first reward function information on the basis of the specified reward, and decides an optimal action of the first control target by using the first reward function information.”, and paragraph [0149] states: “Specifically, the reward sharing extraction 114 specifies conditions, which are defined by similar variables, on the basis of the attribute information 1101, compares rewards of the specified conditions with each other, and corrects the reward function information 123 on the basis of the comparison result.”, which are mere re-phrasings of the claim limitation. However, the subsequent paragraphs in the specification which actually describe the reward sharing extraction describe a “comparison of conditions” (which contain fields other than rewards), which has a different meaning and interpretation than a “comparison of rewards”. For example, paragraph [0126] states: “Next, the reward sharing extraction 114 compares a condition (entry) of reward function information 123 different from the read reward function information 123 with the selected condition (entry) (step S904), and determines whether it is necessary to reflect a reward which is used by other agents 150 (step S905).”.
Figure 9 step 904 corroborates this description with the text “compare condition of same condition with reference to reward function information of each agent”. In addition, paragraph [0159] similarly states: “Next, the reward sharing extraction 114 compares a condition (entry) of the specified reward function information 123 with the selected condition (entry) (step S1403), and determines whether it is necessary to reflect a reward which is used by the other agents 150 (step S905).” Hence, the specification does not provide support for comparing rewards as indicated in Claim 1. The specification must describe and support the claims such that the public is informed of the boundaries of what constitutes infringement of the patent, as well as determining whether the claimed invention meets all the criteria for patentability by distinctly claiming the subject matter which the inventor regards as the invention. See MPEP 2163. Given that there is no support of this limitation present in the specification, this claim limitation fails to comply with the written description requirement. For the purposes of examination, the claim limitation will be interpreted as “the processor compares conditions of the first reward function information and the second reward function information with each other, specifies a reward, which is reflected in the first reward function information from rewards set in the second reward function information”.
Claims 2-6 are dependent claims that trace their parent claim to Claim 1. However, claim 2 resolves the §112(a) lack of written description present in Claim 1 by indicating that the conditions relating to the reward function information are compared (see claim 2 limitation “the processor specifies two entries to be compared with reference to the condition of the first reward function information and the condition of the second reward function information”), and hence the subsequent dependent claims 2-6 are not rejected under the §112(a) written description requirement.
Regarding Claim 7, the claim recites the same claim limitation as analyzed above in Claim 1 (“the processor compares the rewards of the first reward function information and the second reward function information with each other, specifies a reward, which is reflected in the first reward function information from rewards set in the second reward function information”), and hence is also rejected under 112(a) for failing to comply with the written description requirement.
Claims 8-12 are dependent claims that trace their parent claim to Claim 7. However, claim 8 resolves the §112(a) lack of written description present in Claim 7 by indicating that the conditions relating to the reward function information are compared (see claim 8 limitation “the processor specifies two entries to be compared with reference to the condition of the first reward function information and the condition of the second reward function information”), and hence the subsequent dependent claims 8-12 are not rejected under the §112(a) written description requirement.
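The distinction drawn above between comparing conditions and comparing rewards can be illustrated with a minimal sketch. It assumes, hypothetically, that each entry of reward function information holds the fields described for Figure 5 (variable name 502, value 503, reward 504, confidence value 505). The class `Entry`, the function `reflect_rewards`, the sample data, and the rule "adopt the reward from the entry with the larger confidence value" are illustrative assumptions and are not taken from the application, which may define the confidence value differently.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Entry:
    variable_name: str  # cf. variable name 502
    value: str          # cf. value 503
    reward: float       # cf. reward 504
    confidence: float   # cf. confidence value 505 (assumed: larger = more confident)

    def condition(self):
        # The condition is the (variable name, value) pair; it excludes the reward.
        return (self.variable_name, self.value)


def reflect_rewards(first, second):
    """Return a copy of `first` with rewards reflected from `second`.

    Entries are paired by comparing conditions, not rewards. A reward is
    copied only when the paired entry in `second` has higher confidence.
    """
    by_condition = {e.condition(): e for e in second}
    updated = []
    for entry in first:
        other = by_condition.get(entry.condition())
        if other is not None and other.confidence > entry.confidence:
            entry = replace(entry, reward=other.reward, confidence=other.confidence)
        updated.append(entry)
    return updated


# Hypothetical reward function information for two control targets.
first = [Entry("x", "> 10", 1.0, 0.2), Entry("y", "< 5", 3.0, 0.9)]
second = [Entry("x", "> 10", 4.0, 0.8), Entry("y", "< 5", 0.5, 0.1)]
result = reflect_rewards(first, second)
```

In this sketch the rewards themselves are never compared with each other: the conditions select which entries are paired, and the confidence values decide whether the reward is reflected, which is consistent with the interpretation adopted for examination and with the limitation of claims 2 and 8.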

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


When considering subject matter eligibility under 35 U.S.C. 101, it must be determined whether the claim is directed to one of the four statutory categories of invention, i.e., process, machine, manufacture, or composition of matter (Step 1). If the claim does fall within one of the statutory categories, the second step in the analysis is to determine whether the claim is directed to a judicial exception (Step 2A). The Step 2A analysis is broken into two prongs. In the first prong (Step 2A, Prong 1), it is determined whether or not the claims recite a judicial exception (e.g., mathematical concepts, mental processes, certain methods of organizing human activity). In the second prong (Step 2A, Prong 2), it is determined whether the claim integrates the judicial exception into a practical application. If it does not, it is determined whether the claim recites additional elements that amount to significantly more than the judicial exception (Step 2B).
Claims 1-12 are rejected under 35 U.S.C. 101 because the claimed invention is directed to an abstract idea without significantly more than the abstract idea itself, and hence is not patent-eligible subject matter. 
Regarding Claim 1,
Step 1: The claim recites a computer system, therefore it falls into one of the four statutory categories (i.e., process, machine, article of manufacture, or composition of matter).
Step 2A Prong 1: This claim recites the following abstract ideas:
when updating the first reward function information, the processor compares (Under its broadest reasonable interpretation, this claim element recites a judicial exception, as comparing reward function information represents a mental process (observations, judgments, evaluations, opinions) that is implementable in the human mind, using a generic computer as a tool to perform the mental process. See MPEP 2106.04(a)(2)(III-C).), …
… the processor … specifies a reward, which is reflected in the first reward function information from rewards set in the second reward function information (Under its broadest reasonable interpretation, this claim element recites a judicial exception, as specifying a reward represents a mental process (observations, judgments, evaluations, opinions) that is implementable in the human mind, using a generic computer as a tool to perform the mental process. See MPEP 2106.04(a)(2)(III-C).), …
… the processor … decides an optimal action of the first control target by using the first reward function information (Under its broadest reasonable interpretation, this claim element recites a judicial exception, as deciding an optimal action represents a mental process (observations, judgments, evaluations, opinions) that is implementable in the human mind, using a generic computer as a tool to perform the mental process. See MPEP 2106.04(a)(2)(III-C).).
Step 2A Prong 2: This claim further recites:
at least one computer including a processor and a memory connected to the processor (This claim element is considered a form of applying mere instructions on a generic computer to implement a judicial exception. See MPEP 2106.05(f). This claim element is also directed to a general linking to a technological environment. See MPEP 2106.05(h). This additional element does not add a meaningful limitation to the claim, and hence does not integrate the judicial exception into a practical application.), …
wherein a plurality of pieces of reward function information for defining rewards for states and actions of the control targets is managed for each of the control targets (This claim element of managing (for each of the control targets) a plurality of pieces of reward function information for defining rewards for states and actions of the control targets is directed to a form of pre-solution/insignificant extra-solution activity for use in a claimed process. See MPEP 2106.05(g). This additional element does not add a meaningful limitation to the claim, and hence does not integrate the judicial exception into a practical application.), 
the plurality of pieces of reward function information includes first reward function information for defining the reward of a first control target and second reward function information for defining the reward of a second control target (This claim element places an additional limitation on the type of plurality of pieces of reward function information, as well as generally linking the system to a technological environment. Type definitions and a general association to a technological environment do not further integrate the judicial exception into a practical application. See MPEP 2106.05(h).), …
… the processor … updates the first reward function information on the basis of the specified reward (This claim element of updating a reward based on a specified reward is directed to a form of insignificant extra-solution activity for use in a claimed process. See MPEP 2106.05(g). This additional element does not add a meaningful limitation to the claim, and hence does not integrate the judicial exception into a practical application.), …
Step 2B: This claim further recites:
at least one computer including a processor and a memory connected to the processor (As analyzed in Step 2A Prong 2, applying mere instructions on a generic computer to implement a judicial exception does not further integrate the judicial exception into a practical application. See MPEP 2106.05(f). Hence this claim element does not add significantly more than the judicial exception, alone or in combination with other elements in the claim.), …
wherein a plurality of pieces of reward function information for defining rewards for states and actions of the control targets is managed for each of the control targets (This claim element is directed to storing and retrieving information in memory, which is a well-known, understood, routine, conventional activity, and hence does not add significantly more than the judicial exception, alone or in combination with other elements in the claim. See MPEP 2106.05(d)(II), list 1, example iv.), 
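the plurality of pieces of reward function information includes first reward function information for defining the reward of a first control target and second reward function information for defining the reward of a second control target (As analyzed in Step 2A Prong 2, type definitions and a general linking to a technological environment do not further integrate the judicial exception into a practical application. See MPEP 2106.05(h). Hence this claim element does not add significantly more than the judicial exception, alone or in combination with other elements in the claim.), …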
… the processor … updates the first reward function information on the basis of the specified reward (This claim element is directed to storing and retrieving information in memory, which is a well-known, understood, routine, conventional activity, and hence does not add significantly more than the judicial exception, alone or in combination with other elements in the claim. See MPEP 2106.05(d)(II), list 1, example iv.), …
Regarding Claim 2,
Step 1: The claim recites the computer system according to claim 1, therefore it falls into one of the four statutory categories (i.e., process, machine, article of manufacture, or composition of matter).
Step 2A Prong 1: This claim is a dependent claim of Claim 1, and hence inherits the same abstract ideas mentioned above. This claim further recites the following abstract ideas:
the processor … specifies two entries to be compared with reference to the condition of the first reward function information and the condition of the second reward function information (Under its broadest reasonable interpretation, this claim element recites a judicial exception, as specifying two entries to be compared is a form of decision-making, which represents a mental process (observations, judgments, evaluations, opinions) that is implementable in the human mind, using a generic computer as a tool to perform the mental process. See MPEP 2106.04(a)(2)(III-C).), …
the processor … determines whether to allow the reward included in an entry specified from the second reward function information to be reflected as the reward included in an entry specified from the first reward function information on the basis of the confidence value of the specified two entries (Under its broadest reasonable interpretation, this claim element recites a judicial exception, as determining whether to allow the reward represents a mental process (observations, judgments, evaluations, opinions) that is implementable in the human mind, using a generic computer as a tool to perform the mental process. See MPEP 2106.04(a)(2)(III-C).), …
Step 2A Prong 2: This claim further recites:
wherein the reward function information includes an entry constituted by a condition defined from at least one of the state and the action, the reward given when the condition is satisfied, and a confidence value indicating statistical confidence of the reward (This claim element places an additional limitation on the content of reward function information (e.g., an entry constituted by a condition defined from at least one of the state and the action, the reward, … and a confidence value), as well as generally linking the system to a technological environment. Type definitions and a general association to a technological environment do not further integrate the judicial exception into a practical application. See MPEP 2106.05(h).), …
the processor … sets the reward included in the entry specified from the second reward function information in the entry specified from the first reward function information when the reward included in the entry specified from the second reward function information is reflected as the reward included in the entry specified from the first reward function information (This claim element of setting the reward is directed to a form of insignificant extra-solution activity for use in a claimed process. See MPEP 2106.05(g). This additional element does not add a meaningful limitation to the claim, and hence does not integrate the judicial exception into a practical application.).  
Step 2B: This claim further recites:
wherein the reward function information includes an entry constituted by a condition defined from at least one of the state and the action, the reward given when the condition is satisfied, and a confidence value indicating statistical confidence of the reward (As analyzed in Step 2A Prong 2, type definitions and a general linking to a technological environment do not further integrate the judicial exception into a practical application. See MPEP 2106.05(h). Hence this claim element does not add significantly more than the judicial exception, alone or in combination with other elements in the claim.), …
the processor … sets the reward included in the entry specified from the second reward function information in the entry specified from the first reward function information when the reward included in the entry specified from the second reward function information is reflected as the reward included in the entry specified from the first reward function information (This claim element is directed to storing and retrieving information in memory, which is a well-known, understood, routine, conventional activity, and hence does not add significantly more than the judicial exception, alone or in combination with other elements in the claim. See MPEP 2106.05(d)(II), list 1, example iv.).  
Regarding Claim 3,
Step 1: The claim recites the computer system of claim 2, therefore it falls into one of the four statutory categories (i.e., process, machine, article of manufacture, or composition of matter).
Step 2A Prong 1: This claim is a dependent claim of Claim 2, and hence inherits the same abstract ideas mentioned above. This claim further recites the following abstract ideas:
the processor selects a first entry from the first reward function information (Under its broadest reasonable interpretation, this claim element recites a judicial exception, as selecting a first entry represents a mental process (observations, judgments, evaluations, opinions) that is implementable in the human mind, using a generic computer as a tool to perform the mental process. See MPEP 2106.04(a)(2)(III-C).), 
the processor … specifies a second entry, in which a condition similar to a condition included in the first entry is set, from entries included in the second reward function information on the basis of the attribute information (Under its broadest reasonable interpretation, this claim element recites a judicial exception, as specifying a second entry is a form of decision-making, which represents a mental process (observations, judgments, evaluations, opinions) that is implementable in the human mind, using a generic computer as a tool to perform the mental process. See MPEP 2106.04(a)(2)(III-C).), and 
the processor … determines whether to allow the reward included in the second entry to be reflected in the reward included in the first entry on the basis of the confidence value included in the first entry and the confidence value included in the second entry (Under its broadest reasonable interpretation, this claim element recites a judicial exception, as determining whether to allow the reward on the basis of the confidence value from the first and second entries represents a mental process (observations, judgments, evaluations, opinions) that is implementable in the human mind, using a generic computer as a tool to perform the mental process. See MPEP 2106.04(a)(2)(III-C).).  
Step 2A Prong 2: This claim further recites:
wherein the at least one computer manages attribute information for managing an attribute of a state constituting a condition of the plurality of pieces of reward function information (This claim element of managing attribute information is directed to a form of insignificant extra-solution activity for use in a claimed process. See MPEP 2106.05(g). This additional element does not add a meaningful limitation to the claim, and hence does not integrate the judicial exception into a practical application.), …
Step 2B: This claim further recites:
wherein the at least one computer manages attribute information for managing an attribute of a state constituting a condition of the plurality of pieces of reward function information (This claim element is directed to storing and retrieving information in memory, which is a well-known, understood, routine, conventional activity, and hence does not add significantly more than the judicial exception, alone or in combination with other elements in the claim. See MPEP 2106.05(d)(II), list 1, example iv.), …
Regarding Claim 4, 
Step 1: The claim recites the computer system according to claim 3, therefore it falls into one of the four statutory categories (i.e., process, machine, article of manufacture, or composition of matter).
Step 2A Prong 1: This claim is a dependent claim of claim 3, and hence inherits the same abstract ideas mentioned above. This claim further recites the following abstract ideas:
wherein the processor decides a combination of optimal actions of the first control target by using the first reward function information (Under its broadest reasonable interpretation, this claim element recites a judicial exception, as deciding optimal actions using reward function information represents a mental process (observations, judgments, evaluations, opinions) that is implementable in the human mind, using a generic computer as a tool to perform the mental process. See MPEP 2106.04(a)(2)(III-C).), 
… the processor … calculates an association value indicating an association between the condition and an objective variable value for defining the optimal actions of the first control target (Under its broadest reasonable interpretation, this claim element recites a judicial exception, as calculating an association value between a condition and objective variable value is a form of assigning a value, which represents a mental process (observations, judgments, evaluations, opinions) that is implementable in the human mind, using a generic computer as a tool to perform the mental process. See MPEP 2106.04(a)(2)(III-C).), and 
… the processor … calculates the confidence value on the basis of the association value (Under its broadest reasonable interpretation, this claim element recites a judicial exception, as calculating a confidence value based on the association value is a form of assigning a value, which represents a mental process (observations, judgments, evaluations, opinions) that is implementable in the human mind, using a generic computer as a tool to perform the mental process. See MPEP 2106.04(a)(2)(III-C).).  
Step 2A Prong 2: This claim does not recite any additional elements to be further analyzed at this step.
Step 2B: This claim does not recite any additional elements to be further analyzed at this step.
Regarding Claim 5,
Step 1: The claim recites the computer system according to claim 4, therefore it falls into one of the four statutory categories (i.e., process, machine, article of manufacture, or composition of matter).
Step 2A Prong 1: This claim is a dependent claim of claim 4, and hence inherits the same abstract ideas mentioned above. This claim further recites the following abstract ideas:
wherein the processor decides the combination of the optimal actions of the first control target by using the first reward function information (Under its broadest reasonable interpretation, this claim element recites a judicial exception, as deciding optimal actions by using reward function information represents a mental process (observations, judgments, evaluations, opinions) that is implementable in the human mind, using a generic computer as a tool to perform the mental process. See MPEP 2106.04(a)(2)(III-C).), 
… the processor … calculates a contribution value indicating a magnitude of contribution to the objective variable value of the reward included in the first entry (Under its broadest reasonable interpretation, this claim element recites a judicial exception, as calculating (applying) a contribution value based on a magnitude of contribution to an objective variable value represents organizing information and manipulating information through mathematical correlations. See MPEP 2106.04(a)(2)(I-A), example iv.), and
… the processor … updates the confidence value included in the first entry on the basis of the contribution value (Under its broadest reasonable interpretation, this claim element recites a judicial exception, as updating (applying) a confidence value based on a contribution value represents organizing information and manipulating information through mathematical correlations. See MPEP 2106.04(a)(2)(I-A), example iv.).

Step 2A Prong 2: This claim does not recite any additional elements to be further analyzed at this step.
Step 2B: This claim does not recite any additional elements to be further analyzed at this step.
Regarding Claim 6, 
Step 1: The claim recites the computer system according to claim 5, therefore it falls into one of the four statutory categories (i.e., process, machine, article of manufacture, or composition of matter).
Step 2A Prong 1: This claim is a dependent claim of claim 5, and hence inherits the same abstract ideas mentioned above. This claim further recites the following abstract ideas:
wherein when the contribution value is a value indicating that contribution of the reward included in the first entry to the objective variable value is small (Under its broadest reasonable interpretation, this claim element recites a judicial exception, as deciding when a contribution value is small represents a mental process (observations, judgments, evaluations, opinions) that is implementable in the human mind, using a generic computer as a tool to perform the mental process. See MPEP 2106.04(a)(2)(III-C).), …
Step 2A Prong 2: This claim further recites:
the processor updates the confidence value included in the first entry to a value indicating that statistical confidence is low (This claim element of updating (setting) the confidence value to a low value is directed to a form of insignificant extra-solution activity for use in a claimed process. See MPEP 2106.05(g). This additional element does not add a meaningful limitation to the claim, and hence does not integrate the judicial exception into a practical application.).
Step 2B: This claim further recites:
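the processor updates the confidence value included in the first entry to a value indicating that statistical confidence is low (This claim element is directed to storing and retrieving information in memory, which is a well-known, understood, routine, conventional activity, and hence does not add significantly more than the judicial exception, alone or in combination with other elements in the claim. See MPEP 2106.05(d)(II), list 1, example iv.).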
Regarding Claim 7, 
Step 1: The claim recites a control method in a computer system, therefore it falls into one of the four statutory categories (i.e., process, machine, article of manufacture, or composition of matter).
Step 2A Prong 1: This claim recites the following abstract ideas:
a first step in which, when updating the first reward function information, the processor compares (Under its broadest reasonable interpretation, this claim element recites a judicial exception, as comparing reward function information represents a mental process (observations, judgments, evaluations, opinions) that is implementable in the human mind, using a generic computer as a tool to perform the mental process. See MPEP 2106.04(a)(2)(III-C).), …
a first step in which, … the processor … specifies a reward, which is reflected in the first reward function information from rewards set in the second reward function information (Under its broadest reasonable interpretation, this claim element recites a judicial exception, as specifying a reward is a form of decision-making, which represents a mental process (observations, judgments, evaluations, opinions) that is implementable in the human mind, using a generic computer as a tool to perform the mental process. See MPEP 2106.04(a)(2)(III-C).), …
a third step in which the processor decides an optimal action of the first control target by using the first reward function information (Under its broadest reasonable interpretation, this claim element recites a judicial exception, as deciding an optimal action using reward function information represents a mental process (observations, judgments, evaluations, opinions) that is implementable in the human mind, using a generic computer as a tool to perform the mental process. See MPEP 2106.04(a)(2)(III-C).).  
Step 2A Prong 2: This claim further recites:
the computer system including at least one computer including a processor and a memory connected to the processor (This claim element is considered a form of applying mere instructions on a generic computer to implement a judicial exception. See MPEP 2106.05(f). This claim element is also directed to a general linking to a technological environment. See MPEP 2106.05(h). This additional element does not add a meaningful limitation to the claim, and hence does not integrate the judicial exception into a practical application.), …
a plurality of pieces of reward function information for defining rewards for states and actions of the control targets being managed for each of the control targets (This claim element of managing (for each of the control targets) a plurality of pieces of reward function information for defining rewards for states and actions of the control targets is directed to a form of pre-solution/insignificant extra-solution activity for use in a claimed process. See MPEP 2106.05(g). This additional element does not add a meaningful limitation to the claim, and hence does not integrate the judicial exception into a practical application.), 
the plurality of pieces of reward function information including first reward function information for defining the reward of a first control target and second reward function information for defining the reward of a second control target (This claim element places an additional limitation on the type of plurality of pieces of reward function information, as well as generally linking the system to a technological environment. Type definitions and a general association to a technological environment do not further integrate the judicial exception into a practical application. See MPEP 2106.05(h).), …
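a second step in which the processor … updates the first reward function information on the basis of the specified reward (This claim element of updating a reward based on a specified reward is directed to a form of insignificant extra-solution activity for use in a claimed process. See MPEP 2106.05(g). This additional element does not add a meaningful limitation to the claim, and hence does not integrate the judicial exception into a practical application.), …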
Step 2B: This claim further recites:
the computer system including at least one computer including a processor and a memory connected to the processor (As analyzed in Step 2A Prong 2, applying mere instructions on a generic computer to implement a judicial exception does not further integrate the judicial exception into a practical application. See MPEP 2106.05(f). Hence this claim element does not add significantly more than the judicial exception, alone or in combination with other elements in the claim.), …
a plurality of pieces of reward function information for defining rewards for states and actions of the control targets being managed for each of the control targets (This claim element is directed to storing and retrieving information in memory, which is a well-known, understood, routine, conventional activity, and hence does not add significantly more than the judicial exception, alone or in combination with other elements in the claim. See MPEP 2106.05(d)(II), list 1, example iv.), 
the plurality of pieces of reward function information including first reward function information for defining the reward of a first control target and second reward function information for defining the reward of a second control target (As analyzed in Step 2A Prong 2, type definitions and a general linking to a technological environment do not further integrate the judicial exception into a practical application. See MPEP 2106.05(h). Hence this claim element does not add significantly more than the judicial exception, alone or in combination with other elements in the claim.), …
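a second step in which the processor … updates the first reward function information on the basis of the specified reward (This claim element is directed to storing and retrieving information in memory, which is a well-known, understood, routine, conventional activity, and hence does not add significantly more than the judicial exception, alone or in combination with other elements in the claim. See MPEP 2106.05(d)(II), list 1, example iv.), …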
Regarding Claim 8, 
Step 1: The claim recites the control method according to claim 7, therefore it falls into one of the four statutory categories (i.e., process, machine, article of manufacture, or composition of matter).
Step 2A Prong 1: This claim is a dependent claim of Claim 7, and hence inherits the same abstract ideas mentioned above. This claim further recites the following abstract ideas:
the first step includes a step in which the processor specifies two entries to be compared with reference to the condition of the first reward function information and the condition of the second reward function information (Under its broadest reasonable interpretation, this claim element recites a judicial exception, as specifying two entries to be compared is a form of decision-making, which represents a mental process (observations, judgments, evaluations, opinions) that is implementable in the human mind, using a generic computer as a tool to perform the mental process. See MPEP 2106.04(a)(2)(III-C).), …
the first step includes … a step in which the processor determines whether to allow the reward included in an entry specified from the second reward function information to be reflected as the reward included in an entry specified from the first reward function information on the basis of the confidence value of the specified two entries (Under its broadest reasonable interpretation, this claim element recites a judicial exception, as determining whether to allow the reward represents a mental process (observations, judgments, evaluations, opinions) that is implementable in the human mind, using a generic computer as a tool to perform the mental process. See MPEP 2106.04(a)(2)(III-C).), 
Step 2A Prong 2: This claim further recites:
wherein the reward function information includes an entry constituted by a condition defined from at least one of the state and the action, the reward given when the condition is satisfied, and a confidence value indicating statistical confidence of the reward (This claim element places an additional limitation on the content of reward function information (e.g., an entry constituted by a condition defined from at least one of the state and the action, the reward, … and a confidence value), as well as generally linking the system to a technological environment. Type definitions and a general association to a technological environment do not further integrate the judicial exception into a practical application. See MPEP 2106.05(h).), …
a second step includes a step in which, when the reward included in the entry specified from the second reward function information is reflected as the reward included in the entry specified from the first reward function information, the processor sets the reward included in the entry specified from the second reward function information in the entry specified from the first reward function information (This claim element of setting the reward is directed to a form of insignificant extra-solution activity for use in a claimed process. See MPEP 2106.05(g). This additional element does not add a meaningful limitation to the claim, and hence does not integrate the judicial exception into a practical application.).  
Step 2B: This claim further recites:
wherein the reward function information includes an entry constituted by a condition defined from at least one of the state and the action, the reward given when the condition is satisfied, and a confidence value indicating statistical confidence of the reward (As analyzed in Step 2A Prong 2, type definitions and a general linking to a technological environment do not further integrate the judicial exception into a practical application. See MPEP 2106.05(h). Hence this claim element does not add significantly more than the judicial exception, alone or in combination with other elements in the claim.), …
a second step includes a step in which, when the reward included in the entry specified from the second reward function information is reflected as the reward included in the entry specified from the first reward function information, the processor sets the reward included in the entry specified from the second reward function information in the entry specified from the first reward function information (This claim element is directed to storing and retrieving information in memory, which is a well-known, understood, routine, conventional activity, and hence does not add significantly more than the judicial exception, alone or in combination with other elements in the claim. See MPEP 2106.05(d)(II), list 1, example iv.).  
Regarding Claim 9,
Step 1: The claim recites the control method according to claim 8, therefore it falls into one of the four statutory categories (i.e., process, machine, article of manufacture, or composition of matter).
Step 2A Prong 1: This claim is a dependent claim of Claim 8, and hence inherits the same abstract ideas mentioned above. This claim further recites the following abstract ideas:
the first step includes a step in which the processor selects a first entry from the first reward function information (Under its broadest reasonable interpretation, this claim element recites a judicial exception, as selecting a first entry represents a mental process (observations, judgments, evaluations, opinions) that is implementable in the human mind, using a generic computer as a tool to perform the mental process. See MPEP 2106.04(a)(2)(III-C).), 
the first step includes … a step in which the processor specifies a second entry, in which a condition similar to a condition included in the first entry is set, from entries included in the second reward function information on the basis of the attribute information (Under its broadest reasonable interpretation, this claim element recites a judicial exception, as specifying a second entry is a form of decision-making, which represents a mental process (observations, judgments, evaluations, opinions) that is implementable in the human mind, using a generic computer as a tool to perform the mental process. See MPEP 2106.04(a)(2)(III-C).), and 
the first step includes … a step in which the processor determines whether to allow the reward included in the second entry to be reflected in the reward included in the first entry on the basis of the confidence value included in the first entry and the confidence value included in the second entry (Under its broadest reasonable interpretation, this claim element recites a judicial exception, as determining whether to allow the reward on the basis of the confidence value from the first and second entries represents a mental process (observations, judgments, evaluations, opinions) that is implementable in the human mind, using a generic computer as a tool to perform the mental process. See MPEP 2106.04(a)(2)(III-C).).  
Step 2A Prong 2: This claim further recites:
wherein the at least one computer manages attribute information for managing an attribute of a state constituting a condition of the plurality of pieces of reward function information (This claim element of managing attribute information is directed to a form of insignificant extra-solution activity for use in a claimed process. See MPEP 2106.05(g). This additional element does not add a meaningful limitation to the claim, and hence does not integrate the judicial exception into a practical application.), …
Step 2B: This claim further recites:
wherein the at least one computer manages attribute information for managing an attribute of a state constituting a condition of the plurality of pieces of reward function information (This claim element is directed to storing and retrieving information in memory, which is a well-known, understood, routine, conventional activity, and hence does not add significantly more than the judicial exception, alone or in combination with other elements in the claim. See MPEP 2106.05(d)(II), list 1, example iv.), …
Regarding Claim 10,
Step 1: The claim recites the control method according to claim 9, therefore it falls into one of the four statutory categories (i.e., process, machine, article of manufacture, or composition of matter).
Step 2A Prong 1: This claim is a dependent claim of claim 9, and hence inherits the same abstract ideas mentioned above. This claim further recites the following abstract ideas:
a step in which the processor decides a combination of optimal actions of the first control target by using the first reward function information (Under its broadest reasonable interpretation, this claim element recites a judicial exception, as deciding optimal actions using reward function information represents a mental process (observations, judgments, evaluations, opinions) that is implementable in the human mind, using a generic computer as a tool to perform the mental process. See MPEP 2106.04(a)(2)(III-C).), 
a step in which the processor calculates an association value indicating an association between the condition and an objective variable value for defining the optimal actions of the first control target (Under its broadest reasonable interpretation, this claim element recites a judicial exception, as calculating an association value between a condition and objective variable value is a form of assigning a value, which represents a mental process (observations, judgments, evaluations, opinions) that is implementable in the human mind, using a generic computer as a tool to perform the mental process. See MPEP 2106.04(a)(2)(III-C).), and 
… the processor … calculates the confidence value on the basis of the association value (Under its broadest reasonable interpretation, this claim element recites a judicial exception, as calculating a confidence value based on the association value is a form of assigning a value, which represents a mental process (observations, judgments, evaluations, opinions) that is implementable in the human mind, using a generic computer as a tool to perform the mental process. See MPEP 2106.04(a)(2)(III-C).).  
Step 2A Prong 2: This claim does not recite any additional elements to be further analyzed at this step.
Step 2B: This claim does not recite any additional elements to be further analyzed at this step.
Regarding Claim 11,
Step 1: The claim recites the control method according to claim 10, therefore it falls into one of the four statutory categories (i.e., process, machine, article of manufacture, or composition of matter).
Step 2A Prong 1: This claim is a dependent claim of claim 10, and hence inherits the same abstract ideas mentioned above. This claim further recites the following abstract ideas:
a fourth step in which the processor decides the combination of the optimal actions of the first control target by using the first reward function information (Under its broadest reasonable interpretation, this claim element recites a judicial exception, as deciding optimal actions by using reward function information represents a mental process (observations, judgments, evaluations, opinions) that is implementable in the human mind, using a generic computer as a tool to perform the mental process. See MPEP 2106.04(a)(2)(III-C).), 
a fourth step in which the processor … calculates a contribution value indicating a magnitude of contribution to the objective variable value of the reward included in the first entry (Under its broadest reasonable interpretation, this claim element recites a judicial exception, as calculating (applying) a contribution value based on a magnitude of contribution to an objective variable value represents organizing information and manipulating information through mathematical correlations. See MPEP 2106.04(a)(2)(I-A), example iv.), and 
a fifth step in which the processor updates the confidence value included in the first entry on the basis of the contribution value (Under its broadest reasonable interpretation, this claim element recites a judicial exception, as updating (applying) a confidence value based on a contribution value represents organizing information and manipulating information through mathematical correlations. See MPEP 2106.04(a)(2)(I-A), example iv.).  
Step 2A Prong 2: This claim does not recite any additional elements to be further analyzed at this step.
Step 2B: This claim does not recite any additional elements to be further analyzed at this step.
Regarding Claim 12,
Step 1: The claim recites the control method according to claim 11, therefore it falls into one of the four statutory categories (i.e., process, machine, article of manufacture, or composition of matter).
Step 2A Prong 1: This claim is a dependent claim of claim 11, and hence inherits the same abstract ideas mentioned above. This claim further recites the following abstract ideas:
wherein the fifth step includes a step in which, when the contribution value is a value indicating that contribution of the reward included in the first entry to the objective variable value is small (Under its broadest reasonable interpretation, this claim element recites a judicial exception, as deciding when a contribution value is small represents a mental process (observations, judgments, evaluations, opinions) that is implementable in the human mind, using a generic computer as a tool to perform the mental process. See MPEP 2106.04(a)(2)(III-C).), …
Step 2A Prong 2: This claim further recites:
the processor updates the confidence value included in the first entry to a value indicating that statistical confidence is low (This claim element of updating (setting) the confidence value to a low value is directed to a form of insignificant extra-solution activity for use in a claimed process. See MPEP 2106.05(g). This additional element does not add a meaningful limitation to the claim, and hence does not integrate the judicial exception into a practical application.).
Step 2B: This claim further recites:
the processor updates the confidence value included in the first entry to a value indicating that statistical confidence is low (This claim element is directed to storing and retrieving information in memory, which is a well-known, understood, routine, conventional activity, and hence does not add significantly more than the judicial exception, alone or in combination with other elements in the claim. See MPEP 2106.05(d)(II), list 1, example iv.).

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1-4 and 7-10 are rejected under 35 U.S.C. 103 as being unpatentable over Maehara, Masakazu, U.S. PGPUB 2014/0135952, published 5/15/2014 [hereafter referred to as Maehara] in view of Ribeiro et al., Interaction Models for Multiagent Reinforcement Learning, CIMCA 2008, IAWTIC 2008, and ISE 2008, IEEE, pp.464-469 [hereafter referred to as Ribeiro].
Regarding Claim 1, Maehara teaches
A computer system that performs optimization of each of a plurality of control targets, the computer system comprising:
at least one computer including a processor ([Maehara paragraph [0034]]) and a memory ([Maehara paragraph [0034]]) connected to the processor, 
wherein a plurality of pieces of reward function information for defining rewards ([Maehara Figure 5; paragraph [0039]: pay value is interpreted as a reward (“a pay value based on the state variation obtained from each of the plurality of home devices 2a to 2e.”).]) for states ([Maehara Figure 5; paragraph [0039]: “The agent input values X1a to X1e indicate device information (e.g., a state variation)”]) and actions ([Maehara Figure 5; paragraph [0039]: “control commands Y1a to Y1e”]) of the control targets ([Maehara Figure 5; paragraph [0039]: “plurality of home device agents 30a to 30e”]) is managed for each of the control targets ([Maehara Figure 4, element 37; paragraph [0039]: reinforcement learning manages the state information and behaviors (actions) associated with the home device agents (“managed for each of the control targets”) (“Further, the agent management unit 37 controls the study of the plurality of home devices 2a to 2e using a reinforcement learning operation.”).]), 
the plurality of pieces of reward function information includes first reward function information for defining the reward of a first control target and second reward function information for defining the reward of a second control target ([Maehara Figure 5; paragraph [0039]: pay value (“reward”) for each respective home device agent (“the reward of a first control target”; “the reward of a second control target”); the set of <state information, action, rewards> form reward function information associated with each control target (“first reward function information”; “second reward function information”), with the collective set of reward function information from all home device agents forming “the plurality of pieces of reward function information” (“As a result of the inputting and the outputting, the agent management unit 37 calculates a pay value based on the state variation obtained from each of the plurality of home devices 2a to 2e and updates the value function of the plurality of home device agents 30a to 30e by using the pay value as a parameter.”).]), and 
when updating the first reward function information ([Maehara Figure 4, element 37; paragraph [0039]: applying reinforcement learning to the state/state variation information and behaviors (actions) associated with the home devices (and their corresponding home device agents receiving state and outputting actions) (“… In the reinforcement learning operation (the value function), a learning method, which is applicable to the continuous state space and behavior space, may be used.”).] [Maehara Figure 4, element 37; paragraph [0041]: calculating a pay value (“reward”) and updating the value function in turn updates the reward function information for all home device agents (“when updating the first reward function information”) (“The agent management unit 37 calculates the pay value by using the difference in the power consumption as a state variation of home device obtained from the plurality of home devices 2a to 2e, and updates the value function of the plurality of home device agents 30a to 30e to minimize the power consumption in the plurality of home devices 2a to 2e by using the pay value as the parameter.”).]), the processor 
…
decides an optimal action of the first control target by using the first reward function information ([Maehara Figure 6, elements S7, S8, S9; paragraphs [0048]-[0049]: determining optimal behavior (“decides an optimal action …”) for each home device agent (“first control target”) based on the pay value (“reward”) from each home device agent through reinforcement learning and evaluation of each agent’s respective evaluation function (interpreted as the value function from Maehara paragraph [0040]) (“ … using the first reward function information”) (“When it is determined that the state change information (agent input value) is an optimization element (e.g., power consumption in the present exemplary embodiment) for an operation of the home device agents 30a to 30e (operation S7), the agent management unit 37 determines a pay value as a numerical value which increases as the pay value gets closer to the target value for an optimization element, and provides the pay value to the home device agents 30a to 30e to update the evaluation function of the home device agents 30a to 30e (operation S8). … When the state change information (agent input value) is information other than the optimization element (e.g., power consumption), the state change information is input to the home device agents 30a to 30e of all the home devices as a simple state change, and control commands Y1a to Y1e, which are issued in order to obtain optimal behavior, is obtained from the value function of each of the device agents 30a to 30e (operation S9). Furthermore, the control commands Y1a to Y1e for optimal behavior are transmitted to the output conversion unit 34.”).]).  
However, Maehara does not teach
… compares 
specifies a reward, which is reflected in the first reward function information from rewards set in the second reward function information, 
updates the first reward function information on the basis of the specified reward, …
Ribeiro teaches
… compares ([Ribeiro p.464 col.1 Section 1. Introduction, 1st paragraph: “Multiagent System (MAS) [8] may be formed by adaptive agents which interact and cooperate for the resolution of certain tasks using Reinforcement Learning (RL) algorithms.”] [Ribeiro p.465 col.1 Section 2. Cooperative Learning in MAS, 1st-3rd paragraphs: Markovian decision process (MDP) specifies <state S, action A, state transition probability $\partial_{s,s'}^{a}$, reward $R_{s,s'}^{a}$, …> for each agent (“…we introduce briefly the MDP used to describe our environment. A MDP is a tuple (S, A, $\partial_{s,s'}^{a}$, $R_{s,s'}^{a}$, γ) where s ∈ S is a state that can be composed into a sequence of state variables = <x1,x2,…,xv>. An episode is a sequence of actions a ∈ A that leads the agent from a state $s_{initial}$ to a state $s_{goal}$. $\partial_{s,s'}^{a}$ is a function that indicates the probability that the agent arrives in state s’ when an action a is applied in state s. Similarly, $R_{s,s'}^{a}$ is the reward received whenever the transition $\partial_{s,s'}^{a}$ occurs. …”).] [Ribeiro p.465 col.2 Section 3. Interaction Model, 1st paragraph: partial action policy Qi from each agent contains information about the environment (e.g., <state S, action A, state transition probability $\partial_{s,s'}^{a}$, reward $R_{s,s'}^{a}$, …> for each agent) (“first reward function information”; “second reward function information”) (“The interaction in MA-RL can produce a refined set of behaviors obtained from the agents’ actions. Part of the behavioral set (i.e., a global action policy) is shared by the agents through a Partial Action Policy (Qi). Usually such partial policies contain incomplete information about the environment, but with an adequate interaction model they can be unified to maximize the sum of the partial rewards obtained during the learning process.”).] [Ribeiro p.466 col.1 1st paragraph (Section 3. Interaction Model); p.466 Algorithm 1; p.467 Algorithms 2 and 3: referring to MA-RL Algorithm 1 lines 10-17 at each step loop, each agent calculates a reward value based on its respective Qi and shares its reward function information (where Qi contains <state S, action A, state transition probability $\partial_{s,s'}^{a}$, reward $R_{s,s'}^{a}$, …>; see Algorithm 1 lines 14-15), where cooperate() invokes an update_policy() with a cost function (see [Ribeiro p.467 Algorithm 3 lines 1-7]) that performs a comparison of the Qi from all agents against an agent that currently represents an optimal policy (see [Ribeiro p.466 Figure 1]); the fact that each agent performs the same cost comparison against an agent that represents an optimal policy in the same state is an indirect form of comparing Qi between any two agents at a specified state (“compares the conditions of the reward function information with each other”) (“Algorithm 1 presents the share_policy function which shares the agents’ learning information. … The best rewards are sent out to the GAP forming a set of the best acquired rewards by the agents. These rewards will be further shared with the other agents. … To estimate GAP with the best rewards, we will use a cost function which finds the best path between the initial states and goal state for a given policy. … We assume that A* produces a generative model governing the optimal policy Q*. We consider a policy as optimal when the number of right hits that an agent can obtain in a certain environment is the maximum possible. A right hit is obtained when the agent has the capacity of finding the goal-state with the lowest possible cost (relative to the cost provided by the A*). The cost is defined by the number of steps the agent needs to reach the goal-state and the sum of the existing costs in the path between each initial state and the goal-state [12]. Figure 1 shows a representative diagram to illustrate interaction among the agents.”).]), 
specifies a reward, which is reflected in the first reward function information from rewards set in the second reward function information ([Ribeiro p.467 col.1 last paragraph (Section 3.1. Cooperation Models); Algorithm 3: referring to Algorithm 3 lines 1-7, after performing the cost comparison, update_policy() identifies a Qi from any agent i that represents the lowest cost as a result of the comparison, and provides the corresponding calculated reward value $\hat{Q}_i$ to the global action policy table (“specifies a reward …”), with the “reflected” action stated in the claim being interpreted as the act of performing the comparison in the update_policy() function for all agents and for all Qi (“ … reflected in the first reward function information from rewards set in the second reward function information”) (“…each agent sends out the value of the $\hat{Q}_i$ to the GAP. If the reward value is suitable, i.e. it improves the efficiency of the other agents for the same state (algorithm 3 line 3) the agents will then share these rewards (algorithm 3 line 4). …”).]), 
updates the first reward function information on the basis of the specified reward ([Ribeiro p.466 Figure 1; p.466 col.1 2nd paragraph (Section 3.1. Cooperation Models); p.466 col.2 Algorithm 1: referring to Figure 1 and Algorithm 1 lines 18-20, the agent that currently represents the optimal policy shares the reward (“specified reward”) it learned from the update_policy() function with the other agents, effectively updating their respective reward function information (“updates the first reward function information on the basis of the specified reward”) (“The diagram in Figure 1 shows how the agents keep up with the knowledge all along the interaction. The agent i employs the Q-learning algorithm to generate and store the rewards in $\hat{Q}_i$. When the Agent A* receives the rewards, it proceeds as follows: when the agent i goes from an initial state to the goal-state with the lowest cost, the agent will thus be able to share these rewards (accumulated rewards according to algorithm 2) with other agents using a cooperation model. When carried out the rewards exchange of each partial policy Qi, the agents can update their knowledge and interact into the environment using the GAP.”).] [Ribeiro p.465 col.2 Section 3. Interaction Model, 1st paragraph: the act of running the MA-RL algorithm on multiple agents through sharing and accumulating the best rewards is interpreted as performing a decision-making based on the Qi learned from each agent (“decides an optimal action … by using … reward function information”) (“The action policies are generated by the multiagent Q-learning algorithm, accumulating rewards and making the agents to converge to the optimal policy Q*. When policies Q1,…,Qx are unified, it is possible to come up a new policy namely Global Action Policy (GAP = {GAP1,…,GAPx}), in which GAPi denotes the best rewards acquired by the agent i during the learning process.”).]), …
Both Maehara and Ribeiro are analogous art since both teach the use of reinforcement learning in multi-agent systems.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the multi-agent reinforcement learning of Maehara and expand upon it by applying the multi-agent interaction and reward-sharing techniques of Ribeiro as a way to exchange rewards and determine optimal actions in a multi-agent system. The motivation is taught in Ribeiro: exchanging rewards and determining optimal actions in a multi-agent system is a complex task, and using the cited techniques to distribute the learning and foster cooperation allows agents to converge to a solution in a faster and more efficient manner ([Ribeiro p.464 col.1 Abstract: “The exchange of rewards among the agents during the interaction is a complex task and if it is inadequate it may cause delays in learning or generate unexpected transitions, making the cooperation inefficient and converging to a non-satisfactory policy. In order to allow the interactive discovery of high quality policies we have developed several cooperation models based on the exchange of action policies between the agents. Experimental results have shown that the proposed cooperation models are able to speed up the convergence of the agents while achieving optimal action policies even in high-dimensional environments (e.g. traffic), outperforming the standard Q-learning algorithm.”]).
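For illustration only, the comparison-and-sharing step relied upon above can be sketched as follows. This is a minimal sketch under stated assumptions, not a reproduction of Ribeiro Algorithms 1-3: the names build_global_policy, share_rewards, and best_action are hypothetical, each agent's partial policy is assumed to be a simple table keyed by (state, action), and a plain "higher reward wins" comparison stands in for the cost function described in the reference.

```python
from typing import Dict, List, Tuple

State = str
Action = str
# Hypothetical representation: one reward entry per (state, action) condition.
QTable = Dict[Tuple[State, Action], float]


def build_global_policy(agent_tables: List[QTable]) -> QTable:
    """Keep, for each (state, action) condition, the best reward held by any agent."""
    gap: QTable = {}
    for table in agent_tables:
        for key, reward in table.items():
            # Compare entries that share the same condition across agents.
            if key not in gap or reward > gap[key]:
                gap[key] = reward
    return gap


def share_rewards(agent_tables: List[QTable]) -> List[QTable]:
    """Return updated copies of each agent's table reflecting the shared best rewards."""
    gap = build_global_policy(agent_tables)
    return [{**table, **gap} for table in agent_tables]


def best_action(table: QTable, state: State) -> Action:
    """Pick the action with the highest reward recorded for the given state."""
    candidates = {a: r for (s, a), r in table.items() if s == state}
    return max(candidates, key=candidates.get)


if __name__ == "__main__":
    first = {("s0", "left"): 1.0, ("s0", "right"): 0.2}
    second = {("s0", "right"): 2.5}
    updated_first, updated_second = share_rewards([first, second])
    print(best_action(first, "s0"), best_action(updated_first, "s0"))
```

In this sketch, the first agent's table is updated with a reward taken from the second agent's table, and the action selected for the first agent changes accordingly.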
Regarding Claim 2, Maehara in view of Ribeiro teaches
The computer system according to claim 1, 
wherein the reward function information includes an entry constituted by a condition defined from at least one of 
the state and the action ([Maehara Figure 5; paragraph [0039]] [Ribeiro p.465 col.2 Section 3. Interaction Model, 1st paragraph: partial action policy Qi from each agent contains information about the environment (e.g., <state S, action A, state transition probability $\partial_{s,s'}^{a}$, reward $R_{s,s'}^{a}$, …> for each agent), where S represents the state, and A represents the action (see [Ribeiro p.465 col.1 Section 2. Cooperative Learning in MAS, 1st-3rd paragraphs]).]), 
the reward given when the condition is satisfied ([Maehara Figure 5; paragraph [0039]] [Ribeiro p.465 col.2 Section 3. Interaction Model, 1st paragraph: partial action policy Qi from each agent contains information about the environment (e.g., <state S, action A, state transition probability $\partial_{s,s'}^{a}$, reward $R_{s,s'}^{a}$, …> for each agent), where $R_{s,s'}^{a}$ represents the reward given when the condition is satisfied (see [Ribeiro p.465 col.1 Section 2. Cooperative Learning in MAS, 1st-3rd paragraphs]).]), and 
a confidence value indicating statistical confidence of the reward ([Ribeiro p.465 col.1 Section 2. Cooperative Learning in MAS, 1st-3rd paragraphs: $\partial_{s,s'}^{a}$ represents the state transition probability of an agent arriving at state s’ when an action a is applied in state s, with the optimal policy consisting of a set of optimal paths, where each optimal path is represented by π(s,a) for each state-action transition (“…we introduce briefly the MDP used to describe our environment. A MDP is a tuple (S, A, $\partial_{s,s'}^{a}$, $R_{s,s'}^{a}$, γ) where s ∈ S is a state that can be composed into a sequence of state variables = <x1,x2,…,xv>. An episode is a sequence of actions a ∈ A that leads the agent from a state $s_{initial}$ to a state $s_{goal}$. $\partial_{s,s'}^{a}$ is a function that indicates the probability that the agent arrives in state s’ when an action a is applied in state s. Similarly, $R_{s,s'}^{a}$ is the reward received whenever the transition $\partial_{s,s'}^{a}$ occurs … A RL agent must learn a policy π : S x A that maximizes its expected cumulative reward [1], where π (s,a) is the probability of selecting action a from state s.”).] [Ribeiro p.466 col.1 1st paragraph (Section 3. Interaction Model); p.466 Algorithm 1; p.467 Algorithms 2 and 3: referring to MA-RL Algorithm 1 lines 10-17 at each step loop, each agent calculates a reward value based on its respective Qi and shares its reward function information (where Qi contains <state S, action A, state transition probability $\partial_{s,s'}^{a}$, reward $R_{s,s'}^{a}$, …>; see Algorithm 1 lines 14-15), where cooperate() invokes an update_policy() with a cost function (see [Ribeiro p.467 Algorithm 3 lines 1-7]) that performs a comparison of the Qi from all agents against an agent that currently represents an optimal policy (see [Ribeiro p.466 Figure 1]), with the cost function analyzing each Qi (which includes reward $R_{s,s'}^{a}$ and state transition probability $\partial_{s,s'}^{a}$) at a given state s to determine the best path from that state s; the result returned from the cost function in the process of selecting a best path represents a form of confidence value based on a statistical calculation over each respective Qi (“Algorithm 1 presents the share_policy function which shares the agents’ learning information. … The best rewards are sent out to the GAP forming a set of the best acquired rewards by the agents. These rewards will be further shared with the other agents. … To estimate GAP with the best rewards, we will use a cost function which finds the best path between the initial states and goal state for a given policy. … We assume that A* produces a generative model governing the optimal policy Q*. We consider a policy as optimal when the number of right hits that an agent can obtain in a certain environment is the maximum possible. A right hit is obtained when the agent has the capacity of finding the goal-state with the lowest possible cost (relative to the cost provided by the A*). The cost is defined by the number of steps the agent needs to reach the goal-state and the sum of the existing costs in the path between each initial state and the goal-state [12]. Figure 1 shows a representative diagram to illustrate interaction among the agents.”).]), and 
the processor 
specifies two entries to be compared with reference to the condition of the first reward function information and the condition of the second reward function information ([Ribeiro p.466 col.1 1st paragraph (Section 3. Interaction Model); p.466 Algorithm 1; p.467 Algorithms 2 and 3: referring to MA-RL Algorithm 1 lines 10-17 at each step loop, each agent calculates a reward value based on their respective Qi (“condition of the first reward function information”; “condition of the second reward function information”) and shares their reward function information (where Qi contains <state S, action A, state transitional probability $\partial_{s,s'}^{a}$, reward $R_{s,s'}^{a}$, …>; see Algorithm 1 lines 14-15), where cooperate() invokes an update_policy() with a cost function (see [Ribeiro p.467 Algorithm 3 lines 1-7]) that performs a comparison of the Qi from all agents against an agent that currently represents an optimal policy (see [Ribeiro p.466 Figure 1]); the fact that each agent performs the same cost comparison against an agent that represents an optimal policy on the same state is an indirect form of comparing Qi between any two agents at a specified state (“specifies two entries to be compared with reference to the condition of the first reward function information and the condition of the second reward function information”) (“Algorithm 1 presents the share_policy function which shares the agents’ learning information. … The best rewards are sent out to the GAP forming a set of the best acquired rewards by the agents. These rewards will be further shared with the other agents. … To estimate GAP with the best rewards, we will use a cost function which finds the best path between the initial states and goal state for a given policy. … We assume that A* produces a generative model governing the optimal policy Q*. We consider a policy as optimal when the number of right hits that an agent can obtain in a certain environment is the maximum possible. A right hit is obtained when the agent has the capacity of finding the goal-state with the lowest possible cost (relative to the cost provided by the A*). The cost is defined by the number of steps the agent needs to reach the goal-state and the sum of the existing costs in the path between each initial state and the goal-state [12]. Figure 1 shows a representative diagram to illustrate interaction among the agents.”).]),
determines whether to allow the reward included in an entry specified from the second reward function information to be reflected as the reward included in an entry specified from the first reward function information on the basis of the confidence value of the specified two entries ([Ribeiro p.465 col.1 Section 2. Cooperative Learning in MAS, 1st-3rd paragraphs: $\partial_{s,s'}^{a}$ represents the state transition probability of an agent arriving at state s’ when an action a is applied in state s, with the optimal policy consisting of a set of optimal paths, where each optimal path is represented by π(s,a) for each state-action transition (“…we introduce briefly the MDP used to describe our environment. A MDP is a tuple (S, A, $\partial_{s,s'}^{a}$, $R_{s,s'}^{a}$, γ) where s ∈ S is a state that can be composed into a sequence of state variables = <x1,x2,…,xv>. An episode is a sequence of actions a ∈ A that leads the agent from a state $s_{initial}$ to a state $s_{goal}$. $\partial_{s,s'}^{a}$ is a function that indicates the probability that the agent arrives in state s’ when an action a is applied in state s. Similarly, $R_{s,s'}^{a}$ is the reward received whenever the transition $\partial_{s,s'}^{a}$ occurs … A RL agent must learn a policy π : S x A that maximizes its expected cumulative reward [1], where π (s,a) is the probability of selecting action a from state s.”).] [Ribeiro p.466 col.1 1st paragraph (Section 3. Interaction Model); p.466 Algorithm 1; p.467 Algorithms 2 and 3: referring to MA-RL Algorithm 1 lines 10-17 at each step loop, each agent calculates a reward value based on their respective Qi and shares their reward function information (where Qi contains <state S, action A, state transitional probability $\partial_{s,s'}^{a}$, reward $R_{s,s'}^{a}$, …>; see Algorithm 1 lines 14-15), where cooperate() invokes an update_policy() with a cost function (see [Ribeiro p.467 Algorithm 3 lines 1-7]) that performs a comparison of the Qi from all agents against an agent that currently represents an optimal policy (see [Ribeiro p.466 Figure 1]), with the cost function analyzing each Qi (which includes reward $R_{s,s'}^{a}$ and state transitional probability $\partial_{s,s'}^{a}$) at a given state s to determine the best optimal path from that state s (which reflects a form of probability/statistical analysis); hence the result returned from the cost function in the process of selecting a best path is interpreted as a “high” confidence value based on a statistical calculation based on each respective Qi (“Algorithm 1 presents the share_policy function which shares the agents’ learning information. … The best rewards are sent out to the GAP forming a set of the best acquired rewards by the agents. These rewards will be further shared with the other agents. … To estimate GAP with the best rewards, we will use a cost function which finds the best path between the initial states and goal state for a given policy. … We assume that A* produces a generative model governing the optimal policy Q*. We consider a policy as optimal when the number of right hits that an agent can obtain in a certain environment is the maximum possible. A right hit is obtained when the agent has the capacity of finding the goal-state with the lowest possible cost (relative to the cost provided by the A*). The cost is defined by the number of steps the agent needs to reach the goal-state and the sum of the existing costs in the path between each initial state and the goal-state [12]. Figure 1 shows a representative diagram to illustrate interaction among the agents.”).] [Ribeiro p.467 col.1 last paragraph (Section 3.1. Cooperation Models); Algorithm 3: referring to Algorithm 3 lines 1-7, after performing the cost comparison, update_policy() identifies a Qi from any agent i that represents the lowest cost as a result of the comparison, and provides the corresponding calculated reward value $\hat{Q}_i$ to the global action policy table (“specifies a reward”), with the “reflected” action stated in the claim interpreted as the act of performing the comparison performed in the update_policy() function for all agents and for all Qi (“determines whether to allow the reward …to be reflected … on the basis of the confidence value of the specified two entries”) (“…each agent sends out the value of the $\hat{Q}_i$ to the GAP. If the reward value is suitable, i.e. it improves the efficiency of the other agents for the same state (algorithm 3 line 3) the agents will then share these rewards (algorithm 3 line 4). …”).]), and
sets the reward included in the entry specified from the second reward function information in the entry specified from the first reward function information when the reward included in the entry specified from the second reward function information is reflected as the reward included in the entry specified from the first reward function information ([Ribeiro p.466 col.1 2nd paragraph (Section 3.1. Cooperation Models); p.466 Algorithm 1: referring to Figure 1 and Algorithm 1 lines 18-20, the agent that currently represents the optimal policy shares the reward it learned from the update_policy() function with the other agents (“sets the reward …”) (“The diagram in Figure 1 shows how the agents keep up with the knowledge all along the interaction. The agent i employs the Q-learning algorithm to generate and store the rewards in $\hat{Q}_i$. When the Agent A* receives the rewards, it proceeds as follows: when the agent i goes from an initial state to the goal-state with the lowest cost, the agent will thus be able to share these rewards (accumulated rewards according to algorithm 2) with other agents using a cooperation model. When carried out the rewards exchange of each partial policy Qi, the agents can update their knowledge and interact into the environment using the GAP.”).] [Ribeiro p.465 col.2 Section 3. Interaction Model, 1st paragraph: the act of running the MA-RL algorithm on multiple agents through sharing and accumulating the best rewards is a form of decision-making based on the Qi learned from each agent (“decides an optimal action … using … reward function information”) (“The action policies are generated by the multiagent Q-learning algorithm, accumulating rewards and making the agents to converge to the optimal policy Q*. When policies Q1,…,Qx are unified, it is possible to come up a new policy namely Global Action Policy (GAP = {GAP1,…,GAPx}), in which GAPi denotes the best rewards acquired by the agent i during the learning process.”).]).
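For illustration only, the cost-based comparison and reward sharing relied upon in the mapping above can be sketched in Python. This is a minimal sketch under stated assumptions and is not a reproduction of Ribeiro's Algorithms 1-3: the names Key, Entry, and update_policy_sketch, the per-entry cost field, and the example tables are hypothetical and are used only to show a lowest-cost entry for a given state and action being carried into a shared table.

```python
from typing import Dict, Tuple

Key = Tuple[str, str]        # hypothetical (state, action) condition
Entry = Tuple[float, float]  # hypothetical (reward, cost); lower cost is better


def update_policy_sketch(shared: Dict[Key, Entry],
                         agent_table: Dict[Key, Entry]) -> Dict[Key, Entry]:
    """Return a new shared table that keeps, per key, the lowest-cost entry."""
    merged = dict(shared)  # copy so the caller's table is not mutated
    for key, (reward, cost) in agent_table.items():
        # Reflect the agent's reward only when it is new or strictly cheaper.
        if key not in merged or cost < merged[key][1]:
            merged[key] = (reward, cost)
    return merged


# Two hypothetical agents with entries for the same conditions.
agent_1 = {("s0", "right"): (1.0, 5.0), ("s1", "up"): (0.5, 3.0)}
agent_2 = {("s0", "right"): (2.0, 4.0), ("s1", "up"): (0.1, 6.0)}

shared_table: Dict[Key, Entry] = {}
for table in (agent_1, agent_2):
    shared_table = update_policy_sketch(shared_table, table)
```

In this sketch the entry for ("s0", "right") is taken from agent_2 because its cost is lower, while the entry for ("s1", "up") remains that of agent_1.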
Regarding Claim 3, Maehara in view of Ribeiro teaches
The computer system according to claim 2, 
wherein the at least one computer manages attribute information for managing an attribute of a state constituting a condition of the plurality of pieces of reward function information ([Maehara paragraph [0044]: assigning additional identification to agent input values (profile information or state change information) (“Next, the input conversion unit 33 obtains agent input values X1a to X1e from input protocols Xa to Xe (operation S2) by using the protocol analysis unit 35. The input conversion unit 33 determines whether the agent input values X1a to X1e are profile information of home devices 2a to 2e or state change information related to the state change of the home devices (operation S3).”).] [Maehara paragraph [0048]: detecting an attribute (power consumption) associated with state change information (“attribute information for managing an attribute of a state”) (“When it is determined that the state change information (agent input value) is an optimization element (e.g., power consumption in the present exemplary embodiment) for an operation of the home device agents 30a to 30e…”).]), and
the processor 
selects a first entry from the first reward function information, specifies a second entry, in which a condition similar to a condition included in the first entry is set, from entries included in the second reward function information on the basis of the attribute information ([Maehara paragraph [0048]: detecting an attribute (power consumption) associated with state change information (“When it is determined that the state change information (agent input value) is an optimization element (e.g., power consumption in the present exemplary embodiment) for an operation of the home device agents 30a to 30e…”).] [Maehara Figure 6, elements S7, S8, S9; paragraphs [0048]-[0049]: determining optimal behavior for each home device agent based on the pay value (“reward”) from each home device agent through reinforcement learning and evaluation of each agent’s respective evaluation function (interpreted as the value function from Maehara paragraph [0040]) (“… on the basis of the attribute information”) (“When it is determined that the state change information (agent input value) is an optimization element (e.g., power consumption in the present exemplary embodiment) for an operation of the home device agents 30a to 30e (operation S7), the agent management unit 37 determines a pay value as a numerical value which increases as the pay value gets closer to the target value for an optimization element, and provides the pay value to the home device agents 30a to 30e to update the evaluation function of the home device agents 30a to 30e (operation S8). … When the state change information (agent input value) is information other than the optimization element (e.g., power consumption), the state change information is input to the home device agents 30a to 30e of all the home devices as a simple state change, and control commands Y1a to Y1e, which are issued in order to obtain optimal behavior, is obtained from the value function of each of the device agents 30a to 30e (operation S9).
Furthermore, the control commands Y1a to Y1e for optimal behavior are transmitted to the output conversion unit 34.”).] [Ribeiro p.466 col.1 1st paragraph (Section 3. Interaction Model); p.466 Algorithm 1; p.467 Algorithms 2 and 3: referring to MA-RL Algorithm 1 lines 10-17 at each step loop, each agent calculates a reward value based on their respective Qi (“selects a first entry from the first reward function information, specifies a second entry, … from entries included in the second reward function information …”) and shares their reward function information (where Qi contains <state S, action A, state transitional probability $\partial_{s,s'}^{a}$, reward $R_{s,s'}^{a}$, …>; see Algorithm 1 lines 14-15), where cooperate() invokes an update_policy() with a cost function (see [Ribeiro p.467 Algorithm 3 lines 1-7]) that performs a comparison of the Qi from all agents against an agent that currently represents an optimal policy (see [Ribeiro p.466 Figure 1]); the fact that each agent performs the same cost comparison against an agent that represents an optimal policy on the same state is an indirect form of comparing Qi between any two agents at a specified state (“Algorithm 1 presents the share_policy function which shares the agents’ learning information. … The best rewards are sent out to the GAP forming a set of the best acquired rewards by the agents. These rewards will be further shared with the other agents. … To estimate GAP with the best rewards, we will use a cost function which finds the best path between the initial states and goal state for a given policy. … We assume that A* produces a generative model governing the optimal policy Q*. We consider a policy as optimal when the number of right hits that an agent can obtain in a certain environment is the maximum possible. A right hit is obtained when the agent has the capacity of finding the goal-state with the lowest possible cost (relative to the cost provided by the A*). The cost is defined by the number of steps the agent needs to reach the goal-state and the sum of the existing costs in the path between each initial state and the goal-state [12]. Figure 1 shows a representative diagram to illustrate interaction among the agents.”).]), and
determines whether to allow the reward included in the second entry to be reflected in the reward included in the first entry on the basis of the confidence value included in the first entry and the confidence value included in the second entry ([Ribeiro p.466 col.1 1st paragraph (Section 3. Interaction Model): cost function analyzes each Qi (which includes reward $R_{s,s'}^{a}$ and state transitional probability $\partial_{s,s'}^{a}$) at a given state s to determine the best optimal path from that state s (which reflects a form of probability/statistical analysis); hence the result returned from the cost function in the process of selecting a best path is interpreted as a “high” confidence value based on a statistical calculation based on each respective Qi (“determines whether to allow the reward … on the basis of the confidence value included in the first entry and the confidence value included in the second entry”) (“Algorithm 1 presents the share_policy function which shares the agents’ learning information. … The best rewards are sent out to the GAP forming a set of the best acquired rewards by the agents. These rewards will be further shared with the other agents. … To estimate GAP with the best rewards, we will use a cost function which finds the best path between the initial states and goal state for a given policy. … We assume that A* produces a generative model governing the optimal policy Q*. We consider a policy as optimal when the number of right hits that an agent can obtain in a certain environment is the maximum possible. A right hit is obtained when the agent has the capacity of finding the goal-state with the lowest possible cost (relative to the cost provided by the A*). The cost is defined by the number of steps the agent needs to reach the goal-state and the sum of the existing costs in the path between each initial state and the goal-state [12]. Figure 1 shows a representative diagram to illustrate interaction among the agents.”).]).
Regarding Claim 4, Maehara in view of Ribeiro teaches
The computer system according to claim 3, 
wherein the processor 
decides a combination of optimal actions of the first control target by using the first reward function information ([Ribeiro p.465 col.2 Section 3. Interaction Model, 1st paragraph: a global action policy determined through a collection of best rewards received from each agent (“… optimal actions of the first control target by using the first reward function information”) (“When policies Q1,…,Qx are unified, it is possible to come up a new policy namely Global Action Policy (GAP = {GAP1,…,GAPx}), in which GAPi denotes the best rewards acquired by the agent i during the learning process.”).] [Ribeiro p.466 col.2 last paragraph: (“a Global Action Policy GAP = {GAP1,…, GAPx}, where GAPi represents the partial policy of the agent i;”).] [Ribeiro p.466 col.1 2nd paragraph (Section 3.1. Cooperation Models); p.466 Algorithm 1: referring to Figure 1 and Algorithm 1 lines 18-20, the agent that currently represents the optimal policy shares the reward it learned from the update_policy() function with the other agents; running through MA-RL Algorithm 1 (see [Ribeiro p.466 col.1 1st paragraph (Section 3. Interaction Model); p.466 Algorithm 1; p.467 Algorithms 2 and 3]) will eventually determine the global action policy (“decides a combination of optimal actions of the first control target by using the first reward function information”) (“The diagram in Figure 1 shows how the agents keep up with the knowledge all along the interaction. The agent i employs the Q-learning algorithm to generate and store the rewards in $\hat{Q}_i$. When the Agent A* receives the rewards, it proceeds as follows: when the agent i goes from an initial state to the goal-state with the lowest cost, the agent will thus be able to share these rewards (accumulated rewards according to algorithm 2) with other agents using a cooperation model. When carried out the rewards exchange of each partial policy Qi, the agents can update their knowledge and interact into the environment using the GAP.”).]),
calculates an association value indicating an association between the condition and an objective variable value for defining the optimal actions of the first control target ([Ribeiro p.466 col.1 1st paragraph (Section 3. Interaction Model): the cost function takes as input Qi and a specified state (see [Ribeiro p.467 Algorithm 3 lines 1-7]), and the cost depends on the number of steps between the specified state and the goal state; this dependence is interpreted as an association between the condition (represented by the specified state) and an objective variable value (interpreted as the goal-state) (“calculates an association value …”) (“The cost is defined by the number of steps the agent needs to reach the goal-state and the sum of the existing costs in the path between each initial state and the goal-state [12].”).]), and
calculates the confidence value on the basis of the association value ([Ribeiro p.466 col.1 1st paragraph (Section 3. Interaction Model): cost function analyzes each Qi (which includes reward $R_{s,s'}^{a}$ and state transitional probability $\partial_{s,s'}^{a}$) at a given state s to determine the best optimal path from that state s (which reflects a form of probability/statistical analysis); hence the result returned from the cost function in the process of selecting a best path is interpreted as a “high” confidence value based on a statistical calculation based on each respective Qi, with the result of the cost function based on the association value (“calculates the confidence value on the basis of the association value”) (“Algorithm 1 presents the share_policy function which shares the agents’ learning information. … The best rewards are sent out to the GAP forming a set of the best acquired rewards by the agents. These rewards will be further shared with the other agents. … To estimate GAP with the best rewards, we will use a cost function which finds the best path between the initial states and goal state for a given policy. … We assume that A* produces a generative model governing the optimal policy Q*. We consider a policy as optimal when the number of right hits that an agent can obtain in a certain environment is the maximum possible. A right hit is obtained when the agent has the capacity of finding the goal-state with the lowest possible cost (relative to the cost provided by the A*). The cost is defined by the number of steps the agent needs to reach the goal-state and the sum of the existing costs in the path between each initial state and the goal-state [12]. Figure 1 shows a representative diagram to illustrate interaction among the agents.”).]).
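For illustration only, the cost definition quoted from Ribeiro (number of steps to the goal-state plus the costs along the path) and the “right hit” test relative to a reference cost can be sketched in Python. This is a minimal sketch under stated assumptions: combining the step count and the path costs by simple addition, the tie-inclusive comparison, and the names path_cost and is_right_hit are hypothetical choices made for this example and are not taken from Ribeiro.

```python
from typing import Sequence


def path_cost(step_costs: Sequence[float]) -> float:
    """Cost of a path given the cost of each step along it.

    Assumption: the number of steps and the summed per-step costs
    are combined by addition.
    """
    return len(step_costs) + sum(step_costs)


def is_right_hit(agent_cost: float, reference_cost: float) -> bool:
    """Assumption: a 'right hit' means the agent's cost does not exceed
    the reference cost (for example, a cost supplied by A*)."""
    return agent_cost <= reference_cost


agent_path = [1.0, 2.0, 0.5]   # three steps with hypothetical per-step costs
reference_path = [1.0, 1.0]    # two steps on a hypothetical reference path

agent_cost = path_cost(agent_path)          # 3 steps + 3.5 = 6.5
reference_cost = path_cost(reference_path)  # 2 steps + 2.0 = 4.0
right_hit = is_right_hit(agent_cost, reference_cost)
```

Under these assumptions a lower cost corresponds to a path closer to the reference, which is the quantity the rejection above interprets as the basis of the confidence value.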
Regarding Claim 7, Maehara teaches
A control method in a computer system that performs optimization of each of a plurality of control targets, 
the computer system including at least one computer including a processor and a memory connected to the processor (This claim element is similar in scope to a corresponding claim element in Claim 1, and hence is rejected under similar rationale.), 
a plurality of pieces of reward function information for defining rewards for states and actions of the control targets being managed for each of the control targets (This claim element is similar in scope to a corresponding claim element in Claim 1, and hence is rejected under similar rationale.), 
the plurality of pieces of reward function information including first reward function information for defining the reward of a first control target and second reward function information for defining the reward of a second control target (This claim element is similar in scope to a corresponding claim element in Claim 1, and hence is rejected under similar rationale.), 
the control method comprising:
…
a third step in which the processor decides an optimal action of the first control target by using the first reward function information (This claim element is similar in scope to a corresponding claim element in Claim 1, and hence is rejected under similar rationale.).  
However, Maehara does not teach
… 
a first step in which, when updating the first reward function information, the processor 
compares (This claim element is similar in scope to a corresponding claim element in Claim 1, and hence is rejected under similar rationale.), 
specifies a reward, which is reflected in the first reward function information from rewards set in the second reward function information (This claim element is similar in scope to a corresponding claim element in Claim 1, and hence is rejected under similar rationale.);
a second step in which the processor updates the first reward function information on the basis of the specified reward (This claim element is similar in scope to a corresponding claim element in Claim 1, and hence is rejected under similar rationale.); …
Ribeiro teaches
a first step in which, when updating the first reward function information (Under its broadest reasonable interpretation, this claim limitation in a method claim recites a contingent clause whose subsequent claim language need not be performed, because the condition precedent (“when updating the first reward function information”) is not required to be met, and the claimed invention can be practiced without the condition occurring. See MPEP 2111.04(II). Applicant is advised to amend the claim to positively recite the condition as being fulfilled, since no patentable weight is given to the claim language following a contingent clause that does not require the condition to be fulfilled for practicing the claimed invention. However, for the purposes of examination, this contingent clause will be treated as if the condition were fulfilled.), the processor 
compares (This claim element is similar in scope to a corresponding claim element in Claim 1, and hence is rejected under similar rationale.), 
specifies a reward, which is reflected in the first reward function information from rewards set in the second reward function information (This claim element is similar in scope to a corresponding claim element in Claim 1, and hence is rejected under similar rationale.);
a second step in which the processor updates the first reward function information on the basis of the specified reward (This claim element is similar in scope to a corresponding claim element in Claim 1, and hence is rejected under similar rationale.); …
Both Maehara and Ribeiro are analogous art since both teach the use of reinforcement learning in multi-agent systems.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the multi-agent reinforcement learning of Maehara and expand upon it by applying the multi-agent interactions and rewards-sharing techniques of Ribeiro as a way to exchange rewards and determine optimal actions in a multi-agent system. The motivation is taught in Ribeiro, as exchanging rewards and determining optimal actions in a multi-agent system is a complex task; using the current techniques to distribute the learning and foster cooperation allows agents to converge to a solution in a faster and more efficient manner, thus improving the performance of the system ([Ribeiro p.464 col.1 Abstract: “The exchange of rewards among the agents during the interaction is a complex task and if it is inadequate it may cause delays in learning or generate unexpected transitions, making the cooperation inefficient and converging to a non-satisfactory policy. In order to allow the interactive discovery of high quality policies we have developed several cooperation models based on the exchange of action policies between the agents. Experimental results have shown that the proposed cooperation models are able to speed up the convergence of the agents while achieving optimal action policies even in high-dimensional environments (e.g. traffic), outperforming the standard Q-learning algorithm.”]).
Regarding Claim 8, Maehara in view of Ribeiro teaches
The control method according to claim 7, 
wherein the reward function information includes an entry constituted by a condition defined from at least one of 
the state and the action (This claim element is similar in scope to a corresponding claim element in Claim 2, and hence is rejected under similar rationale.), 
the reward given when the condition is satisfied, and 
a confidence value indicating statistical confidence of the reward,
the first step includes 
a step in which the processor specifies two entries to be compared with reference to the condition of the first reward function information and the condition of the second reward function information (This claim element is similar in scope to a corresponding claim element in Claim 2, and hence is rejected under similar rationale.), and 
a step in which the processor determines whether to allow the reward included in an entry specified from the second reward function information to be reflected as the reward included in an entry specified from the first reward function information on the basis of the confidence value of the specified two entries (This claim element is similar in scope to a corresponding claim element in Claim 2, and hence is rejected under similar rationale.), and
the second step includes a step in which, when the reward included in the entry specified from the second reward function information is reflected as the reward included in the entry specified from the first reward function information (Under its broadest reasonable interpretation, this claim limitation in a method claim recites a contingent clause whose subsequent claim language need not be performed, because the condition precedent (“when the reward included in the entry specified from the second reward function information is reflected as the reward included in the entry specified from the first reward function information”) is not required to be met, and the claimed invention can be practiced without the condition occurring. See MPEP 2111.04(II). Applicant is advised to amend the claim to positively recite the condition as being fulfilled, since no patentable weight is given to the claim language following a contingent clause that does not require the condition to be fulfilled for practicing the claimed invention. However, for the purposes of examination, this contingent clause will be treated as if the condition were fulfilled.) (This claim element is similar in scope to a corresponding claim element in Claim 2, and hence is rejected under similar rationale.), 
the processor sets the reward included in the entry specified from the second reward function information in the entry specified from the first reward function information (This claim element is similar in scope to a corresponding claim element in Claim 2, and hence is rejected under similar rationale.).  
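For illustration only, the entry comparison recited in Claim 8 can be sketched as below. The `Entry` type, the function names, exact matching of conditions, and the rule that a reward is reflected only when the second entry's confidence value is strictly higher are all assumptions made for this sketch; the claim itself only requires that the determination be made “on the basis of the confidence value of the specified two entries.”

```python
from dataclasses import dataclass, replace
from typing import Iterable, Optional


@dataclass(frozen=True)
class Entry:
    """Hypothetical reward-function entry: condition, reward, confidence value."""
    condition: str
    reward: float
    confidence: float


def find_entry(entries: Iterable[Entry], condition: str) -> Optional[Entry]:
    """Return the first entry whose condition matches exactly, or None."""
    for entry in entries:
        if entry.condition == condition:
            return entry
    return None


def reflect_reward(first: Entry, second: Entry) -> Entry:
    """Assumed rule: copy the second entry's reward only if its confidence is strictly higher."""
    if second.confidence > first.confidence:
        return replace(first, reward=second.reward)
    return first


# First and second reward function information, each holding one entry.
first_info = [Entry("queue_length>10", reward=1.0, confidence=0.2)]
second_info = [Entry("queue_length>10", reward=5.0, confidence=0.8)]

target = first_info[0]
match = find_entry(second_info, target.condition)
updated = reflect_reward(target, match) if match is not None else target
```

In this sketch only the reward is copied; the confidence value of the first entry is left unchanged, which mirrors the claim language that the processor “sets the reward” in the first entry.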
Regarding Claim 9, Maehara in view of Ribeiro teaches
The control method according to claim 8, 
wherein the at least one computer manages attribute information for managing an attribute of a state constituting a condition of the plurality of pieces of reward function information (This claim element is similar in scope to a corresponding claim element in Claim 3, and hence is rejected under similar rationale.), and
the first step includes 
a step in which the processor selects a first entry from the first reward function information (This claim element is similar in scope to a corresponding claim element in Claim 3, and hence is rejected under similar rationale.), 
a step in which the processor specifies a second entry, in which a condition similar to a condition included in the first entry is set, from entries included in the second reward function information on the basis of the attribute information (This claim element is similar in scope to a corresponding claim element in Claim 3, and hence is rejected under similar rationale.), and 
a step in which the processor determines whether to allow the reward included in the second entry to be reflected in the reward included in the first entry on the basis of the confidence value included in the first entry and the confidence value included in the second entry (This claim element is similar in scope to a corresponding claim element in Claim 3, and hence is rejected under similar rationale.).  
Regarding Claim 10, Maehara in view of Ribeiro teaches
The control method according to claim 9, further comprising:
a step in which the processor 
decides a combination of optimal actions of the first control target by using the first reward function information (This claim element is similar in scope to a corresponding claim element in Claim 4, and hence is rejected under similar rationale.), and 
calculates an association value indicating an association between the condition and an objective variable value for defining the optimal actions of the first control target (This claim element is similar in scope to a corresponding claim element in Claim 4, and hence is rejected under similar rationale.); and 
a step in which the processor calculates the confidence value on the basis of the association value (This claim element is similar in scope to a corresponding claim element in Claim 4, and hence is rejected under similar rationale.).  
Claims 5-6 and 11-12 are rejected under 35 U.S.C. 103 as being unpatentable over Maehara, Masakazu, U.S. PGPUB 2014/0135952, published 5/15/2014 [hereafter referred to as Maehara] in view of Ribeiro et al., Interaction Models for Multiagent Reinforcement Learning, CIMCA 2008, IAWTIC 2008, and ISE 2008, IEEE, pp.464-469 [hereafter referred to as Ribeiro]; in further view of Dusparic et al., U.S. PGPUB 2013/0176146, Decentralised Autonomic System .
Regarding Claim 5, Maehara in view of Ribeiro teaches
The computer system according to claim 4, 
wherein the processor 
decides the combination of the optimal actions of the first control target by using the first reward function information ([Ribeiro p.465 col.2 Section 3. Interaction Model, 1st paragraph: a global action policy determined through a collection of best rewards received from each agent (“… optimal actions of the first control target by using the first reward function information”) (“When policies Q1,…,Qx are unified, it is possible to come up a new policy namely Global Action Policy (GAP = {GAP1,…,GAPx}), in which GAPi denotes the best rewards acquired by the agent i during the learning process.”).] [Ribeiro p.466 col.2 last paragraph: “a Global Action Policy GAP = {GAP1,…, GAPx}, where GAPi represents the partial policy of the agent i;”] [Ribeiro p.466 col.1 2nd paragraph (Section 3.1. Cooperation Models); p.466 Algorithm 1: referring to Figure 1 and Algorithm 1 lines 18-20, the agent that currently represents the optimal policy shares the reward it learned from the update_policy() function with the other agents; running through MA-RL Algorithm 1 (see [Ribeiro p.466 col.1 1st paragraph (Section 3. Interaction Model); p.466 Algorithm 1; p.467 Algorithms 2 and 3]) will eventually determine the global action policy (“decides a combination of optimal actions of the first control target by using the first reward function information”) (“The diagram in Figure 1 shows how the agents keep up with the knowledge all along the interaction. The agent i employs the Q-learning algorithm to generate and store the rewards in $\hat{Q}_i$. When the Agent A* receives the rewards, it proceeds as follows: when the agent i goes from an initial state to the goal-state with the lowest cost, the agent will thus be able to share these rewards (accumulated rewards according to algorithm 2) with other agents using a cooperation model. When carried out the rewards exchange of each partial policy Qi, the agents can update their knowledge and interact into the environment using the GAP.”).]), …
However, Maehara in view of Ribeiro does not teach
calculates a contribution value indicating a magnitude of contribution to the objective variable value of the reward included in the first entry, and 
updates the confidence value included in the first entry on the basis of the contribution value.  
Dusparic teaches
calculates a contribution value indicating a magnitude of contribution to the objective variable value of the reward included in the first entry ([Dusparic paragraph [0059]: a W-value representing an importance of the policy’s current state; in the context of Maehara in view of Ribeiro, this W-value is associated with a policy’s current state (“objective variable value”), where being associated with a policy is interpreted as being included as part of each agent’s reward function information (“Each local policy suggests an action for execution at the next time step, together with the associated W-value (importance) of the policy's current state.”).] [Dusparic Figure 6; paragraph [0064]: a cooperation coefficient C (see [Dusparic paragraph [0034]]) is also a learned value between agents through Q-learning; in the context of Maehara in view of Ribeiro, this C value is also learned as part of policy learning and is part of each agent’s reward function information (“FIG. 6 illustrates how the value of C can be learnt per agent, i.e., an agent uses the same C to scale W-values received for all policies on all one-hop neighbours; … or can be learnt per remote policy, i.e., the number of Q-learning processes that are learning C on an agent is the sum of the number of all policies that all of its one-hop neighbours implement, agent learns a separate C to scale W-values received from each policy on each neighbour. … FIG. 6 shows an example where a different C is learnt for each remote policy.”).] [Dusparic Figure 3, elements RP11, RP32; paragraph [0060]: multiplying a W-value with a cooperation coefficient C to determine the current importance of each action (“calculates a contribution value indicating a magnitude of contribution to the objective variable value of the reward included in the first entry”) (“Each policy, both local and remote, suggests an action, as selected based on the outcome of the ongoing Q-learning processes associated with each policy. Each action is also associated with its current importance, expressed as a learnt W-value, learnt as the outcome of the ongoing W-learning processes associated with each policy. The action that is executed on each agent is the one with the highest current importance, i.e., the highest W-value, after remote policies' W-values have been multiplied by the cooperation coefficient C, where 0<=C<=1. C is introduced to enable a local agent to give a varying degree of importance to the neighbours' action preferences. C can range from a fully non-cooperative value, C=0, where an agent does not consider neighbours' action preferences at all, to a fully cooperative, C=1, where neighbours' preferences matter as much as local ones.”).]), and
updates the confidence value included in the first entry on the basis of the contribution value ([Dusparic paragraph [0057]-[0058]: running a distributed W-learning algorithm based on a distributed Q-learning reinforcement learning algorithm, where an importance parameter W-value reflecting the importance of an action in a current state is also shared (“In DWL, each junction implements a Q-learning RL process model whereby it receives information on current traffic conditions from available sensors, maps that information to one of the available system state representations, and executes the action (set of traffic light sequences) that is has learnt to be the most suitable in the long-term for the given traffic conditions.”).] [Dusparic Figure 2; paragraph [0059]: DWL algorithm performing updates to learn optimal actions and the importance of actions (“After initialization, at each time step (depicted in FIG. 2), each local policy on each agent observes its local environmental conditions, maps them to a state representation, and performs updates on its local processes that learn the optimal actions and the importance of executing the preferred actions in each state. … Each agent also receives from each of its neighbours' state information (a representation of the neighbours' environment conditions) for each of its policies, and based on that information performs updates on its remote processes that learn the optimal actions and the importance of executing the preferred actions in each particular state.”).] [Dusparic Figure 3; paragraph [0060]: W-values are shared and learned between agents using Q-learning algorithms; in the context of Maehara in view of Ribeiro, the W-value is part of the reward function information Qi and therefore is analyzed and updated as part of the cost function analysis (see [Ribeiro p.466 col.1 1st paragraph (Section 3. Interaction Model)]) (“updates the confidence value included in the first entry on the basis of the contribution value”) (“At each time step, each agent makes a decision as to what actions to execute (i.e., what traffic-control signal settings to deploy) based on optimal actions learnt for its local and remote policies. Each policy, both local and remote, suggests an action, as selected based on the outcome of the ongoing Q-learning processes associated with each policy. Each action is also associated with its current importance, expressed as a learnt W-value, learnt as the outcome of the ongoing W-learning processes associated with each policy. The action that is executed on each agent is the one with the highest current importance …”).]).  
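For illustration only, the action selection described in the quoted Dusparic paragraph [0060] can be sketched as below: each policy suggests an action with a W-value, W-values from remote policies are multiplied by the cooperation coefficient C (0<=C<=1), and the action with the highest resulting importance is executed. The `Suggestion` type, the function name, and the example action labels are hypothetical and do not appear in Dusparic.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Suggestion:
    """One policy's suggested action and its learnt W-value (importance)."""
    action: str
    w_value: float
    remote: bool  # True if the policy belongs to a one-hop neighbour


def select_action(suggestions: Sequence[Suggestion], c: float) -> str:
    """Return the action with the highest importance after remote W-values are scaled by C."""
    if not 0.0 <= c <= 1.0:
        raise ValueError("cooperation coefficient C must satisfy 0 <= C <= 1")
    if not suggestions:
        raise ValueError("at least one suggestion is required")

    def scaled(suggestion: Suggestion) -> float:
        # Only remote policies are scaled; local W-values are used as-is.
        return suggestion.w_value * c if suggestion.remote else suggestion.w_value

    return max(suggestions, key=scaled).action


suggestions = [
    Suggestion(action="keep_phase", w_value=0.6, remote=False),
    Suggestion(action="switch_phase", w_value=0.9, remote=True),
]
non_cooperative = select_action(suggestions, c=0.0)    # neighbour preference ignored
fully_cooperative = select_action(suggestions, c=1.0)  # neighbour preference counts fully
```

With C=0 the local suggestion is selected, and with C=1 the remote suggestion with the larger W-value is selected, which matches the non-cooperative and fully cooperative extremes described in the quoted passage.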
Both Maehara in view of Ribeiro and Dusparic are analogous art since both teach the use of reinforcement learning in multi-agent systems.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the multi-agent reinforcement learning of Maehara in view of Ribeiro and expand upon the reinforcement learning by incorporating the distributed W-learning techniques of using W-values and cooperation coefficient values of Dusparic as a way to learn and apply contributions between agents in a multi-agent system. The motivation is taught in Dusparic, as distributed W-learning allows agents the run-time ability to directly assign and communicate certain importance to actions resulting from states, so that certain actions with higher importance should be given higher preference, effectively prioritizing certain actions toward determining an optimal policy for the multi-agent system without extensive pre-configuration of the agents, thus optimizing the performance and efficiency of the multi-agent system ([Dusparic paragraph [0026]: “The main operational advantage of the system of the invention is that it utilizes machine learning to learn appropriate behaviours … It removes the need for extensive preconfiguration, as the agents or nodes can configure themselves based on the observed conditions and learnt behaviours, reducing the configuration, deployment, and operational time and costs. … Using remote learning, each junction can automatically learn dependencies between neighbouring junctions, i.e., the effect of one junction's traffic light settings on another for a particular set of traffic conditions, removing the need for manual analysis, …”]).
Regarding Claim 6, Maehara in view of Ribeiro, in further view of Dusparic teaches
The computer system according to claim 5, 
wherein when the contribution value is a value indicating that contribution of the reward included in the first entry to the objective variable value is small ([Dusparic paragraph [0063]: the action associated with the state with the highest W-value (“contribution value”) is executed, with the converse interpreted to mean that an action with the lowest W-value will not be executed (“wherein when the contribution value is a value indicating that the contribution … is small”); in the context of Maehara in view of Ribeiro, this translates to the cost function returning a large value during the comparison, such that a particular action is not considered part of the optimal action (see [Ribeiro p.466 col.1 1st paragraph (Section 3. Interaction Model)]) (“FIG. 5 shows the action associated with the state with the highest current W-value is executed, and the outcome of that action is observed on all local policies and on all policies on all one-hop neighbours, i.e., the rewards received by all local policies and by all policies on all one-hop neighbours are added up and used to update the value of C used in the last time step using a learning process.”).]), 
the processor updates the confidence value included in the first entry to a value indicating that statistical confidence is low ([Ribeiro p.466 col.1 1st paragraph (Section 3. Interaction Model); p.466 Algorithm 1; p.467 Algorithms 2 and 3: referring to MA-RL Algorithm 1 lines 10-17 at each step loop, each agent calculates a reward value based on their respective Qi and shares their reward function information (where Qi contains <state S, action A, state transitional probability $\partial_{s,s'}^{a}$, reward $R_{s,s'}^{a}$, …>; see Algorithm 1 lines 14-15), where cooperate() invokes an update_policy() with a cost function (see [Ribeiro p.467 Algorithm 3 lines 1-7]) that performs a comparison of the Qi from all agents against an agent that currently represents an optimal policy (see [Ribeiro p.466 Figure 1]), with the cost function analyzing each Qi (which includes reward $R_{s,s'}^{a}$ and state transitional probability $\partial_{s,s'}^{a}$) at a given state s to determine the best optimal path from that state s (which reflects a form of probability/statistical analysis); hence the result returned from the cost function in the process of selecting a best path is interpreted as a “high” confidence value based on a statistical calculation based on each respective Qi, with the converse being that a sub-optimal path is represented by a “low” confidence value based on a statistical calculation (“processor updates the confidence value included in the first entry to a value indicating that the statistical confidence is low”) (“Algorithm 1 presents the share_policy function which shares the agents’ learning information. … The best rewards are sent out to the GAP forming a set of the best acquired rewards by the agents. These rewards will be further shared with the other agents. … To estimate GAP with the best rewards, we will use a cost function which finds the best path between the initial states and goal state for a given policy. … We assume that A* produces a generative model governing the optimal policy Q*. We consider a policy as optimal when the number of right hits that an agent can obtain in a certain environment is the maximum possible. A right hit is obtained when the agent has the capacity of finding the goal-state with the lowest possible cost (relative to the cost provided by the A*). The cost is defined by the number of steps the agent needs to reach the goal-state and the sum of the existing costs in the path between each initial state and the goal-state [12]. Figure 1 shows a representative diagram to illustrate interaction among the agents.”).]).  
Regarding Claim 11, Maehara in view of Ribeiro teaches
The control method according to claim 10, further comprising:
a fourth step in which the processor 
decides the combination of the optimal actions of the first control target by using the first reward function information (This claim element is similar in scope to a corresponding claim element in Claim 5, and hence is rejected under similar rationale.), …
However, Maehara in view of Ribeiro does not teach
calculates a contribution value indicating a magnitude of contribution to the objective variable value of the reward included in the first entry (This claim element is similar in scope to a corresponding claim element in Claim 5, and hence is rejected under similar rationale.); and 
a fifth step in which the processor updates the confidence value included in the first entry on the basis of the contribution value.
Dusparic teaches 
calculates a contribution value indicating a magnitude of contribution to the objective variable value of the reward included in the first entry (This claim element is similar in scope to a corresponding claim element in Claim 5, and hence is rejected under similar rationale.); and 
a fifth step in which the processor updates the confidence value included in the first entry on the basis of the contribution value (This claim element is similar in scope to a corresponding claim element in Claim 5, and hence is rejected under similar rationale.).  
Maehara in view of Ribeiro and Dusparic are analogous art since both teach the use of reinforcement learning in multi-agent systems.
It would have been obvious to a person having ordinary skill in the art before the effective filing date of the invention to take the multi-agent reinforcement learning of Maehara in view of Ribeiro and expand upon the reinforcement learning by incorporating the distributed W-learning techniques of using W-values and cooperation coefficient values of Dusparic as a way to learn and apply contributions between agents in a multi-agent system. The motivation is taught in Dusparic, as distributed W-learning allows agents the run-time ability to directly assign and communicate certain importance to actions resulting from states, so that certain actions with higher importance should be given higher preference, effectively prioritizing certain actions toward determining an optimal policy for the multi-agent system without extensive pre-configuration of the agents, thus optimizing the performance and efficiency of the multi-agent system ([Dusparic paragraph [0026]: “The main operational advantage of the system of the invention is that it utilizes machine learning to learn appropriate behaviours … It removes the need for extensive preconfiguration, as the agents or nodes can configure themselves based on the observed conditions and learnt behaviours, reducing the configuration, deployment, and operational time and costs. … Using remote learning, each junction can automatically learn dependencies between neighbouring junctions, i.e., the effect of one junction's traffic light settings on another for a particular set of traffic conditions, removing the need for manual analysis, …”]).
Regarding Claim 12, Maehara in view of Ribeiro in further view of Dusparic teaches
The control method according to claim 11, 
wherein the fifth step includes a step in which, when the contribution value is a value indicating that contribution of the reward included in the first entry to the objective variable value is small (Under its broadest reasonable interpretation, this claim limitation in a method claim recites a contingent clause whose subsequent claim language need not be performed, because the condition precedent (“when the contribution value is a value indicating that contribution of the reward included in the first entry to the objective variable value is small”) is not required to be met, and the claimed invention can be practiced without the condition occurring. See MPEP 2111.04(II). Applicant is advised to amend the claim to positively recite the condition as being fulfilled, since no patentable weight is given to the claim language following a contingent clause that does not require the condition to be fulfilled for practicing the claimed invention. However, for the purposes of examination, this contingent clause will be treated as if the condition were fulfilled.) (This claim element is similar in scope to a corresponding claim element in Claim 6, and hence is rejected under similar rationale.), 
the processor updates the confidence value included in the first entry to a value indicating that statistical confidence is low (This claim element is similar in scope to a corresponding claim element in Claim 6, and hence is rejected under similar rationale.).   

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to WILLIAM WAI YIN KWAN whose telephone number is 303-297-4332.  The examiner can normally be reached on Monday-Friday 8:00am - 4:30pm PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Li Zhen, can be reached at 571-272-3768.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications 





/WILLIAM WAI YIN KWAN/Examiner, Art Unit 2121                                                                                                                                                                                                        



/Li B. Zhen/Supervisory Patent Examiner, Art Unit 2121