DETAILED ACTION
This communication is responsive to application 16/929,975 filed 15 July 2020.
The instant application has a total of 15 claims pending in the application, all of which are ready for examination. Claims 1, 6 and 11 are independent claims.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority
Acknowledgment is made of applicant’s claim for foreign priority under 35 U.S.C. 119 (a)-(d). Receipt is acknowledged of certified copies of papers required by 37 CFR 1.55.

Information Disclosure Statement
As required by M.P.E.P. 609(c), the applicant’s submissions of the Information Disclosure Statements dated 07/15/2020 and 06/07/2022 are acknowledged by the examiner and the cited references have been considered in the examination of the claims now pending. As required by M.P.E.P. 609 C(2), a copy of the PTOL-1449 initialed and dated by the examiner is attached to the instant office action.

Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) because the claim limitations use a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier.  Such claim limitations are identified with respect to claims 1 and 4 which recite a device/apparatus comprising:
statistical analyzer configured to…
action selector configured to…
Because these claim limitations are being interpreted under 35 U.S.C. 112(f), they are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof. 
The corresponding structure is interpreted in light of the specification. The specification does not make it clear that the analyzer and selector are limited to particular embodiments. The elements are illustrated by drawing Fig 2:110/130/140 and described by the specification largely repeating claim language per US PG Pub 20210019644A1:
[0039-40] “The statistical analyzer 110 may statistically analyze…”
[0046-47] “The action selector 130 may select an action…”
[0068], [0075] “the functions or the processes described in the example embodiments may be implemented by software… apparatuses may be incorporated into a single software product”, which conveys a software-only implementation.
If applicant does not intend to have these limitations interpreted under 35 U.S.C. 112(f), applicant may:  (1) amend the claim limitations to avoid them being interpreted under 35 U.S.C. 112(f) (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitations recite sufficient structure to perform the claimed function so as to avoid them being interpreted under 35 U.S.C. 112(f).

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


Claims 1-5 are rejected under 35 U.S.C. 112(b) as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor regards as the invention.
The claim limitations “statistical analyzer” and “action selector” invoke 35 U.S.C. 112(f). However, the written description fails to disclose the corresponding structure, material, or acts for performing the entire claimed function and to clearly link the structure, material, or acts to the function. The statistical analyzer and action selector are illustrated in Fig 2:110/130 and described at [0039-40], [0046-47]. Neither the specification nor the drawings describe sufficient supporting structure for each function that clearly links the structure, material, or acts to performance of the entire claimed function. Therefore, the claim is indefinite and is rejected under 35 U.S.C. 112(b). The examiner interprets the claimed functions as software functions executable by a computer. The dependent claims fail to cure the deficiency and inherit the rejection. Accordingly, claims 1-5 are rejected as indefinite under §112(b).
Applicant may:
(a) Amend the claim so that the claim limitation will no longer be interpreted as a limitation under 35 U.S.C. 112(f) or pre-AIA 35 U.S.C. 112, sixth paragraph;
(b) Amend the written description of the specification such that it expressly recites what structure, material, or acts perform the entire claimed function, without introducing any new matter (35 U.S.C. 132(a)); or
(c) Amend the written description of the specification such that it clearly links the structure, material, or acts disclosed therein to the function recited in the claim, without introducing any new matter (35 U.S.C. 132(a)).
If applicant is of the opinion that the written description of the specification already implicitly or inherently discloses the corresponding structure, material, or acts and clearly links them to the function so that one of ordinary skill in the art would recognize what structure, material, or acts perform the claimed function, applicant should clarify the record by either:
(a) Amending the written description of the specification such that it expressly recites the corresponding structure, material, or acts for performing the claimed function and clearly links or associates the structure, material, or acts to the claimed function, without introducing any new matter (35 U.S.C. 132(a)); or
(b) Stating on the record what the corresponding structure, material, or acts, which are implicitly or inherently set forth in the written description of the specification, perform the claimed function. For more information, see 37 CFR 1.75(d) and MPEP §§ 608.01(o) and 2181.

Claims 1-5 are rejected under 35 U.S.C. 112(b) as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor regards as the invention.
Claim 1 recites the limitation "the apparatus comprising" in the preamble, which introduces “An agent device”. There is insufficient antecedent basis for this limitation in the claim because the preamble introduces a device, not an apparatus. The deficiency is not cured by the dependent claims. Accordingly, claims 1-5 are rejected under §112(b).
Claims 11-15 are rejected under 35 U.S.C. 112(b) as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor regards as the invention.
Claim 11 recites the limitation "the device comprising" in the preamble, which introduces “An agent apparatus”. There is insufficient antecedent basis for this limitation in the claim because the preamble introduces an apparatus, not a device. The deficiency is not cured by the dependent claims. Accordingly, claims 11-15 are rejected under §112(b).

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1, 6 and 11 are rejected under 35 U.S.C. 103 as being unpatentable over: 
Camilo Gamboa Higuera et al., US Patent 11,501,167B2 hereinafter CGH, as evidenced by Provisional 62/839,599 (attached PTO-892, supporting sections noted below) in view of 
Czarnecki et al., US PG Pub No 20190354867A1 hereinafter Czarnecki.
With respect to claim 1, CGH teaches: 
	An agent device for performing exclusive reinforcement learning {CGH [Col1 Lines66-67] “system for reinforcement learning” (RL) e.g., RL agent for device [Col6 Line18]. Supporting provisional at [0005], [0031]}, the apparatus comprising: 
a statistical analyzer configured to collect state information of sample states of an environment and performs a statistical analysis on the sample states using the collected state information {CGH discloses [Col8 Lines39-40] “state s is sampled from initial state space p0(s)” again at [Col7 Lines54-56] illustrated Fig 2:204 and Fig 3A “s0~p0(s)… st+1~p0(s)”. Statistical analysis is p probability distribution of the sampled state, ~p0(s). The environment iterates over states for an RL objective Fig 1, [Col8 Lines2-3]. Supporting provisional Figs 1-3A, [0025], [P.A3 Alg.1] [P.A3-A4 Sect.3 and 4.1]}; 
a state value determining processor configured to determine a first state value of a first state among the states in a training phase and a second state value of a second state among the states in an inference phase based on analysis results of the statistical analysis {CGH Fig 3A “s0~p0(s)… st+1~p0(s)” where s0 is first state value and st+1 is second state value, (also being referred as next observed state, s’). Statistical analysis is the probability p distribution (e.g., s0~p0(s)), phases include training and testing where testing is inference, see at Fig 3A, [Col9 Line33]. Further, processor is Fig 4:610, [Col9 Lines54-55]. Supporting provisional at Figs 3A/4, [0025], [P.A3 Alg.1], [P.A3-A4 Sect.3 and 4.1]}; 
However, CGH does not make clear that reinforcement learning is “from different perspectives”.
Czarnecki teaches: 
a reinforcement learning processor configured to include a plurality of reinforcement learning unit which perform reinforcement learning from different perspectives according to the first state value {Czarnecki Fig 1:100 illustrates multi-policy reinforcement learning, the policies are perspectives denoted by symbol pi (π1-k) described e.g. [0094-98] “policies that are inputs to the reinforcement loss function” with Eq. of [0098] including s for states, states are introduced as observations over time steps [0028,0054]. Processors are disclosed per [0103,08] as the skilled artisan will readily appreciate because the training data is stored and sampled from RL replay memory [0092-93]}; and 
an action selector configured to select one of actions determined by the plurality of reinforcement learning unit based on the second state value {Czarnecki [0062-63] “action selection policies πmm that are a combination… of individual action selection policies generated by the candidate networks” again detailed per [0089,95-98] describes action selection policies from candidates such as by weighted sum of policies and is determined as training progresses, hence multiple time steps [0046] or indexed st states/observations being a trajectory [0098]}.
	Czarnecki is directed to reinforcement learning thus being analogous. A person having ordinary skill in the art would have considered it obvious prior to the effective filing date to perform RL over multiple policies for action selection per Czarnecki with the state sampled RL of CGH for the motivation “improved performance on the reinforcement learning” (Czarnecki [0073]).

With respect to claim 6, the rejection of claim 1 is incorporated. Claim 6 differs in scope in that it recites a method that performs the limitations of device claim 1. CGH recites “method or system” [Col1 Line62], and its teachings are understood to disclose reinforcement learning techniques in either form. The remainder of the claim is rejected for the same rationale as claim 1.

With respect to claim 11, the rejection of claim 1 is incorporated. Claim 11 differs in scope in that it recites an apparatus that performs the limitations of device claim 1. The apparatus of claim 11 comprises a processor, a memory, and a communications interface, along with an executable program stored in the memory. CGH Fig 4 illustrates a computing unit with processor, storage and I/O; CGH further describes implementation in hardware, software, or any combination thereof [Col10 Lines22-25]. The remainder of the claim is rejected for the same rationale as claim 1.

Claims 2, 7 and 12 are rejected under 35 U.S.C. 103 as being unpatentable over CGH and Czarnecki in view of
Kenton et al., “Generalizing from a few environments in safety-critical reinforcement learning” hereinafter Kenton, arXiv: 1907.01475v1.
With respect to claim 2, the combination of CGH and Czarnecki teaches the device of claim 1. 
However, the combination of CGH and Czarnecki does not disclose “sum of the average and standard deviation”.
Kenton teaches wherein: 
	the analysis results of the statistical analysis includes an average and a standard deviation of the sample states {Kenton [P.6 Last¶] “function U… combines mean, µ, and standard deviation, σ” where mean is average and sampled state data is introduced Fig 3, [P.2-4 Sect.2&3.2]}, and 
the state value determining processor is specifically configured to determine the first state value to 1 when an absolute value of the first state is greater than a sum of the average and the standard deviation and determine the first state value to 0 when the absolute value of the first state is less than or equal to the sum of the average and the standard deviation {Kenton details [P.6 Last¶] “discrimination function U = αµ + βσ” teaches sum (+) of the average/mean and the standard deviation. Further, the discrimination function U is described as a “binary classifier” Figs 6, 11 such that binary is determining value to 0 or 1. Finally, the discrimination function is compared to a confidence threshold which suggests greater than or less than absolute value to the skilled artisan, [P.7 ¶1], [P.4 ¶2], Fig 3. The technique is performed by CPU processor [P.13 Last¶]}.
	Kenton is directed to reinforcement learning thus being analogous. A person having ordinary skill in the art would have considered it obvious prior to the effective filing date to include Kenton’s discrimination function in combination to arrive at the invention as claimed for the motivation of improving safety generalization performance over distributional data and with uncertainty information for an ensemble of agents (Kenton [P.8 Sect.5 ¶1], [P.2 Sect.1 ¶5]).

Claims 7 and 12 are rejected for the same rationale as claim 2.
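For illustration only, the binarization recited in claims 2, 7 and 12 may be sketched as below. This is a minimal sketch and is not taken from the application, CGH, Czarnecki or Kenton. The function name `state_value` and the sample data are hypothetical, and the population standard deviation is an assumption because the claim does not state which estimator is used.

```python
from statistics import fmean, pstdev


def state_value(state, samples):
    """Return 1 when |state| exceeds mean + standard deviation of the samples, else 0.

    `samples` stands in for the collected sample states (hypothetical data).
    """
    mu = fmean(samples)        # the "average" of the sample states
    sigma = pstdev(samples)    # population standard deviation (assumed estimator)
    return 1 if abs(state) > mu + sigma else 0


# Hypothetical sample states: mean = 3.0, std = sqrt(2), threshold is about 4.414.
samples = [1.0, 2.0, 3.0, 4.0, 5.0]
print(state_value(4.0, samples))   # 0, since 4.0 <= threshold
print(state_value(-5.0, samples))  # 1, since |-5.0| > threshold
```

The comparison is strict, so a state whose absolute value equals the threshold maps to 0, consistent with the "less than or equal to" branch of the claim language.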

Claims 3, 8 and 13 are rejected under 35 U.S.C. 103 as being unpatentable over CGH, Czarnecki and Kenton in view of 
Lowe et al., “Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments” hereinafter Lowe, arXiv:1706.02275v3
With respect to claim 3, the combination of CGH, Czarnecki and Kenton teaches the device of claim 2, wherein: 
	the plurality of reinforcement learning unit includes a central perspective reinforcement learning unit and a peripheral perspective reinforcement learning unit {Czarnecki multi-policy illustrated Fig 1:110 such that π-policy/perspective reinforcement learning units may include a plurality such as on-policy and off-policy [0087] where on/off-policy is central/peripheral perspective RL units. Additionally, the use of candidate policies are noted throughout}, and 
However, the combination of CGH, Czarnecki and Kenton does not disclose “perform the reinforcement learning by using the peripheral perspective reinforcement learning unit when the first state value is 1 and perform the reinforcement learning by using the central perspective reinforcement learning unit when the first state value is 0”.
Lowe teaches:
the reinforcement learning processor is specifically configured to perform the reinforcement learning by using the peripheral perspective reinforcement learning unit when the first state value is 1 and perform the reinforcement learning by using the central perspective reinforcement learning unit when the first state value is 0 {Lowe Fig 1 illustrates multi-policy framework for reinforcement learning and details policies comprising agent policies π and target policies µ corresponding to central/peripheral perspective RL units, see [P.5 Sect.4]. Policy distribution is evaluated by entropy to approximate policy [P.5 Sect4.2], the analysis uses a Q-learning function for observations which are introduced as over a state distribution [0,1] per [P.3 Sect.3 ¶1]. Lowe further discloses agents taking binary data [P.4 Prop.1] and algorithmic implementation is provided such that target policy is to select action [P.13 Alg.1]}.
	Lowe is directed to reinforcement learning thus being analogous. A person having ordinary skill in the art would have considered it obvious prior to the effective filing date to combine the teachings of Lowe to arrive at the invention as claimed for the motivation of learning policy distribution (Lowe [P.5 Sect4.2]).

Claims 8 and 13 are rejected for the same rationale as claim 3.

Claims 4, 9 and 14 are rejected under 35 U.S.C. 103 as being unpatentable over CGH, Czarnecki and Kenton in view of Lowe.
With respect to claim 4, the combination of CGH and Czarnecki teaches the device of claim 1. 
However, the combination of CGH and Czarnecki does not disclose “sum of the average and standard deviation”.
Kenton teaches wherein: 
	the analysis result of the statistical analysis includes an average and a standard deviation of collected sample states {Kenton [P.6 Last¶] “function U… combines mean, µ, and standard deviation, σ” where mean is average and sampled state data is introduced Fig 3, [P.2-4 Sect.2&3.2]}, 
the state value determining processor is specifically configured to determine the second state value to 1 when an absolute value of the second state is greater than a sum of the average and the standard deviation and determine the second state value to 0 when the absolute value of the second state is less than or equal to the sum of the average and the standard deviation {Kenton [P.6 Last¶] “discrimination function U = αµ + βσ” teaches sum (+) of the average/mean and the standard deviation. Further, the discrimination function U is described as a “binary classifier” Figs 6, 11 such that binary is determining value to 0 or 1. Finally, the discrimination function is compared to a confidence threshold which suggests greater than or less than absolute value to the skilled artisan, [P.7 ¶1], [P.4 ¶2], Fig 3. The technique is performed by CPU processor [P.13 Last¶]. In addition, the state value being second is demonstrated in Fig 3 where subscript of state denotes time or step index as reinforcement learning iterates over series data}. The motivation for combination applies equally as in claim 2.
However, the combination of CGH, Czarnecki and Kenton does not specify that the multipolicy RL is for action selection specifically configured to “select an action determined by the peripheral perspective reinforcement learning unit when the second state value is 1 and select an action determined by the central perspective reinforcement learning unit when the second state value is 0”
Lowe teaches:
the plurality of reinforcement learning unit includes a central perspective reinforcement learning unit and a peripheral perspective reinforcement learning unit {Lowe Fig 1 reinforcement framework is multi-policy/perspective where agent policy π and target policy µ corresponds to the central and peripheral perspective RL units [P.5]}, and 
the action selector is specifically configured to select an action determined by the peripheral perspective reinforcement learning unit when the second state value is 1 and select an action determined by the central perspective reinforcement learning unit when the second state value is 0 {Lowe [P.13 Alg.1] “select action a = µθi(oi) + Nt w.r.t. the current policy and exploration” policy-µ being detailed as target policy approximated per [P.5] where target/agent policies are central/peripheral perspective RL units and the state distribution is [0,1] introduced per [P.3]}. 
	Lowe is directed to reinforcement learning thus being analogous. A person having ordinary skill in the art would have considered it obvious prior to the effective filing date to combine the teachings of Lowe to arrive at the invention as claimed for the motivation of learning policy distribution and where policies are utilized for action selection (Lowe [P.5 Sect4.2], [P.13 Alg.1]).

Claims 9 and 14 are rejected for the same rationale as claim 4.
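For illustration only, the selection logic recited for the action selector in claims 4, 9 and 14 may be sketched as below. This is a minimal sketch under stated assumptions and is not taken from the application or from Lowe. The names `select_action`, `central_policy` and `peripheral_policy` are hypothetical, and the two placeholder policies return fixed actions solely so that the routing is visible.

```python
from typing import Callable, Sequence

# A policy maps an observation to an action index (hypothetical signature).
Policy = Callable[[Sequence[float]], int]


def select_action(observation: Sequence[float],
                  second_state_value: int,
                  central: Policy,
                  peripheral: Policy) -> int:
    """Route to the peripheral unit when the state value is 1, to the central unit when it is 0."""
    if second_state_value not in (0, 1):
        raise ValueError("state value must be 0 or 1")
    chosen = peripheral if second_state_value == 1 else central
    return chosen(observation)


# Placeholder policies (assumptions): each always returns a fixed action.
def central_policy(observation: Sequence[float]) -> int:
    return 0


def peripheral_policy(observation: Sequence[float]) -> int:
    return 1


obs = [0.2, -0.7]
print(select_action(obs, 0, central_policy, peripheral_policy))  # 0
print(select_action(obs, 1, central_policy, peripheral_policy))  # 1
```

In this sketch both units are assumed to have already determined an action for the observation; the selector only chooses between them based on the binary state value.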

Claims 5, 10 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over CGH and Czarnecki in view of  
Mavrin et al., “Distributional Reinforcement Learning for Efficient Exploration” hereinafter Mavrin, arXiv: 1905.06125v1.
With respect to claim 5, the combination of CGH and Czarnecki teaches the device of claim 1. 
However, the combination of CGH and Czarnecki does not expressly disclose “normal distribution” or that state values are based on “locations” of the states.
Mavrin teaches wherein: 
	the analysis result of the statistical analysis includes an average and a standard deviation of collected sample states, and the state value determining processor is specifically configured to determine the first state value and the second state value based on locations of the first state and the second state in a normal distribution in accordance to the average and the standard deviation {Mavrin [P.3 Sect3.1] “normal distribution N (µk, σk)” details mean/average µ and standard deviation σ, further [P.3 Sect2.3] discloses states being evaluated by distributional reinforcement learning over indexed quantiles of the distribution, sampling is described over locations/positions per [P.7 Sect5.1]. The use of standard deviation is disclosed as an alternative embodiment to variance as used e.g. Eqs. 3-4 [P.4]. Finally, a processor is that which performs the compute and does so with a replay buffer [P.3]}.
	Mavrin is directed to reinforcement learning thus being analogous. A person having ordinary skill in the art would have considered it obvious prior to the effective filing date to specify distribution as normal distribution over mean and variance for states based on locations per Mavrin to arrive at the invention as claimed for the motivation of learning the distribution in a reinforcement learning regime which leads to optimistic exploration in the face of uncertainty (Mavrin [P.3 Sect2.3], [P.1 Sect.1 ¶1]). Result performance is described as state-of-the-art (Mavrin [P.8 Sect.7]).

Claims 10 and 15 are rejected for the same rationale as claim 5.
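For illustration only, one way to express the "location" of a state in a normal distribution defined by the average and standard deviation of the sample states is a z-score together with the corresponding cumulative probability. This is a minimal sketch and is not taken from the application or from Mavrin. The function name `location_state_value`, the cutoff of one standard deviation, and the sample data are all assumptions, since the claim does not specify how location maps to a state value.

```python
from statistics import NormalDist, fmean, pstdev


def location_state_value(state, samples, z_cutoff=1.0):
    """Locate `state` in N(mean, std) fitted to `samples`.

    Returns (state_value, z, percentile). The state value is 1 when the
    state lies more than `z_cutoff` standard deviations from the mean
    (assumed rule), else 0.
    """
    mu = fmean(samples)
    sigma = pstdev(samples)  # population standard deviation (assumed estimator)
    if sigma == 0:
        raise ValueError("standard deviation is zero; location is undefined")
    z = (state - mu) / sigma                       # signed distance from the mean
    percentile = NormalDist(mu, sigma).cdf(state)  # position within the distribution
    return (1 if abs(z) > z_cutoff else 0), z, percentile


samples = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical sample states
print(location_state_value(3.0, samples))  # state at the mean: value 0, z = 0.0
print(location_state_value(5.0, samples))  # state in the upper tail: value 1
```

Unlike a threshold on the absolute value of the state, this sketch measures distance from the mean, so it treats both tails of the distribution symmetrically.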

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure: 
Bellemare et al., “A Distributional Perspective on Reinforcement Learning” arXiv: 1707.06887v1, with over 1,000 citations and widely acknowledged as the first work on distributional reinforcement learning, see Fig 1. Relates to WO2018189404A1 and US2021011027A1 or US20200364557A1, all similar and all from DeepMind, noted for Applicant.
Hernandez Leal et al., US20200143208A1 teaches distributed/async RL, see Fig 1B.
Ding et al., US20200380353A1, see Fig 4, teaches distributional RL replete with mean and standard deviation.
Patel, Yagna “Optimizing Market Making using Multi-agent Reinforcement Learning” arXiv: 1812.10252v1, teaches z-score [P.4 Eq.5.1].

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to Chase P Hinckley whose telephone number is (571)272-7935. The examiner can normally be reached M-F 9:00 - 5:00.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda M. Huang, can be reached at 571-270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/CHASE P. HINCKLEY/Examiner, Art Unit 2124
/YING YU CHEN/Primary Examiner, Art Unit 2125