DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Drawings
The drawings were received on 04/27/2017.  These drawings are acceptable.


Response to Amendment
Applicant’s response filed 10/28/2021 has been fully considered by the examiner.

In response to applicant’s arguments and amendments, see pgs. 12-13 of response, with respect to the claim rejections under 35 U.S.C. §112(a) and §112(b), the applicant’s arguments have been fully considered. In consideration of applicant’s arguments and claim amendments, the rejection made in the previous office action has been maintained. The applicant has not provided clarity regarding the support for the specified process for comparing differences as noted in the amended claim limitations, and the newly amended claims provide for the term “predetermined threshold” to function as both a numeric value and a receiving element. The applicant should provide some indication regarding what support is provided for including the highlighted claim limitations and should consider modifying the claim language to make the intended scope of the newly added limitations clear to persons having ordinary skill in the art. See current action below for the analysis of the amended claim limitations.

In response to applicant’s arguments and amendments, see pgs. 13-22 of response, with respect to the claim rejection under 35 U.S.C. §103, the applicant’s arguments have been fully considered and the rejection has been maintained.
	Applicant has argued that the prior art made of record does not disclose the claim limitations as claimed in independent claims 1, 13, and 17.

	First, applicant argues, in pgs. 15-17: 
	“As an example, the proposed Cal-Tom-Ant combination fails to disclose, teach, or suggest accessing a data set comprising tuples for training a recurrent machine-learning model, each tuple for training the recurrent machine-learning model being associated with a first user and comprising a (1) first action, a (2) first state, a (3) first reward, a (4) second action, and a (5) second state, wherein a first time period is associated with the first state, the first action, and the… and wherein the (5) second Tom merely discloses that a neural network is trained based on experience data, where "each piece of experience data is an experience tuple that includes a respective current observation characterizing a respective current state of the environment, a respective current action performed by the agent in response to the current observation, a respective next state characterizing a respective next state of the environment, and a reward received in response to the agent performing the current action." Applicant respectfully submits that the training data set of Tom is not the same as the training data set of Claim 1. For instance, the training data set or experience data or tuples in Tom for the training does not include a next or subsequent action performed by the agent in response to the next state, much less each tuple for training the recurrent machine-learning model... comprising a (1) first action, a (2) first state, a (3) first reward, a (4) second action, and a (5) second state, wherein...the (4) second action represents the computing system sending or not sending a second notification to the first user in the second time period following the first time period..., as recited by amended independent Claim 1. Ant does not make up for the deficiencies of either Cal or Tom, and the Examiner does not assert otherwise.

In response, the examiner notes that the office action states that Collison et al. (US Pub. No. 2012/0310961, hereinafter ‘Cal’) teaches the claimed reinforcement learning technique, which uses Q-learning to capture state-action sequences associated with a reward value and uses State-Action-Reward-State-Action (SARSA) to capture the claimed recurrent machine-learning model data over an observation time period, in [0064]: In some embodiments, updating the values of the selected state table records in block 308B [i.e. accessing a data set comprising … for training a recurrent machine-learning model, each … for training the recurrent machine-learning model being associated with a first user and comprising a (1) first action, a (2) first state, a (3) first reward, a (4) second action, and a (5) second state] involves using the state table in a model of a finite Markov Decision Process (MDP) and using a reinforcement learning technique [i.e. accessing a data set comprising … for training a recurrent machine-learning model, each … for training the recurrent machine-learning model being associated with a first user and comprising a (1) first action, a (2) first state, a (3) first reward, a (4) second action, and a (5) second state] to approximate solutions for updating the values of the selected state table records. In some embodiments, a suitable approximation technique for the value function of the model involves temporal difference (TD) reinforcement learning [i.e. a recurrent machine-learning model, each … for training the recurrent machine-learning model being associated with a first user and comprising a (1) first action, a (2) first state, a (3) first reward, a (4) second action, and a (5) second state, wherein a first time period is associated with the first state, the first action, and the first reward, wherein a second time period following the first time period is associated with the second state and the second action] which may be formulated to take advantage of the so-called eligibility trace … In some embodiments, the block 308 process of updating the values of the selected state table records involves application of a reinforcement learning technique known as Q-Learning [i.e. comprising a (1) first action, a (2) first state, a (3) first reward, a (4) second action, and a (5) second state]. In some embodiments, the block 308 process of updating the values of the selected state table records involves application of a reinforcement learning process known as State-Action-Reward-State-Action (or SARSA) [claimed tuples: comprising a (1) first action, a (2) first state, a (3) first reward, a (4) second action, and a (5) second state]…;
The use of the learning model is captured over a time interval for computing the temporal differences and reward values captured with the user log data as depicted in Figs. 4-5, and in Cal 0064: In some embodiments, updating the values of the selected state table records in block 308B involves using the state table in a model of a finite Markov Decision Process (MDP) and using a reinforcement learning technique to approximate solutions for updating the values of the selected state table records. In some embodiments, a suitable approximation technique for the value function of the model involves temporal difference (TD) reinforcement learning [claimed first time period between first and second state action pairs] which may be formulated to take advantage of the so-called eligibility trace λ….. In some embodiments, the block 308 process of updating the values of the selected state table records involves application of a reinforcement learning technique known as Q-Learning. In some embodiments, the block 308 process of updating the values of the selected state table records involves application of a reinforcement learning process known as State-Action-Reward-State-Action (or SARSA) Learning. In other embodiments, a Monte Carlo method may be used in the block 308 process of updating the values of the selected state table records. Cal teaches the learning system that trains a model using reinforcement learning techniques using State, Action, Reward, State, Action, in [0064] that are usually captured as tuples in computing applications, and depicted in Figs. 5A-E for generating notifications to the user using feedback information, in [0076]-[0081]. 
While Cal does not expressly disclose the data set as a set of tuples, Schlau et al. (US Pub. No. 2017/0140269, hereinafter ‘Tom’) does expressly disclose the data set as a set of tuples, as recited in the claim 1 limitation “accessing a data set comprising tuples” (Tom teaches accessing data as observation tuples characterizing a state corresponding to an action performed by an agent under observation and a reward received in response to the performed action, in [0011]: In some implementations, each piece of experience data is an experience tuple [i.e. accessing a data set comprising tuples] that includes a respective current observation characterizing a respective current state of the environment, a respective current action performed by the agent in response to the current observation, a respective next state characterizing a respective next state of the environment, and a reward received in response to the agent performing the current action).
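
For illustration only, the structural difference between the four-element experience tuple quoted from Tom [0011] and a five-element State-Action-Reward-State-Action tuple can be sketched as follows. This is a minimal sketch and not the implementation of either reference or of the applicant; the type names, field names, and the `to_sarsa` helper are hypothetical, and the sketch assumes the tuples are recorded as one consecutive trajectory for a single user.

```python
from typing import List, NamedTuple


class ExperienceTuple(NamedTuple):
    # Four elements, following the description quoted from Tom [0011].
    state: str
    action: str
    next_state: str
    reward: float


class SarsaTuple(NamedTuple):
    # Five elements in State-Action-Reward-State-Action order.
    state: str
    action: str
    reward: float
    next_state: str
    next_action: str


def to_sarsa(trajectory: List[ExperienceTuple]) -> List[SarsaTuple]:
    """Pair each step with the action taken at the following step.

    Assumes the list is one consecutive trajectory (hypothetical setup).
    The last step has no following action, so it yields no SARSA tuple.
    """
    result = []
    for cur, nxt in zip(trajectory, trajectory[1:]):
        result.append(
            SarsaTuple(cur.state, cur.action, cur.reward, cur.next_state, nxt.action)
        )
    return result


# Hypothetical three-step trajectory of send / not_send actions.
trajectory = [
    ExperienceTuple("s0", "send", "s1", 1.0),
    ExperienceTuple("s1", "not_send", "s2", 0.0),
    ExperienceTuple("s2", "send", "s3", 2.0),
]
sarsa = to_sarsa(trajectory)
```

In this sketch the fifth element (the next action) is obtained only by looking ahead to the following step of the same trajectory; a single experience tuple taken in isolation does not carry it.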

One of ordinary skill in the art would have been motivated to integrate the methods disclosed by Cal and Tom in order to enable reinforcement learning systems to respond to received observations using neural networks to predict output from a received input (Tom, [0002]-[0005]); doing so will realize the following advantages: “training data from a replay memory [that] can be selected in a way that increases the value of the selected data for training a neural network. This can, in turn, increase the speed of training of neural networks used in selecting actions to be performed by agents and reduce the 
	The combination of Cal and Tom teaches the use of the neural network for training using reinforcement learning techniques as claimed.

	Examiner notes: This interpretation of the recurrent model is in line with applicant’s specification, which includes models using state-action-reward sequences as recurrent model data sets, in applicant’s specification, pg. 9, in [26]: The recurrent machine-learning model may be configured to take as inputs a state and an action associated with a user and a time period, and predict a cumulative reward over a plurality of time periods (e.g., 5, 7, 10 days). The cumulative reward may be recurrently defined based on at least (1) a reward R(s_t, a_t) associated with the user and the time period and (2) an application of the recurrent machine-learning model to a subsequent state and a subsequent action associated with the user and a subsequent time period relative to the time period (e.g., denoted V(s_{t+1}, a_{t+1}))… See MPEP 2111 for claim scope requirements used in making the rejection, where the examiner has used the interpretation, in light of applicant’s specification, of the recurrent machine learning for associating state and subsequent actions over a time period as depicted in Cal Fig. 3A and the cited portions noted above.
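
For illustration only, the recurrent definition quoted above, in which the cumulative reward depends on the current reward R(s_t, a_t) and on the model applied to the subsequent state and action, V(s_{t+1}, a_{t+1}), can be sketched as a tabular SARSA-style update. This is a minimal sketch under stated assumptions: the additive form, the discount factor `gamma`, the learning rate `alpha`, and the table-based model are assumptions made for the example and are not taken from the specification or from Cal.

```python
from collections import defaultdict


def td_target(reward, v_next, gamma=1.0):
    # Assumed additive form: R(s_t, a_t) + gamma * V(s_{t+1}, a_{t+1}).
    return reward + gamma * v_next


def sarsa_update(V, s, a, r, s_next, a_next, alpha=0.5, gamma=1.0):
    """Move V(s, a) toward the recursively defined target.

    V is a table keyed by (state, action); alpha and gamma are
    hypothetical hyperparameters chosen for the example.
    """
    target = td_target(r, V[(s_next, a_next)], gamma)
    V[(s, a)] += alpha * (target - V[(s, a)])
    return V[(s, a)]


V = defaultdict(float)            # unseen (state, action) pairs start at 0.0
V[("s1", "not_send")] = 4.0       # hypothetical existing estimate
new_value = sarsa_update(V, "s0", "send", 1.0, "s1", "not_send")
```

With these example numbers the target is 1.0 + 4.0 = 5.0 and the stored estimate for ("s0", "send") moves halfway from 0.0 toward it.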

Examiner notes that the original disclosure supports the interpretation of a recurrent machine learning model as claimed to be inclusive of reinforcement learning models as taught by the cited references: Cal [0064]; Tom [0059]-[0065]; Ant, Sec. 2; and Zhu [0050]-[0056]. Also, see the definition of reinforcement learning (RL), noted in applicant’s specification as a recurrent model; the term RL is known to a person of ordinary skill in the art as defined in http://www.scholarpedia.org/article/Reinforcement_learning: Reinforcement learning (RL) is learning by interacting with an environment. An RL agent learns from the consequences of its actions, rather than from being explicitly taught and it selects its actions on basis of its past experiences (exploitation) and also by new choices (exploration), which is essentially trial and error learning. The reinforcement 


This definition is in line with the claim limitations as disclosed and described by the applicant’s specification. The examiner has followed the guidelines per MPEP 2111. The examiner cautions the applicant that the recitation in the specification regarding the use of recurrent neural networks and the equation noted in [0024] are not expressly required by the claim limitations, and that the specification and claim limitations do not appear to have sufficient disclosure to support an interpretation that a recurrent learning model, as claimed, excludes reinforcement learning models and Q-learning. See MPEP 2111 for guidelines regarding the proper interpretation of claim scope in light of applicant’s disclosure.
The rejection is maintained.


	“Applicant respectfully submits that there is no disclosure in these portions or any other portions of Cal regarding generating, using the trained machine-learning model, a second reward estimate representing a second measure of the target user engaging in the one or more predetermined activities at the future time period based on at least a second-action representation of not sending the particular notification to the target user in the particular time period, as recited by independent Claim 1. Tom and Ant do not make up for the deficiencies of Cal, either alone or in combination, and the Examiner does not assert otherwise.

Furthermore, the proposed Cal-Tom-Ant combination fails to disclose, teach, or suggest comparing a difference between the first reward estimate of the target user engaging in the one or more predetermined activities upon sending of the particular notification and the second reward estimate of the target user engaging in the one or more predetermined activities without sending of the particular notification to a predetermined threshold and sending the particular notification to the target user based on the difference between the first reward estimate of the target user engaging in the one or more predetermined activities upon sending of the particular notification and the second reward estimate of the target user engaging in the one or more predetermined activities without sending of the particular notification being greater than the predetermined threshold, as independent Claim 1 further recites… [24] “In particular embodiments, the trained V(s,a) model may be used to implement an optimization policy for sending out notifications based on the predicted effectiveness of a notification on an administrator: … [27] “In particular embodiments, once the recurrent machine-learning is trained, it may be used in operation to predict the effectiveness of a notification on a particular target user and use the prediction to decide whether to send that target user a notification. At step 130, the system may determine whether there are potential target users to notify. For example, the system may wish to send notifications to 50 million of 300 million page administrators. At step 140, the system may access data associated with a target user, such as data comprising the target user’s state. At step 150, the system may generate a first reward estimate associated with the target user using the trained recurrent machine-learning model. The first reward estimate output by the model may represent a predicted cumulative reward of sending a notification to the target user. 
For example, the target user’s state and an action representation of sending a notification (e.g., V(s, send)) to the target user may be used as inputs for the trained recurrent machine-learning model. In particular embodiments, at step 160, the system may generate a second reward estimate associated with the target user using the trained recurrent machine-learning model. The second reward estimate output by the model may represent a predicted cumulative reward of not sending a notification to the target user. For example, the target user’s state and an action representation of not sending a notification (e.g., V(s, not_send)) to the target user may be used as inputs for the trained recurrent machine-learning model. 

		
	In response, the examiner notes that the Cal reference teaches sending notifications as messages, including resources and assessments, for users and target users to interact with in a particular time interval over a sequence of state-action pairs as depicted in Figs. 3-4. The set of messages that can be sent to a user are activities for the user to interact with for determining a reward value, as captured in Figs. 4A and 5A-D. Cal also teaches using a threshold to filter the set of records considered by the learning system, which filters records where a message ID is present or not present (i.e. not sent) in the set of filtered records (e.g. Fig. 4A: {1,4,7} sent to action A4; {1,7} not sending 4 and interacting with resource 3; and {2,4,7} not sending 1 and interacting with A4, where each state-action entry has a reward value captured within a time period depicted in Fig. 3), and the process for filtering the depicted user log data based on a predetermined threshold, in [0073]: Once the highest-value filtered (feedback-generating) record is ascertained in block 374, method 376 proceeds to block 376. Block 376 involves procuring all of the block 372 filtered state table records which have values within a threshold range of the block 374 highest value state table record. In the case of example filtered state table 372A, method 376 involves procuring all of the records having values within a threshold range of the value of record 374A. The particular threshold used in block 376 may be a configurable (e.g. user configurable or system configurable) parameter of learning system 100. FIG. 5C shows a set of filtered and thresholded records 376A [including comparing a difference between the first reward estimate of the target user engaging in 
	Per MPEP 2111, claims must be given their broadest reasonable interpretation in light of applicant’s specification, and the notion of requiring a binary flag value, as argued by the applicant, is not expressly required by the applicant’s claims. The Cal reference teaches the amended claims as claimed. Thus, the rejection made in the previous office action has been maintained and has been updated to address the newly amended claim limitations.

Claim Rejections - 35 USC § 112-Written Description and New Matter
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.

The following is a quotation of the first paragraph of pre-AIA  35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person 

Claims 1-2, 5-14, 17-18, and 21-26 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA ), first paragraph, as failing to comply with the written description requirement. The claim(s) contains subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA  35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention.

Regarding claim 1, the limitation “by the computing device, comparing a difference between the first reward estimate of the target user engaging in the one or more predetermined activities upon sending of the particular notification and the second reward estimate of the target user engaging in the one or more predetermined activities without sending of the particular notification to a predetermined threshold; and by the computing device, sending the particular notification to the target user based on the difference between the first reward estimate of the target user engaging in the one or more predetermined activities upon sending of the particular notification and the second reward estimate of the target user engaging in the one or more predetermined activities without sending of the particular notification being greater than the predetermined threshold.” (emphasis added) contains subject matter which was not described in the applicant’s original disclosure. Specifically, the applicant’s specification, filed 4/27/2017, discloses a machine learning process for making predictions where the use of a threshold is disclosed as follows:
Applicant’s original disclosure notes that comparing differences between a first reward based on sending the particular notification and a second for not sending the particular notification within a predetermined threshold based on  not sending the claimed notification being greater than the 
 “…Based on the long-term value ( e.g., predicted reward over a 7 day period) of sending a notification and not sending a notification to each page-admin pair, notification may be selectively sent to those page-admin pairs that maximizes this difference (the difference represents an improvement or effectiveness of sending notifications). In particular embodiments, a notification may be sent to a particular administrator when the score difference is larger than a predetermined threshold, for example. After this, that pair may be removed from the candidate set, and the process may repeat N times until the target number of notifications has been sent (e.g., 50 million) or until the difference falls below some threshold… In particular embodiments, the system may determine whether to send a notification to the target user based on a difference between the first reward estimate and the second reward estimate, which may represent a predicted measure of the effectiveness of the notification on the target user. The difference may be compared to a predetermined threshold to determine whether sending notification is justified. At step 180, the system may send a notification to the target user if the threshold criteria are met. If the threshold criteria are not met, the system may choose to not send any notification to the target user.…”, in para.[24]- [28] of pages 8-10. 
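
For illustration only, the decision process described in the quoted passage (compare the difference between the two reward estimates to a predetermined threshold, and repeat over candidates until a target count is reached or the difference falls below the threshold) can be sketched as follows. This is a minimal sketch and not the applicant’s implementation; the function names, the dictionary of estimates, and the example values are hypothetical.

```python
def should_send(v_send, v_not_send, threshold):
    # Send only when the estimated gain from sending exceeds the threshold.
    return (v_send - v_not_send) > threshold


def select_recipients(estimates, threshold, target_count):
    """Pick users in order of decreasing difference V(s, send) - V(s, not_send).

    estimates maps a hypothetical user id to (v_send, v_not_send).
    Selection stops once target_count users are chosen or the next
    difference no longer exceeds the threshold.
    """
    ranked = sorted(
        estimates.items(),
        key=lambda item: item[1][0] - item[1][1],
        reverse=True,
    )
    chosen = []
    for user, (v_send, v_not_send) in ranked:
        if len(chosen) >= target_count:
            break
        if not should_send(v_send, v_not_send, threshold):
            break
        chosen.append(user)
    return chosen


# Hypothetical reward estimates for three candidate users.
estimates = {"u1": (5.0, 1.0), "u2": (2.0, 1.9), "u3": (4.0, 2.0)}
recipients = select_recipients(estimates, threshold=0.5, target_count=2)
```

In this example the differences are 4.0 for u1, 2.0 for u3, and about 0.1 for u2, so u1 and u3 are selected and u2 would be skipped even with a larger target count.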

The applicant’s original specification/disclosure fails to disclose the amended claim limitations as claimed. The newly amended claims are directed to a specific sequence of steps in which the reward is estimated for a notification that is not sent based on a criterion being met. The specification appears to cover only a process that depends on when the notification is not sent and a threshold not being met; it is not clear how the newly recited order of operations is supported by, or can be inferred from, the original specification. The application does not disclose the claimed order of operations for processing the notifications not sent based on meeting a threshold criterion when comparing differences between sending a particular notification to a target user engaging with one or more activities and satisfying the particular notification being greater than the predetermined threshold, as claimed by the amended limitations. In addition, the applicant’s filed remarks fail to identify where the claim amendments are supported by the original disclosure. Therefore, the amended claim limitation is considered new matter.
	
Regarding claims 13 and 17, the claims are similar to claim 1 and are rejected under the same rationale.


Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 1-2, 5-14, 17-18, and 21-26 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA  35 U.S.C. 112, the applicant), regards as the invention.
	Regarding claim 1, the claim recites the limitation “by the computing device, comparing a difference between the first reward estimate of the target user engaging in the one or more predetermined activities upon sending of the particular notification and the second reward estimate of the target user engaging in the one or more predetermined activities without sending of the particular notification to a predetermined threshold; and by the computing device, sending the particular notification to the target user based on the difference between the first reward estimate of the target user engaging in the one or more predetermined activities upon sending of the particular notification and the second reward estimate of the target user engaging in the one or more predetermined activities without sending of the particular notification being greater than the predetermined threshold.” that renders the claim indefinite because the claim is unclear. Specifically, the claim recites sending the notification to a predetermined threshold that is then used to make a comparison. The 
Regarding claims 13 and 17, the claims are similar to claim 1 and are rejected under the same rationale.
	Regarding claims 2, 5-12, 14, 18, and 21-26, which depend on claims 1, 13, and 17, the dependent claims are rejected as they fail to resolve the deficiencies noted in the independent claims above.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1-2, 5-14, 17-18, and 21-26 are rejected under 35 U.S.C. 103 as being unpatentable over Collison et al. (US Pub. No. 2012/0310961, hereinafter ‘Cal’) in view of Schlau et al. (US Pub. No. 2017/0140269, hereinafter ‘Tom’) in further view of Schwartz (NPL: A reinforcement learning method for maximizing undiscounted rewards, hereinafter ‘Ant’) in further view of Zhu (US Pub. No. 2018/0165745).

Regarding independent claim 1 limitations, Cal teaches a method comprising:
by a computing system, accessing a data set comprising … for training a recurrent machine-learning model, each … for training the recurrent machine-learning model being associated with a first user and comprising a (1) first action, a (2) first state, a (3) first reward, a (4) second action, and a (5) second state, wherein a first time period is associated with the first state, the first action, and the first reward, wherein a second time period following the first time period is associated with the second state and the second action, (using reinforcement learning machine learning techniques using SARSA as the claimed tuples, in [0064]: In some embodiments, updating the values of the selected state table records in block 308B [i.e. accessing a data set comprising … for training a recurrent machine-learning model, each … for training the recurrent machine-learning model being associated with a first user and comprising a (1) first action, a (2) first state, a (3) first reward, a (4) second action, and a (5) second state] involves using the state table in a model of a finite Markov Decision Process (MDP) and using a reinforcement learning technique [i.e. accessing a data set comprising … for training a recurrent machine-learning model, each … for training the recurrent machine-learning model being associated with a first user and comprising a (1) first action, a (2) first state, a (3) first reward, a (4) second action, and a (5) second state] to approximate solutions for updating the values of the selected state table records. In some embodiments, a suitable approximation technique for the value function of the model involves temporal difference (TD) reinforcement learning [i.e. a recurrent machine-learning model, each … for training the recurrent machine-learning model being associated with a first user and comprising a (1) first action, a (2) first state, a (3) first reward, a (4) second action, and a (5) second state, wherein a first time period is associated with the first state, the first action, and the first reward, wherein a second time period following the first time period is associated with the second state and the second action] which may be formulated to take advantage of the so-called eligibility trace … In some embodiments, the block 308 process of updating the values of the selected state table records involves application of a reinforcement learning technique known as Q-Learning [i.e. comprising a (1) first action, a (2) first state, a (3) first reward, a (4) second action, and a (5) second state]. In some embodiments, the block 308 process of updating the values of the selected state table records involves application of a reinforcement learning process known as State-Action-Reward-State-Action (or SARSA) [claimed tuples: comprising a (1) first action, a (2) first state, a (3) first reward, a (4) second action, and a (5) second state]…; Cal teaches the model table data as the accessed data set comprising tuples using machine learning techniques and reinforcement learning techniques by the computing system depicted in Fig. 1,
 in [0064]: In some embodiments, updating the values of the selected state table records in block 308B involves using the state table in a model of a finite Markov Decision Process (MDP) and using a reinforcement learning technique to approximate solutions for updating the values of the selected state table records [for training the recurrent machine-learning model being associated with a first user and comprising a (1) first action, a (2) first state, a (3) first reward, a (4) second action, and a (5) second state, wherein a first time period is associated with the first state, the first action, and the first reward, wherein a second time period following the first time period is associated with the second state and the second action]. In some embodiments, a suitable approximation technique for the value function of the model involves temporal difference (TD) [wherein a first and second time period associated with the respective first and second states used to compute the temporal difference as depicted in Fig. 3A] reinforcement learning which may be formulated to take advantage of the so-called eligibility trace…; captured as tuple table log elements of observation data records using a state table Q-learning technique, in [0064]: …reinforcement learning technique known as Q-Learning. In some embodiments, the block 308 process of updating the values of the selected state table records involves application of a reinforcement learning process known as State-Action-Reward-State-Action (or SARSA) Learning [each tuple being associated with a first user]…; wherein a first and second time period associated with the respective first and second states used to compute the temporal difference as depicted in Fig. 3A: 
[Image: Cal, Fig. 3A]
where a first and second time period associated with the respective first and second states captured by the time stamp field (start & end) as the recited first and second time period as depicted in Fig. 3A, in [0047]: Action interface 120 may track the actions of users 142 in relation to the resources in repositories 150 using an action log which may be stored in action database 124… In some embodiments, time stamp field(s) could comprise a single time stamp field indicating that the user accessed the information resource at a particular time or for a particular duration…)
wherein the (1) first action represents the computing system sending or not sending a first notification to the first user in the first time period, (Cal teaches the first action, which represents whether a first notification was sent to the first user in the first time period, as the captured user interactions with the online resources associated with an assessment, or as user actions interacting with a generated recommendation, as depicted in Figs. 3A and 5B, in [0028]: The set of accessible information resources may be referred as a state space and information about current position of a user in the state-space (e.g. a history of the information resources with which the user has interacted) may be referred to as the user's state. To move from one state to another within the state-space, a user interacts with an information resource. Such interaction of the user with an information resource may be referred to as an action… In particular embodiments, reinforcement-learning techniques may use feedback to ascribe, or otherwise determine, one or more values for state-action pairs. State-action pairs together with their corresponding values may be maintained in a state table. Such a state table may be used as a basis for providing information about recommended actions to a variety of users. Information provided (possibly including recommended actions) may be personalized for individual users and/or groups of users [i.e. wherein the (1) first action represents the computing system sending or not sending a first notification to the first user in the first time period, as the personalized recommended action captured in the user table]…. 
[0029] … In the illustrated embodiment, information resource repositories include the internet 150A, one or more general purpose information resource databases 150B and information resources which may be accessed from a learning management system 150C…; where the information in the repository on interactive user resources may be pre-organized, that is, displayed as sent notifications to the user as instructions, questions, directives, or interactive resources, in [0031]: Information resource repositories 150 may hold a wide variety of information resources having a corresponding wide variety of forms. By way of non-limiting example, information resources can comprise textual resources, audio resources, image-based resources, video resources, interactive resources, questions, assessments [wherein the first action represents the computing system sending or not sending a first notification to the first user in the first time period], executable applications, instructions or directives on how to access and/or use other resources, discussion posts or forums, instructor notes, hints, blogs, any combinations or sub-combinations of these types of resources and/or the like. In general, learning system 100 can accommodate any form of informational resource. In some types of repositories 150 (such as database 150B or learning management system 150C), information resources may be pre-organized or otherwise mapped or classified in some manner within the repository prior to being made accessible to learning system 100 [pre-organized, that is, displayed as sent notifications to the user as instructions, questions, directives, interactive resources]…; including assessment and initial recommendations associated with the first user’s monitored first action as depicted in Fig. 5B:
[AltContent: textbox ([img-media_image4.png])]








;Cal teaches the first state as representing the movement in the state space associated with the user's first action as depicted in Fig. 3A, in [0028]: The set of accessible information resources may be referred as a state space and information about current position of a user in the state-space (e.g. a history of the information resources with which the user has interacted) may be referred to as the user's state [the user state that represents a characteristic associated with the first user as indicated by the user ID (including first and second action data captured in the user’s action log)]. To move from one state to another within the state-space, a user interacts with an information resource [wherein the user log table captures the state representing the move characteristics of the user, including the activity sequences in the first state that represents characteristics associated with the first user in the first time period, as the characterized sequence of user activities associated with the state log of the first user, i.e. wherein the (1) first action represents the computing system sending or not sending a first notification to the first user in the first time period]. Such interaction of the user with an information resource may be referred to as an action… In particular embodiments, reinforcement-learning techniques may use feedback to ascribe, or otherwise determine, one or more values for state-action pairs. State-action pairs together with their corresponding values may be maintained in a state table. Such a state table may be used as a basis for providing information about recommended actions to a variety of users. 
Information provided (possibly including recommended actions) may be personalized for individual users and/or groups of users.; Examiner notes that the table documents the actions of sending and not sending a notification based on noted resource or assessment notification ID for an action state as shown in exemplar table entry depicted in Fig. 4A)
wherein the (2) first state represents characteristics associated with the first user in the first time period, wherein the (3) first reward represents a measure of activity performed by the first user in the first time period, (Cal teaches the reinforcement learning having the first reward that represents a measure of activity performed by the first user in the claimed first time period, as selected in the user state table records as depicted in Fig. 3, in [0064]: In some embodiments, updating the values of the selected state table records in block 308B involves using the state table in a model of a finite Markov Decision Process (MDP) and using a reinforcement learning technique to approximate solutions for updating the values of the selected state table records [capturing the measure of activity performed by the first user associated with the data record as depicted in Fig. 3A; i.e. wherein the (2) first state represents characteristics associated with the first user in the first time period, wherein the (3) first reward represents a measure of activity performed by the first user in the first time period]. In some embodiments, a suitable approximation technique for the value function of the model involves temporal difference (TD) [activity in the first time period as the temporal differences in the user activity time stamp in the user action log table as depicted in Fig. 3A; i.e. wherein the (2) first state represents characteristics associated with the first user in the first time period, wherein the (3) first reward represents a measure of activity performed by the first user in the first time period]…the block 308 process of updating the values of the selected state table records involves application of a reinforcement learning technique known as Q-Learning [wherein the (2) first state represents characteristics associated with the first user in the first time period, wherein the (3) first reward represents a measure of activity performed by the first user in the first time period]. In some embodiments, the block 308 process of updating the values of the selected state table records involves application of a reinforcement learning process known as State-Action-Reward-State-Action (or SARSA) Learning [each tuple being associated with a first user and comprising a first action, a first state, and a first reward that represents the measure of activity performed by the first user]; where the rewards represent the measure of activity as the values associated with state-action pairs of the first user, as disclosed in [0027]: …Feedback may comprise one or more feedback metrics which may be related to a user's interaction with the information resources. In particular embodiments, feedback comprises an assessment that comprises a feedback metric (or metrics). In particular embodiments, reinforcement-learning techniques use these feedback metrics to ascribe, or otherwise determine, one or more values for an action or a series of actions taken by a user in connection with the information resources. Such values may be used as estimates of the value of the same action or series of actions for other users… [0057] As discussed above, in the illustrated embodiment, state table 275 includes at least one value field 281 which may represent the value that system 100 associates with performing a corresponding action, given a corresponding state. 
For example, the third record 285 of state table 275 has a value field 281 which indicates if a user had interacted with resource items with resource IDs={1, 4, 7} (corresponding to the state field 277 of record 285), the next action of interacting with assessment A4 (corresponding to the action field 279 of record 285) has a value of 0.63. Value field 281 may comprise a numerical metric, such that value fields 281 of particular state table records may be easily compared to one another. In the case of the FIG. 4A example, given a state 277 corresponding to a user having interacted with resource IDs={1, 7}, system 100 considers there to be relatively more value in the next action being interacting with resource ID=11 (value=0.99) than interacting with resource ID=3 (value=0.72).)
wherein the (4) second action represents the computing system sending or not sending a second notification to the first user in the second time period following the first time period, (Cal teaches the second action as a second state-action pair in the user log, representing whether a second notification is sent, as part of the activity log associated with a resource recommendation, personalized prompts, or assessments captured in the user activity log depicted in Fig. 3A, in [0028]: The set of accessible information resources may be referred as a state space and information about current position of a user in the state-space (e.g. a history of the information resources with which the user has interacted) may be referred to as the user's state [the user state that represents a characteristic associated with the first user as indicated by the user ID (including first and second action data capturing the user's sequential activity as state-action pairs in a respective time period captured by the time stamp data in the user’s action log), i.e. wherein the (4) second action represents the computing system sending or not sending a second notification to the first user in the second time period following the first time period]. To move from one state to another within the state-space, a user interacts with an information resource. Such interaction of the user with an information resource may be referred to as an action… In particular embodiments, reinforcement learning techniques may use feedback to ascribe, or otherwise determine, one or more values for state-action pairs. State-action pairs together with their corresponding values may be maintained in a state table. Such a state table may be used as a basis for providing information about recommended actions to a variety of users. 
Information provided (possibly including recommended actions) may be personalized for individual users and/or groups of users [the user state that represents a characteristic associated with the first user as indicated by the user ID (including first and second action data capturing the user's sequential activity as state-action pairs in a respective time period captured by the time stamp data in the user’s action log)]…; capturing the plurality of user activities, including first and second actions associated with a notification sent to the user, as disclosed in [0057]: As discussed above, in the illustrated embodiment, state table 275 includes at least one value field 281 which may represent the value that system 100 associates with performing a corresponding action, given a corresponding state. For example, the third record 285 of state table 275 has a value field 281 which indicates if a user had interacted with resource items with resource IDs={1, 4, 7} (corresponding to the state field 277 of record 285), the next action of interacting with assessment A4 (corresponding to the action field 279 of record 285) has a value of 0.63 [capturing the plurality of user activities, including first and second actions associated with a notification sent to the user, as part of the respective sequence of action log entries]. Value field 281 may comprise a numerical metric, such that value fields 281 of particular state table records may be easily compared to one another. In the case of the FIG. 4A example, given a state 277 corresponding to a user having interacted with resource IDs={1, 7}, system 100 considers there to be relatively more value in the next action being interacting with resource ID=11 (value=0.99) than interacting with resource ID=3 (value=0.72). 
Examiner notes that the table documents the actions of sending and not sending a notification based on the noted resource or assessment notification ID for an action state, as shown in the exemplar table entry depicted in Fig. 4A [i.e. wherein the (4) second action represents the computing system sending or not sending a second notification to the first user in the second time period following the first time period])
and wherein the (5) second state represents characteristics associated with the first user in the second time period following the first time period; (Cal teaches the second state as representing the movement in the state space associated with the user's second action log entry captured in the user action log depicted in Fig. 3A, in [0028]: The set of accessible information resources may be referred as a state space and information about current position of a user in the state-space (e.g. a history of the information resources with which the user has interacted) may be referred to as the user's state [the user state that represents a characteristic associated with the first user as indicated by the user ID (including first and second action data captured in the user’s action log); i.e. and wherein the (5) second state represents characteristics associated with the first user in the second time period following the first time period]. To move from one state to another within the state-space, a user interacts with an information resource. Such interaction of the user with an information resource may be referred to as an action… In particular embodiments, reinforcement learning techniques may use feedback to ascribe, or otherwise determine, one or more values for state-action pairs. State-action pairs together with their corresponding values may be maintained in a state table [i.e. and wherein the (5) second state represents characteristics associated with the first user in the second time period following the first time period]. Such a state table may be used as a basis for providing information about recommended actions to a variety of users. Information provided (possibly including recommended actions) may be personalized for individual users and/or groups of users.; Examiner notes that the claimed time periods are captured as part of the reinforcement learning techniques, in [0064])
by the computing system, training the recurrent machine-learning model using the data set, wherein the recurrent machine-learning model is configured to take as inputs a state and an action and predict a cumulative reward over a plurality of time periods, wherein the state and the action are associated with a user and a time period, (Cal teaches training a model configured to take inputs as depicted in Figs. 2 & 3 using the learning management system depicted in Fig. 1, in [0064]: In some embodiments, updating the values of the selected state table records in block 308B involves using the state table in a model of a finite Markov Decision Process (MDP) and using a reinforcement learning technique to approximate solutions for updating the values of the selected state table records [computing system training a recurrent machine learning model corresponding to the state table model of a finite Markov Decision Process using reinforcement learning techniques]. In some embodiments, a suitable approximation technique for the value function of the model involves temporal difference (TD) reinforcement learning [i.e. by the computing system, training the recurrent machine-learning model using the data set, wherein the recurrent machine-learning model is configured to take as inputs a state and an action and predict a cumulative reward over a plurality of time periods, wherein the state and the action are associated with a user and a time period] which may be formulated to take advantage of the so-called eligibility trace λ. The eligibility trace λ may be a number between [0, 1] which may be used to weight the relevance of past steps (e.g. past states) to a current outcome (e.g. feedback), [the cumulative reward values depicted in Fig. 4A as a current outcome]. 
Where λ=0, only the most recent state-action pair in the user action log would be updated based on a combination of its existing value and the new value determined by the feedback metric of the block 306 feedback-generating action [the predicted cumulative reward values depicted in Fig. 4A as the updated outcome from the trained prediction model over the plurality of time periods associated with the existing values]. Where λ=1, all of the preceding state-action pairs in the user action log would be updated based on corresponding combinations of their existing values [wherein the recurrent machine learning model takes as inputs the state, the action, and the reward in the user activity data table in Fig. 3A to predict the cumulative reward values depicted in Fig. 4A as a current outcome] and the new value determined by the feedback metric of the block 306 feedback-generating action…; where the rewards are captured as the numerical values associated with the sequence of user actions as cumulative rewards, as disclosed in [0057]: As discussed above, in the illustrated embodiment, state table 275 includes at least one value field 281 which may represent the value that system 100 associates with performing a corresponding action, given a corresponding state [i.e. by the computing system, training the recurrent machine-learning model using the data set, wherein the recurrent machine-learning model is configured to take as inputs a state and an action and predict a cumulative reward over a plurality of time periods, wherein the state and the action are associated with a user and a time period]. For example, the third record 285 of state table 275 has a value field 281 which indicates if a user had interacted with resource items with resource IDs={1, 4, 7} (corresponding to the state field 277 of record 285), the next action of interacting with assessment A4 (corresponding to the action field 279 of record 285) has a value of 0.63. 
Value field 281 may comprise a numerical metric, such that value fields 281 of particular state table records may be easily compared to one another. In the case of the FIG. 4A example, given a state 277 corresponding to a user having interacted with resource IDs={1, 7}, system 100 considers there to be relatively more value in the next action being interacting with resource ID=11 (value=0.99) than interacting with resource ID=3 (value=0.72) [capturing the cumulative rewards associated with the sequence of actions].)
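The eligibility-trace weighting quoted from [0064] of Cal can be sketched as follows. This is a minimal illustration under stated assumptions and not Cal's implementation: the function name update_state_table, the learning rate alpha, and the choice to decay the weight by one factor of the trace parameter per step back in the action log are all hypothetical, chosen only to show that a trace value of 0 updates just the most recent state-action pair while a trace value of 1 updates every preceding pair.

```python
from typing import Dict, Hashable, List, Tuple

# A state-action pair, e.g. (frozenset of resource IDs already seen, next resource ID).
StateAction = Tuple[Hashable, Hashable]


def update_state_table(
    values: Dict[StateAction, float],
    action_log: List[StateAction],
    feedback: float,
    lam: float,
    alpha: float = 0.5,  # hypothetical learning rate, not taken from Cal
) -> Dict[StateAction, float]:
    """Return a copy of the state table with values nudged toward the feedback.

    Each pair in the action log is weighted by lam ** (steps back from the
    most recent pair), an assumed form of the eligibility trace.
    """
    updated = dict(values)
    for steps_back, pair in enumerate(reversed(action_log)):
        weight = lam ** steps_back  # equals 1.0 for the most recent pair
        if weight == 0.0:
            break  # older pairs receive no update when the trace has vanished
        old = updated.get(pair, 0.0)
        # Combine the existing value with the new feedback-derived value.
        updated[pair] = old + alpha * weight * (feedback - old)
    return updated


# Hypothetical two-entry user action log and state table.
state_table = {(frozenset({1}), 7): 0.2, (frozenset({1, 7}), 11): 0.4}
log = [(frozenset({1}), 7), (frozenset({1, 7}), 11)]

recent_only = update_state_table(state_table, log, feedback=1.0, lam=0.0)
all_pairs = update_state_table(state_table, log, feedback=1.0, lam=1.0)
```

With the trace parameter set to 0 only the last logged pair changes; with it set to 1 both logged pairs move toward the feedback value.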
wherein the cumulative reward is recurrently defined based on at least a reward associated with the user and the time period and an application of the recurrent machine-learning model to a subsequent state and a subsequent action associated with the user and a subsequent time period relative to the time period, (Cal teaches the cumulative reward as the updated feedback metric that is defined based on the reward associated with the user and time period, as the state-action value function for estimating state-action values, as recited in [0056]-[0057]: State field 277 represents a state of a user prior to the action 279 of the current record. State field 277 may comprise references to the resource IDs of particular resource items with which users may interact… Action field 279 represents a next action. As with the action field of user log 250 (FIG. 3A), action field 279 of state table 275 includes possible actions which correspond to the information resource types (e.g. general information resources, assessments and questions) being used by resource interface 112 of learning system 100 together with a resource ID reference… As discussed above, in the illustrated embodiment, state table 275 includes at least one value field 281 which may represent the value that system 100 associates with performing a corresponding action, given a corresponding state [the cumulative reward is recurrently defined based on at least a reward associated with the user]…, used to update predicted feedback outcome metric values by the learning model as recited in [0064]: In some embodiments, updating the values of the selected state table records in block 308B involves using the state table in a model of a finite Markov Decision Process (MDP) and using a reinforcement learning technique [the cumulative reward is defined based on at least an application of the recurrent machine-learning model to a subsequent state and a subsequent action associated with the user] to approximate solutions for updating the values of the selected state table records. In some embodiments, a suitable approximation technique for the value function of the model involves temporal difference (TD) reinforcement learning [wherein the time period is captured as the temporal difference associated with a reward associated with the user and an application of the recurrent machine-learning model to a subsequent state and a subsequent action associated with the user and a subsequent time period relative to the time period using the reinforcement learning technique; i.e. wherein the cumulative reward is recurrently defined based on at least a reward associated with the user and the time period and an application of the recurrent machine-learning model to a subsequent state and a subsequent action associated with the user and a subsequent time period relative to the time period] which may be formulated to take advantage of the so-called eligibility trace λ… Where λ=1, all of the preceding state-action pairs in the user action log would be updated based on corresponding combinations of their existing values and the new value determined by the feedback metric of the block 306 feedback-generating action. In some embodiments, the block 308 process of updating the values of the selected state table records involves application of a reinforcement learning technique known as Q-Learning [i.e. wherein the cumulative reward is recurrently defined based on at least a reward associated with the user and the time period and an application of the recurrent machine-learning model to a subsequent state and a subsequent action associated with the user and a subsequent time period relative to the time period]. In some embodiments, the block 308 process of updating the values of the selected state table records involves application of a reinforcement learning process known as State-Action-Reward-State-Action (or SARSA) Learning [i.e. wherein the cumulative reward is recurrently defined based on at least a reward associated with the user and the time period and an application of the recurrent machine-learning model to a subsequent state and a subsequent action associated with the user and a subsequent time period relative to the time period, using the application of the reinforcement learning process known as State-Action-Reward-State-Action]. In other embodiments, a Monte Carlo method may be used in the block 308 process of updating the values of the selected state table records.)
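For reference, the State-Action-Reward-State-Action update named in [0064] of Cal is conventionally written as shown below. This is the standard textbook form and is not a formula quoted from Cal; the step size \alpha and the discount factor \gamma are symbols assumed here for illustration only.

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \bigl[ r_t + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \bigr]
```

In this form the value for the pair (s_t, a_t) is updated from the reward r_t and the value of the subsequent pair (s_{t+1}, a_{t+1}), so one update uses a state, an action, a reward, a next state, and a next action.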
by the computing system, determining a particular notification for prompting a recipient of the particular notification to engage in one or more predetermined activities, (claimed particular notification as resources to engage in claimed predetermined activities as learning activities, in 0027: Aspects of the invention provide methods and systems for providing information based on feedback. Feedback may be incorporated into the information provided using reinforcement learning. Information provided by the methods and systems of particular embodiments can comprise information about feedback-driven recommendations for actions in connection with accessible information resources [claimed particular notification for prompting a recipient of the particular notification to engage in one or more predetermined activities]…; information resources as content/messages to the user associated with predetermined activities via a resource repository, in 0029-0033: In the FIG. 1 embodiment, system 100 comprises a learning system and the feedback-driven information retrieved by learning system 100 comprises feedback-driven recommendations for user actions in relation to information resources. Such information resources may comprise educational information or content…. For example, an information repository 150 could be a topical repository 150 on the topic of astronomy, in which case it may accept contribution of information resources from a number of astronomy experts. Information resource repositories 150 described herein are merely representative examples of suitable types of information repositories 150 and, unless specifically claimed, are not meant to be limiting... Information resource repositories 150 may hold a wide variety of information resources having a corresponding wide variety of forms. 
By way of non-limiting example, information resources can comprise textual resources, audio resources, image-based resources, video resources, interactive resources, questions, assessments, executable applications, instructions or directives on how to access and/or use other resources, discussion posts or forums, instructor notes, hints, blogs, any combinations or sub-combinations of these types of resources and/or the like…Resource interface 112 may pull information resources from repositories 150 and/or repositories 150 may push information resources to resource interface 112.)
by the computing system, generating, using the trained recurrent machine-learning model, a first reward estimate representing a first measure of a target user engaging in the one or more predetermined activities at a future time period occurring after a particular time period based on action representation of sending the particular notification to the target user in the particular time period; (claimed trained model as the model of the user current state space using claimed recurrent machine learning model as the recurrent learning techniques using feedback, in 0028: The set of accessible information resources may be referred as a state space and information about current position of a user in the state-space (e.g. a history of the information resources with which the user has interacted) may be referred to as the user's state. To move from one state to another within the state-space, a user interacts with an information resource. Such interaction of the user with an information resource may be referred to as an action. A current state of a user coupled with an action which will transition the user to a new state may be referred to as a state-action pair. The interaction of states and actions and how an action taken by a user transitions the user from one state to another state may be referred to as a model. In particular embodiments, reinforcement learning techniques may use feedback to ascribe, or otherwise determine, one or more values for state-action pairs….; where the current state space is used to generate subsequent action-state pairs at claimed future time from the current time associated with a captured action-state pair, as the claimed particular period of time based on the represented action depicted in Fig. 
5D for a target user as the personalized user in the modeled group of users, in 0028: … In particular embodiments, reinforcement learning techniques may use feedback to ascribe, or otherwise determine, one or more values for state-action pairs. State-action pairs together with their corresponding values may be maintained in a state table. Such a state table may be used as a basis for providing information about recommended actions to a variety of users. Information provided (possibly including recommended actions) may be personalized for individual users and/or groups of users.; and as depicted in Fig. 5D and captured in the state action table for the target user as depicted in Fig. 3A

[img-media_image5.png]

[img-media_image6.png]

In 109: … One technique for providing personalized action recommendations involves the use of the user's current state [claimed particular time period based on action representation of sending the particular notification to the target user in the particular time period], which is reflective of the history of actions of that user in relation to accessible information resources. By way of non-limiting example, the user's current state may be used by learning system 100 in some embodiments to personalize the recommendation blocks (356 and 358) of method 350 (FIG. 5B) and more particularly in connection with the illustrated embodiments of recommendation procedures 400 (FIG. 5D), 450 (FIG. 5E) and 500 (FIG. 5F). Each of these exemplary embodiments of methods for recommending actions may personalize recommended actions by taking into account the user's current state (or action history) when making recommendations.; and where updating the state-action values involves estimating a first reward in the SARSA reinforcement learning technique, in 0064: … In some embodiments, the block 308 process of updating the values of the selected state table records involves application of a reinforcement learning technique known as Q-Learning. In some embodiments, the block 308 process of updating the values of the selected state table records involves application of a reinforcement learning process known as State-Action-Reward-State-Action (or SARSA) Learning…)
by the computing system, generating, using the trained machine-learning model, a second reward estimate representing a second measure of the target user engaging in the one or more predetermined activities at the future time period based on at least a second-action representation of not sending the particular notification to the target user in the particular time period; (claimed second reward estimate as the action value associated with a set of filtered actions for the user to engage with that does not include the claimed particular notification, the claimed second action being part of the user’s activity path for performing the state-action update and being associated with a second-action representation that takes the user along another activity path captured by the filtered list of values, where the score in the list for the claimed second reward estimate of the claimed second action does not include sending the claimed notification to the target user, as depicted in Fig. 5E and Fig. 5F, in 0083: Method 450 then proceeds to block 458 which involves an inquiry into whether any of the block 454 set of path records have corresponding actions that are in the block 354 target state (see FIG. 5B). As discussed above, the block 354 target state comprises one or more action entries. If any of these action entries of the block 354 target state correspond to the action field of the block 454 set of path records, then the block 458 inquiry is positive. Otherwise, the block 458 inquiry is negative… If the block 458 inquiry is negative (i.e. there are no block 454 path records having action entries among the actions of the block 354 target state) [a second-action representation of not sending the particular notification to the target user in the particular time period], then method 450 proceeds to block 460. 
Block 460 involves setting aside the block 454 set of path records and generating a weighted average [claimed second reward estimate representing a second reward estimate representing a second measure of the target user engaging in the one or more predetermined activities at the future time period based on at least a second-action representation of not sending the particular notification to the target user in the particular time period] of the values for the state table records (within the block 353 subset) having an action in the block 354 target state… Once the weighted averages are calculated in block 460, method 450 proceeds to block 462 which involves selecting the action corresponding to the highest block 460 weighted average  [claimed second measure of the target user engaging in the one or more predetermined activities at the future time period based on at least a second-action representation of not sending the particular notification to the target user in the particular time period] to be the next recommended action [i.e. based on at least a second-action representation of not sending the particular notification to the target user in the particular time period]. In the case of the illustrative example set out above, block 462 would involve selecting action=7 (i.e. interact with resource ID=7) to be the next recommended action, since the weighted average for action=7 is greater than the weighted average for action=1...)
by the computing device, comparing a difference between the first reward estimate of the target user engaging in the one or more predetermined activities upon sending of the particular notification and the second reward estimate of the target user engaging in the one or more predetermined activities without sending of the particular notification to a predetermined threshold; (Cal teaches engaging the target user that is associated with the selected predetermined activities upon sending and not sending a particular notification as noted by the resource state id in the sequence of state and action pairs captured with their respective reward values at the respective observation sequence time interval as depicted in Figs. 4A and 5A-D and the use of the threshold in 0073: Once the highest-value filtered (feedback-generating) record is ascertained in block 374, method 376 proceeds to block 376. Block 376 involves procuring all of the block 372 filtered state table records which have values within a threshold range of the block 374 highest value state table record. In the case of example filtered state table 372A, method 376 involves procuring all of the records having values within a threshold range of the value of record 374A. The particular threshold used in block 376 may be a configurable (e.g. user configurable or system configurable) parameter of learning system 100. FIG. 5C shows a set of filtered and thresholded records 376A [including comparing a difference between the first reward estimate of the target user engaging in the one or more predetermined activities upon sending of the particular notification and the second reward estimate of the target user engaging in the one or more predetermined activities without sending of the particular notification to a predetermined threshold] corresponding to a value threshold of 0.40 from the highest-value record 374A.
In this exemplary case, the value of highest-value record is 0.99 and the threshold is 0.40, so only records having values greater than 0.99-0.40=0.59 or greater are admitted into the set of filtered and thresholded records 376A [i.e. by the computing device, comparing a difference between the first reward estimate of the target user engaging in the one or more predetermined activities upon sending of the particular notification and the second reward estimate of the target user engaging in the one or more predetermined activities without sending of the particular notification to a predetermined threshold].)
and by the computing device, sending the particular notification to the target user based on the difference between the first reward estimate of the target user engaging in the one or more predetermined activities upon sending of the particular notification and the second reward estimate of the target user engaging in the one or more predetermined activities without sending of the particular notification being greater than the predetermined threshold. (Cal teaches engaging the target user that is associated with the selected predetermined activities upon sending and not sending a particular notification as noted by the resource state id in the sequence of state and action pairs captured with their respective reward values at the respective observation sequence time interval as depicted in Figs. 4A and 5A-D and the use of the threshold in 0073: Once the highest-value filtered (feedback-generating) record is ascertained in block 374, method 376 proceeds to block 376. Block 376 involves procuring all of the block 372 filtered state table records which have values within a threshold range of the block 374 highest value state table record. In the case of example filtered state table 372A, method 376 involves procuring all of the records having values within a threshold range of the value of record 374A. The particular threshold used in block 376 may be a configurable (e.g. user configurable or system configurable) parameter of learning system 100. FIG. 5C shows a set of filtered and thresholded records 376A [i.e. by the computing device, sending the particular notification to the target user based on the difference between the first reward estimate of the target user engaging in the one or more predetermined activities upon sending of the particular notification and the second reward estimate of the target user engaging in the one or more predetermined activities without sending of the particular notification ...]
corresponding to a value threshold of 0.40 from the highest-value record 374A. In this exemplary case, the value of highest-value record is 0.99 and the threshold is 0.40, so only records having values greater than 0.99-0.40=0.59 or greater are admitted into the set of filtered and thresholded records 376A [i.e. by the computing device, sending the particular notification to the target user based on the difference between the first reward estimate of the target user engaging in the one or more predetermined activities upon sending of the particular notification and the second reward estimate of the target user engaging in the one or more predetermined activities without sending of the particular notification being greater than the predetermined threshold].)
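For illustration only, the block 376 thresholding quoted from Cal, 0073, can be sketched as follows. The function name, the record identifiers, and every record value other than 0.99 are hypothetical; only the 0.99 highest value and the 0.40 threshold come from the quoted passage. Treating the cutoff as inclusive is an assumption, since the passage says both "greater than" and "or greater".

```python
def filter_within_threshold(record_values, threshold):
    """Keep the records whose value lies within `threshold` of the highest value.

    record_values: dict mapping a (hypothetical) record id to its value.
    """
    highest = max(record_values.values())
    cutoff = highest - threshold  # 0.99 - 0.40 = 0.59 in the quoted example
    # Assumption: the cutoff is inclusive.
    return {rid: v for rid, v in record_values.items() if v >= cutoff}


# Hypothetical filtered state table records; only 0.99 appears in the quoted text.
records = {"A": 0.99, "B": 0.72, "C": 0.63, "D": 0.41}
kept = filter_within_threshold(records, threshold=0.40)
print(sorted(kept))
```

With these assumed values, records A, B, and C pass the cutoff and record D does not.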
Cal teaches the learning system that trains a model using reinforcement learning techniques using State, Action, Reward, State, Action, in [0064] that are usually captured as tuples in computing applications, and depicted in Figs. 5A-E for generating notifications to the user using feedback information, in [0076]-[0081]. Cal does not expressly disclose the data set as a set of tuples.
 Tom does expressly disclose the data set as a set of tuples as recited in claim 1 limitation:
accessing a data set comprising tuples, (Tom teaches accessing data as observation tuples characterizing a state corresponding to a performed action by an agent under observation and reward in response to a performed action, in [0011] In some implementations, each piece of experience data is an experience tuple [i.e. accessing a data set comprising tuples] that includes a respective current observation characterizing a respective current state of the environment, a respective current action performed by the agent in response to the current observation, a respective next state characterizing a respective next state of the environment, and a reward received in response to the agent performing the current action)
training a recurrent machine-learning model using the data set, (Tom teaches where the observations are used to train the reinforcement learning model using the training engine, in [0030]-[0032] The reinforcement learning system 100 selects actions to be performed by a reinforcement learning agent 102 interacting with an environment 104. That is, the reinforcement learning system 100 receives observations, with each observation characterizing a respective state of the environment 104, and, in response to each observation, selects an action from a predetermined set of actions to be performed by the reinforcement learning agent 102 in response to the observation [training a recurrent machine-learning model using the data set]. In response to some or all of the actions performed by the agent 102, the reinforcement learning system 100 receives a reward. Each reward is a numeric value received from the environment 104 as a consequence of the agent performing an action, i.e., the reward will be different depending on the state that the environment 104 transitions into as a result of the agent 102 performing the action. In particular, the reinforcement learn­ing system 100 selects actions to be performed by the agent 102 using an action selection neural network 110 and a training engine 120. 
The action selection neural network 110 is a neural network that receives as an input an observation about the state of the environment 104 and generates as an output a respective Q value for each action, i.e., a prediction of expected return resulting from the agent 102 performing the action in response to the observation [a cumulative reward over a plurality of time periods in the observation] To allow the agent 102 to effectively interact with the environment 104, the reinforcement learning system 100 includes a training engine 120 that trains the action selection neural network 110 to determine trained values of the parameters of the action selection neural network 110.; where the reinforcement learning model training the model to make the prediction over a plurality of time periods as the temporal difference learning error, in [0039]-[0041]: … The system also maintains (in the replay memory or in a separate storage component) an expected learning progress measure for some or all of the pieces of experience data. An expected learning progress measure associated with a piece of experience data is a measure of the expected amount of progress made in the training of the neural network if the neural network is trained using the piece of experience data… the system determines an expected learning progress measure associated with an experience tuple based on a previously calculated temporal difference error for the experience tuple, i.e., the temporal difference error from the preceding time the experience tuple was used in training the neural network… the expected learning progress measure is an absolute value of the temporal difference learning error determined for the experience tuple the preceding time the experience tuple was used in training the neural network. 
In some implementations, the expected learning progress measure is derivative of an absolute value of a temporal difference learning error determined for the experience tuple the preceding time the experience tuple was used in training the neural network. 
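As a hedged sketch of the expected learning progress measure quoted from Tom, [0039]-[0041], the absolute temporal difference error of an experience tuple might be computed as below. The quoted text does not give a formula, so the conventional Q-learning form of the temporal difference error, the discount parameter `gamma`, and all function and argument names are assumptions made for illustration.

```python
def td_error(reward, q_current, q_next_max, gamma=0.99):
    """Conventional Q-learning temporal difference error (assumed form).

    reward:      reward stored in the experience tuple
    q_current:   Q value of the current observation/action pair
    q_next_max:  highest Q value available in the next state
    """
    return reward + gamma * q_next_max - q_current


def expected_learning_progress(reward, q_current, q_next_max, gamma=0.99):
    """Absolute value of the temporal difference error, used as a priority."""
    return abs(td_error(reward, q_current, q_next_max, gamma))


# Hypothetical numbers: |1.0 + 0.5 * 2.0 - 0.5| = 1.5
priority = expected_learning_progress(1.0, 0.5, 2.0, gamma=0.5)
print(priority)
```

A tuple with a larger absolute error would be treated as more valuable for the next training step under this reading.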
The Cal and Tom references would have been recognized by those of ordinary skill in the art as useful for applicant’s purpose in developing information processing method using machine learning techniques.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to integrate the method for observing interaction learning prediction using reinforcement learning techniques as disclosed by Tom with the method of information processing using machine learning techniques as disclosed by Cal.
One of ordinary skill in the arts would have been motivated to integrate the disclosed methods by Cal and Tom in order to enable reinforcement learning systems to respond to received observations using neural networks to predict output from a received input (Tom, [0002]-[0005]); doing so will realize the following advantages: “training data from a replay memory [that] can be selected in a way that increases the value of the selected data for training a neural network. This can, in turn, increase the speed of training of neural networks used in selecting actions to be performed by agents and reduce the 
	While Cal and Tom disclose the reinforcement learning techniques for modeling and learning sequential patterns using a Q-learning method by accounting for the temporal differences to determine a threshold, the references do not expressly describe this inherent feature of accounting for the claimed temporal differences as part of the Q-learning process.
	Ant does disclose the use of the temporal differences as part of the Q-learning process. (in Sec. 2: Background: In RL, the learner’s environment is modelled as a Markov Decision Process (MDP). An MDP specifies a set of states and a set of actions. At each step in time the process is in some state, and an action must be chosen; this action has the effect of changing the current state and producing a scalar reinforcement value. The reinforcement value, or reward, represents the extent to which we can consider the action to have had immediately desirable or undesirable consequences. Formally, an MDP is described by a 4-tuple; And including a first estimated reward
    [equation image: media_image7.png]
; And a second estimated reward, as  r π , received at a future time step t and computing the discount criterion as the claimed  threshold criterion, in pg. 2: Left Col. …The goal of RL methods is to arrive, by performing actions and observing their outcomes, at a policy which maximizes the rewards accumulated over time.
Q-learning [17] is the most widely used and studied method for RL and, like most, it uses discounted value as its criterion of optimality [claimed threshold criterion]. The discounted value, or discounted return, of a policy in a state is deﬁned as the expected value     
    [equation image: media_image8.png]

    [image: media_image9.png]
)
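The discounted return referred to in the quoted Ant passage can be illustrated with the short sketch below. Because the equation images are not reproduced in this text, the sketch uses the conventional definition (the sum of rewards, each weighted by a power of a discount factor), which is an assumption rather than a transcription of Ant; the names `rewards` and `gamma` are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward at step t is weighted by gamma**t.

    rewards: sequence of rewards r_0, r_1, ... observed along one trajectory
    gamma:   discount factor, assumed to lie in [0, 1]
    """
    total = 0.0
    # Fold from the last reward backwards: G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical trajectory: 1 + 0.5 * 1 + 0.25 * 1 = 1.75
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```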
The Cal, Tom, and Ant references would have been recognized by those of ordinary skill in the art as useful for applicant’s purpose in developing information processing method using machine learning techniques.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to integrate the method for using Q-learning optimization criterion method for reinforcement learning as disclosed by Ant with the method of information processing using machine learning techniques as collectively disclosed by Cal and Tom.
One of ordinary skill in the art would have been motivated to integrate the disclosed methods by Ant, Cal, and Tom in order to enable reinforcement learning techniques for learning to estimate the rewards that performing an action may reap over the course of time, so as to choose actions maximizing that measure (Ant, Sec. I: Introduction); doing so will help boost performance of the reinforcement learning algorithm (Ant, Sec. I: Introduction).
While Cal teaches in combination with Tom and Ant the process for using reinforcement learning to compare differences in observations based on Q-learning state-action reward sequences for 
Zhu teaches use of notations to capture the ordering of the state reward message and behaviors to help filter rewards associated with sent content and content not made available to the user at a state noted with an index as done in noting algorithms in computing applications using reinforcement learning techniques. (in 0050-00: FIG. 3 is a schematic diagram of a model of an MDP provided by the present disclosure. As shown in FIG. 3, the MDP involves two entities, i.e., an agent 302 and an environment 304, that interact with each other. The Agent is an entity that makes decisions. The environment is an entity for information feedback. For example, in the application scenario of product recommendation technology, the Agent may be set as the main subject for making product recommendation decisions, and the environment may be set to feedback the user's behavior of clicking browsed products and purchasing products to the Agent. MDP may be represented by a four-tuple <S, A, R, T>. (1) S is a State Space, which contains a set of environmental states that the Agent may perceive.
(2) A is an Action Space, which contains the set of actions the Agent may take on each state of the environment. [0053] (3) R is a Rewarding Function, and R (s, a, s') represents the reward that the Agent obtains from the environment when the action a is performed on the state s and the state is changed to state s'. (4) T is the State Transition Function, and T (s, a, s') can represent the probability of executing action a on state s and moving to state s'. [i.e. a (1) first action, a (2) first state, a (3) first reward, a (4) second action, and a (5) second state, wherein a first time period], … Agent senses that the environment state at time t is st. Based on the environment state st, the Agent may select an action at from the action space A to execute…; And in 0061-0091: In combination with the above MDP model, the recommendation server 210 corresponds to the Agent, and the current link state of the user corresponds to the state s. The Agent determines the current state s, and according to a certain strategy, outputs the corresponding action a. Correspondingly, the recommendation server 210 may provide the recommended behavior according to a certain recommendation strategy and the current link status of the user. In this example embodiment, the link status may include a plurality of key operation behaviors of the user within a preset time interval that are ranked based on time sequence [i.e. being greater than the predetermined threshold]… In an example embodiment, the key operation page may include a page with an influence factor greater than a preset threshold on the preset user behavior [i.e. computing device, sending the particular notification to the target user based on the difference between the first reward estimate of the target user engaging in the one or more predetermined activities upon sending of the particular notification].
The influence factor may include a value of influence on a preset user behavior, and the preset user behavior may include a user transaction decision … In an example embodiment of the present disclo­sure, the plurality of key operation pages may be selected through a product category identifier and a key operation page identifier. For example, the product category identifier may include a product category ID… In another example embodiment of the present disclosure, a plurality of preliminary operation behaviors associated with the key operation page may be firstly screened out [and the second reward estimate of the target user engaging in the one or more predetermined activities without sending of the particular notification  being greater than the predetermined threshold] from the plurality of operation behaviors and then the plurality of preliminary operation behaviors asso­ciated with the particular product category are screened out from the plurality of preliminary operation behaviors. As shown in FIG. 10, the S404 may include: Sl 002: for a specific key operation page, a plurality of preliminary operation behaviors that are associated with the specific key operation page are filtered from the plurality of operation behaviors;


It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to integrate the method for observing interaction learning prediction using reinforcement learning techniques as disclosed by Zhu with the method of information processing using machine learning techniques as disclosed by Cal, Tom, and Ant.
One of ordinary skill in the art would have been motivated to integrate the disclosed methods by Zhu, Cal, Tom, and Ant in order to enable a data analysis server that performs learning processing on the key operation behavior by using a reinforcement learning method to obtain a product recommendation strategy for the user (Zhu, [0012]); doing so provides an intelligent recommendation method using reinforcement learning algorithms that can be applied to the key operation behavior sequence “to learn more accurately the user preferences, intentions, and other information to improve the accuracy of product recommendation. In addition, the extraction and dimension reduction of multiple operational behaviors also further enhance the efficiency of learning” (Zhu, 0017).
Examiner notes that the original disclosure supports the interpretation of a recurrent machine learning model as claimed to be inclusive of reinforcement learning models as taught by the cited references. The applicant can be his/her own lexicographer, and as claimed and in light of applicant’s specification [20]-[23]: … One advantage of the recurrent model is that the recurrent definition may be a better approximation of a future reward than any single example, so the recurrent model would provide a more accurate result. The recurrent model V(st, at) may be defined as follows: … V(st, at) = R(st, at) + γ * V(st+1, at+1) where st represents a state on day t (or any other time-period unit), at represents an action taken by the system on day t, R(st, at) represents the reward on day t, and γ represents a decay function as described above. In this model, the rewards associated with time periods beyond t are V(st+1, at+1)… The training data may include any number of tuples (or data points). In particular embodiments, each tuple in the training data set may be associated with a page-admin pair and include the data (St, At, Rt, St+1, At+1), …; And claims must be given their broadest reasonable interpretation in light of applicant’s specification, see MPEP 2111. Should the applicant want the use of a recurrent neural network, that needs to be included as part of the claim language, as claim limitations cannot be imported from the specification, see MPEP 2111.
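For illustration only, the recurrent definition quoted from applicant's specification, V(st, at) = R(st, at) + γ * V(st+1, at+1), could be applied to one training tuple (St, At, Rt, St+1, At+1) as sketched below. The class and function names, the use of a scalar decay value, and the stand-in value function are assumptions; nothing here is taken from the specification beyond the formula and the tuple layout.

```python
from typing import Callable, NamedTuple


class Transition(NamedTuple):
    """One training tuple (S_t, A_t, R_t, S_t+1, A_t+1); field types are assumed."""
    state: float
    action: int
    reward: float
    next_state: float
    next_action: int


def training_label(t: Transition,
                   value_fn: Callable[[float, int], float],
                   decay: float) -> float:
    """Label for the input (state, action): R_t + decay * V(S_t+1, A_t+1)."""
    return t.reward + decay * value_fn(t.next_state, t.next_action)


# Hypothetical stand-in for the current model estimate V(s, a).
def constant_value(state: float, action: int) -> float:
    return 2.0


example = Transition(state=0.0, action=1, reward=1.0, next_state=0.5, next_action=0)
print(training_label(example, constant_value, decay=0.5))  # 1.0 + 0.5 * 2.0
```

Under this reading, the first state and first action are the model inputs, and the first reward, second state, and second action together determine the label.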

Regarding claim 2, the rejection of claim 1 is incorporated and Cal in combination with Tom, Ant, and Zhu further teaches the method of claim 1,
wherein the definition of the cumulative reward further comprises a decay function applied to the application of the recurrent machine-learning model to the subsequent state and the subsequent action. (Cal teaches defining the reward feedback metric, e.g. cumulative reward associated with the sequential state-action pairs, in [0064] In some embodiments, a suitable approximation technique for the value function of the model involves temporal difference (TD) reinforcement learning which may be formulated to take advantage of the so-called eligibility trace λ [wherein the application of the trace function corresponds to the decay function applied to the recurrent model]. The eligibility trace λ may be a number between [0, 1] which may be used to weight the relevance of past steps (e.g. past states) to a current outcome (e.g. feedback). Where λ=0, only the most recent state-action pair in the user action log would be updated based on a combination of its existing value and the new value determined by the feedback metric of the block 306 feedback-generating action. Where λ=1, all of the preceding state-action pairs in the user action log would be updated based on corresponding combinations of their existing values [a decay function applied to the recurrent model] and the new value determined by the feedback metric of the block 306 feedback-generating action.)
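A minimal sketch of how an eligibility trace λ can weight past state-action pairs, consistent with the two boundary cases quoted from Cal, [0064], is given below. The specific update rule (moving each value toward the feedback by a fraction scaled by λ raised to the number of steps back), the `step_size` parameter, and all names are assumptions; Cal's quoted text states only that existing values are combined with the new value.

```python
def trace_weighted_update(values, feedback, lam, step_size=0.5):
    """Update a user's logged state-action values toward a feedback metric.

    values:   existing values for the logged state-action pairs, oldest first
    feedback: feedback metric from the feedback-generating action
    lam:      eligibility trace in [0, 1]
    """
    n = len(values)
    updated = []
    for i, v in enumerate(values):
        steps_back = n - 1 - i       # 0 for the most recent pair
        weight = lam ** steps_back   # lam = 0 gives weight 1 only when steps_back == 0
        updated.append(v + step_size * weight * (feedback - v))
    return updated


log_values = [0.2, 0.4, 0.6]  # hypothetical existing values
print(trace_weighted_update(log_values, feedback=1.0, lam=0.0))
print(trace_weighted_update(log_values, feedback=1.0, lam=1.0))
```

With λ=0 only the last entry changes, and with λ=1 every entry moves toward the feedback, matching the two cases described in the quoted paragraph.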

Regarding claim 5, the rejection of claim 1 is incorporated and Cal in combination with Tom, Ant, and Zhu further teaches the method of claim 1,
wherein during the training: the first action and the first state of each of the tuples in the data set are used as inputs to the recurrent machine-learning model, and the first reward, the second state, and the second action of each of the tuples in the data set are used to define a label for the tuple (Cal teaches that the resources can be associated with a label that is the captured in the user log activity table used as inputs to the recurrent machine learning model in [0054] FIG. 3B is a schematic resource-space diagram 255 corresponding to the FIG. 3A action log 250. In the FIG. 3B representation 255, each square corresponds to a resource item and is labeled with its corresponding resource ID [the second action of each of the tuples in the data set are used to define a label for the tuple associated with the second entry in the action log label with the corresponding resource ID]. The circles in FIG. 3B represent actions that the user has done and the dashed line represents the path that the user took between actions. FIG. 3B shows that the user progressed through interacting with resource items 1, 3, 4, 7 and 6 before taking assessment A4.; where the tables used to define the resource, labels are used to train the values of the table using a reinforcement learning technique, in [0063]-[0064]  In currently preferred embodiments, the block 308 process of updating the state table involves the application of reinforcement learning techniques. 
In some embodiments, the block 308 process of updating the state table may involve the two step process of: selecting the state table records to update (as shown in optional block 308A of the illustrated embodiment); and selecting one or more new values for each selected state table record (as shown in optional block 308B of the illustrated embodiment)… In some embodiments, updating the values of the selected state table records in block 308B involves using the state table in a model of a finite Markov Decision Process (MDP) and using a reinforcement learning technique to approximate solutions for updating the values of the selected state table records [the first action and the first state of each of the tuples in the data set are used as inputs to the recurrent machine-learning model]. In some embodiments, a suitable approximation technique for the value function of the model involves temporal difference (TD) reinforcement learning which may be formulated to take advantage of the so-called eligibility trace λ.)

Regarding claim 6, the rejection of claim 1 is incorporated and Cal in combination with Tom, Ant, and Zhu further teaches the method of claim 1,
wherein each of the tuples in the data set is associated with an administrator and online content managed by the administrator. (Cal teaches the use of learning management system an administrator and online content as the internet resources that can be managed (e.g. accessed) by the administrator, in [0029] … Leaning system 100 can access informa­tion resources from one or more information resource reposi­tories 150. In the illustrated embodiment, information resource repositories include the internet 150A, one or more general purpose information resource databases 150B and information resources which may be accessed from a learning management system 150C. In other embodiments, learning system 100 can interact with a different number (more or fewer) of information resource repositories, different types of information resource repositories and/or the like. And the assessment system administrator for managing the online assessments, in [0044]  … In some embodiments, browser activity data 152 may comprise grades for assessments taken by user 142. In other embodi­ments, an assessment manager 128 may be provided to deter­mine or otherwise obtain grades for assessments taken by user 142. As discussed in more detail below, such grades may be used by system 100 as feedback metrics. )
Examiner notes that Tom teaches the use of tuples in capturing data items, in 0011. It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the teachings of Cal and Tom for the same reasons disclosed above.

	 

Regarding claim 7, the rejection of claim 1 is incorporated and Cal in combination with Tom, Ant, and Zhu further teaches the method of claim 1,
wherein the first user is an administrator of a page on a social-networking system; (Cal teaches the first user as an administrator of the page of a social-networking system as the learning system depicted in Fig. 1 supporting content on user devices where the user administrator has access to the information resources from the system repositories, in [0044]  Additionally or alternatively, in some embodiments, user interface 154 to system 100 (or some other monitoring agent which may be present on the computing device of user 142) may operate passively while user 142 accesses informa­tion resources from repositories 150 via another independent application program (not expressly shown). In such embodi­ments, user interface 154 to system 100 may operate in par­allel with, or in the background of, the independent applica­tion. The independent application may provide an independent user interface through which user 142 accesses information resources from repositories 150. By way of non­limiting example, such an independent could comprise an internet browser or a LMS user interface application [wherein the first user is an administrator of a page on a social-networking system and actions are captured as user interaction with the browser page presented on the user device]… 
wherein the first state further represents characteristics associated with the page in the first time period; and wherein the second state further represents characteristics associated with the page in the second time period. (Cal teaches the user action characteristics captured as the resources associated with state-action log with the resource browser page as disclosed in [0044]; where the user log characteristics log is associated with first and second state associated with a respective time period as depicted in Fig. 3A and in [ in [0064]: In some embodiments, updating the values of the selected state table records in block 308B involves using the state table in a model of a finite Markov Decision Process (MDP) and using a reinforcement learning technique to approximate solutions for updating the values of the selected state table records [assessing a data set comprising tuples including the first and second action characteristics as the state-action pairs] In some embodiments, a suitable approxi­mation technique for the value function of the model involves temporal difference (TD) [wherein the first and second time period associated with the respective first and second states used to compute their respective time periods as temporal difference as depicted in Fig. 3A] reinforcement learning which may be formulated to take advantage of the so-called eligibility trace…; captured as a tuple table log elements of observation data records using a state table Q-learning technique, in [0064]…reinforcement learning technique known as Q-Leaning. In some embodiments, the block 308 process of updating the values of the selected state table records involves application of a rein­forcement learning process known as State-Action-Reward­State-Action ( or SARSA) Learning. In other embodiments, a Monte Carlo method may be used in the block 308 process of updating the values of the selected state table records. 
)

Regarding claim 8, the rejection of claim 7 is incorporated and Cal in combination with Tom, Ant, and Zhu further teaches the method of claim 7,
wherein the characteristics associated with the page in the first time period comprise a duration since the last time prior to the first time period the first user managed the page.  (Cal teaches the characteristics associated with the resource page accessed by the user (i.e. managed) as captured as a sequence recorded in the table as the time stamp as depicted in Fig. 3A, in [0052]-[0053] As discussed above, when a user 142 is interacting with resources from repositories 150, action interface 120 may keep a log of the user's actions in relation to the infor­mation resources. FIG. 3A is a schematic example of a user action log 250 suitable for use by action interface 120 accord­ing to a particular embodiment. It will be appreciated that there are many users who may be interacting with learning system 100 at any given time. User action log 250 of the FIG. 3A embodiment is sorted by a particular user ID field-i.e. user action log 250 shown in FIG. 3A represents the actions of a particular user having user ID=x. Learning system 100 may create a similar user action log for each user 142.  [0053] In the FIG. 3A embodiment, each row (e.g. each record) of user action log 250 represents one action and comprises four fields: user ID, state, action, start time stamp and end time stamp. The state field represents actions that the user has done prior to the action the user is currently perform­ing…)

Regarding claim 9, the rejection of claim 7 is incorporated and Cal in combination with Tom, Ant, and Zhu further teaches the method of claim 7,
wherein the measure of activity performed by the first user in the first time period is associated with whether the first user managed the page after the first notification was sent. (Cal teaches the measure of activity as the values of the state table records in Fig. 3A, with a time stamp that is the first time period associated with the user interacting with (i.e. managing) the page after the recommendation or assessment is sent to the user, in [0057]: As discussed above, in the illustrated embodiment, state table 275 includes at least one value field 281 which may represent the value that system 100 associates with performing a corresponding action, given a corresponding state. For example, the third record 285 of state table 275 has a value field 281 which indicates if a user had interacted with resource items with resource IDs={1, 4, 7} (corresponding to the state field 277 of record 285) the next action of interacting with assessment A4 (corresponding to the action field 279 of record 285) [wherein the measure of activity performed by the first user in the first time period is associated with whether the first user managed the page after the first notification was sent, by computing the value after the assessment notification has been sent to and interacted with (i.e. managed) by the first user] has a value of 0.63. Value field 281 may comprise a numerical metric, such that value fields 281 of particular state table records may be easily compared to one another. In the case of the FIG. 4A example, given a state 277 corresponding to a user having interacted with resource IDs={1, 7}, system 100 considers there to be relatively more value in the next action being interacting with resource ID=11 (value=0.99) than interacting with resource ID=3 (value=0.72).)
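
For illustration only, the value-field comparison quoted from Cal [0057] can be sketched as a lookup over state-action records. This is a minimal sketch and not Cal's disclosed implementation: the names STATE_TABLE and best_next_action are hypothetical, the dictionary layout is an assumption, and the resource ID that is garbled in the quotation is assumed to read 11.

```python
# Hypothetical state table keyed by (state, action). The three values are the
# ones quoted from Cal [0057]; the data layout itself is an assumption.
STATE_TABLE = {
    (frozenset({1, 4, 7}), "A4"): 0.63,
    (frozenset({1, 7}), 11): 0.99,  # resource ID assumed to read 11
    (frozenset({1, 7}), 3): 0.72,
}


def best_next_action(state, table=STATE_TABLE):
    """Return the (action, value) pair with the highest value for `state`."""
    candidates = [
        (action, value) for (s, action), value in table.items() if s == state
    ]
    if not candidates:
        return None  # no record exists for this state
    return max(candidates, key=lambda pair: pair[1])


print(best_next_action(frozenset({1, 7})))
```

Because the value fields are plain numbers, comparing two candidate next actions for the same state reduces to taking a maximum, which matches the quoted statement that the value fields may be easily compared to one another.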

Regarding claim 10, the rejection of claim 7 is incorporated and Cal in combination with Tom, Ant, and Zhu further teaches the method of claim 7,
wherein the measure of activity performed by the first user in the first time period is associated with an average number of times the page is managed in the first time period by each administrator of the page. (Cal teaches capturing the measure of activity performed by the first user in the action log table, associated with an average number of times the page resource is accessed, as the weighted average for each logged action managed by the user as an accessed resource in the first user’s log table in the first time period captured by the time stamps as depicted in Fig. 3A, in [0084]: If the block 458 inquiry is negative (i.e. there are no block 454 path records having action entries among the actions of the block 354 target state), then method 450 proceeds to block 460. Block 460 involves setting aside the block 454 set of path records and generating a weighted average of the values for the state table records (within the block 353 subset) having an action in the block 354 target state [the measure of activity performed by the first user in the first time period is associated with an average number of times the page is managed in the first time period by each administrator of the page]…)

Regarding claim 11, the rejection of claim 1 is incorporated and Cal in combination with Tom, Ant, and Zhu further teaches the method of claim 1,
wherein the second state is different from the first state. (Cal teaches the second and first states as different states, captured as tuple table log elements of observation data records using a log table and a Q-learning sequential learning process using state-action-reward-state-action, including different first and second states captured in the log table as depicted in Fig. 3A, in [0064]…reinforcement learning technique known as Q-Learning. In some embodiments, the block 308 process of updating the values of the selected state table records involves application of a reinforcement learning process known as State-Action-Reward-State-Action (or SARSA) Learning [each tuple being associated with a first user and comprising a first action, a first state, a first reward, a second action, and a second state]. In other embodiments, a Monte Carlo method may be used in the block 308 process of updating the values of the selected state table records; wherein a first and second time period associated with the respective first and second states are used to compute the temporal difference as depicted in Fig. 3A.)

Regarding claim 12, the rejection of claim 1 is incorporated and Cal in combination with Tom, Ant, and Zhu further teaches the method of claim 1,
wherein the second action is different from the first action. (Cal teaches the second and first actions as different actions, captured as tuple table log elements of observation data records using a log table and a Q-learning sequential learning process using state-action-reward-state-action, including different first and second actions captured in the log table as depicted in Fig. 3A, in [0064]…reinforcement learning technique known as Q-Learning. In some embodiments, the block 308 process of updating the values of the selected state table records involves application of a reinforcement learning process known as State-Action-Reward-State-Action (or SARSA) Learning [each tuple being associated with a first user and comprising a first action, a first state, a first reward, a second action, and a second state]. In other embodiments, a Monte Carlo method may be used in the block 308 process of updating the values of the selected state table records; wherein a first and second time period associated with the respective first and second states are used to compute the temporal difference as depicted in Fig. 3A.)
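
For illustration only, the State-Action-Reward-State-Action sequence quoted from Cal [0064] can be sketched as a five-element record together with the standard textbook SARSA temporal-difference update. This is a minimal sketch and not Cal's disclosed implementation: the names SarsaTuple and sarsa_update, the action labels "send" and "no_send", and the values of alpha and gamma are all hypothetical.

```python
from collections import defaultdict
from typing import Hashable, NamedTuple


class SarsaTuple(NamedTuple):
    """One (s, a, r, s', a') record; field names are illustrative only."""
    state: Hashable
    action: Hashable
    reward: float
    next_state: Hashable
    next_action: Hashable


def sarsa_update(q, t, alpha=0.5, gamma=0.9):
    """Textbook SARSA step: Q(s,a) += alpha * (r + gamma*Q(s',a') - Q(s,a))."""
    td_target = t.reward + gamma * q[(t.next_state, t.next_action)]
    q[(t.state, t.action)] += alpha * (td_target - q[(t.state, t.action)])
    return q[(t.state, t.action)]


# Hypothetical example: the second state and second action both differ
# from the first state and first action.
q_table = defaultdict(float)
q_table[("s2", "no_send")] = 1.0
step = SarsaTuple("s1", "send", 1.0, "s2", "no_send")
new_value = sarsa_update(q_table, step)
print(new_value)
```

The update uses the action actually taken in the next state, which is why the record carries a second action in addition to the second state.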

Regarding independent claim 13 limitations, Cal teaches one or more computer-readable non-transitory storage media embodying software that is operable when executed to: (Cal [0115]: Certain embodiments may be implemented as a computer program product that may include instructions stored on a machine-readable medium. These instructions may be used to program a general-purpose or special-purpose processor to perform the described operations. A machine-readable medium includes any mechanism for storing information in a form (for example, software, processing application) readable by a machine (for example, a computer). The machine-readable medium may include, but is not limited to, magnetic storage medium (for example, floppy diskette); optical storage medium (for example, CD-ROM); magneto-optical storage medium; read-only memory (ROM); random-access memory (RAM); erasable programmable memory (for example, EPROM and EEPROM); flash memory; or another type of medium suitable for storing electronic instructions.)
Claim limitations are similar to claim 1 limitations and are rejected under the same rationale. 

Regarding claim 14, the rejection of claim 13 is incorporated.
Claim limitations are similar to claim 2 limitations and are rejected under the same rationale. 

Regarding independent claim 17 limitations, Cal teaches a system comprising: one or more processors; and one or more computer-readable non-transitory storage media coupled to one or more of the processors and comprising instructions operable when executed by one or more of the processors to cause the system to: (Cal [0115]: Certain embodiments may be implemented as a computer program product that may include instructions stored on a machine-readable medium. These instructions may be used to program a general-purpose or special-purpose processor to perform the described operations. A machine-readable medium includes any mechanism for storing information in a form (for example, software, processing application) readable by a machine (for example, a computer). The machine-readable medium may include, but is not limited to, magnetic storage medium (for example, floppy diskette); optical storage medium (for example, CD-ROM); magneto-optical storage medium; read-only memory (ROM); random-access memory (RAM); erasable programmable memory (for example, EPROM and EEPROM); flash memory; or another type of medium suitable for storing electronic instructions.)
Claim limitations are similar to claim 1 limitations and are rejected under the same rationale. 

Regarding claim 18, the rejection of claim 17 is incorporated.
Claim limitations are similar to claim 2 limitations and are rejected under the same rationale. 

Regarding claim 19, the rejection of claim 17 is incorporated.
Claim limitations are similar to claim 3 limitations and are rejected under the same rationale.

Regarding claim 20, the rejection of claim 19 is incorporated.
Claim limitations are similar to claim 4 limitations and are rejected under the same rationale.

Regarding claims 21, 23, and 25, the rejection of claims 1, 13, and 17 is incorporated respectively. Cal in combination with Tom, Ant, and Zhu teaches the limitation: wherein generating the first reward estimate or the second reward estimate is further based on a particular state of the target user in the particular time period. (The claimed first estimate is the disclosed R based on the state-action of the target user at the time period before the reward was estimated using reinforcement learning, as depicted in Fig. 3A, in [0064]…reinforcement learning technique known as Q-Learning. In some embodiments, the block 308 process of updating the values of the selected state table records involves application of a reinforcement learning process known as State-Action-Reward-State-Action (or SARSA) Learning [each tuple being associated with a first user and comprising a first action, a first state, a first reward, a second action, and a second state]. In other embodiments, a Monte Carlo method may be used in the block 308 process of updating the values of the selected state table records; and the state-action pair associated with the target user of the modeled group of users as depicted in Fig. 5D, for the target user as the personalized user in the modeled group of users, in [0028]: … In particular embodiments, reinforcement learning techniques may use feedback to ascribe, or otherwise determine, one or more values for state-action pairs. State-action pairs together with their corresponding values may be maintained in a state table. Such a state table may be used as a basis for providing information about recommended actions to a variety of users. Information provided (possibly including recommended actions) may be personalized for individual users and/or groups of users…)

Regarding claims 22, 24, and 26, the rejection of claims 21, 23, and 25 is incorporated respectively. Cal in combination with Tom, Ant, and Zhu teaches the limitation: wherein the particular state remains the same when generating the first reward estimate and the second reward estimate. (Q-learning reinforcement learning as the claimed SARSA, as a sequence of model events having the claimed same state when generating the claimed first and second rewards, in [0064]: In some embodiments, updating the values of the selected state table records in block 308B involves using the state table in a model of a finite Markov Decision Process (MDP) and using a reinforcement learning technique to approximate solutions for updating the values of the selected state table records… the selected state table records involves application of a reinforcement learning technique known as Q-Learning. In some embodiments, the block 308 process of updating the values of the selected state table records involves application of a reinforcement learning process known as State [claimed same state]-Action-Reward [claimed first reward]-State-Action [action associated with claimed second reward estimate proceeding the same state] (or SARSA) Learning.; and where the second reward estimate is the weighted values associated with a preceding action of a prior state-action pair, in [0083]: Method 450 then proceeds to block 458 which involves an inquiry into whether any of the block 454 set of path records have corresponding actions that are in the block 354 target state (see FIG. 5B). As discussed above, the block 354 target state comprises one or more action entries. If any of these action entries of the block 354 target state correspond to the action field of the block 454 set of path records, then the block 458 inquiry is positive. Otherwise, the block 458 inquiry is negative… If the block 458 inquiry is negative (i.e. there are no block 454 path records having action entries among the actions of the block 354 target state) [a second-action representation of not sending the particular notification to the target user in the particular time period], then method 450 proceeds to block 460. Block 460 involves setting aside the block 454 set of path records and generating a weighted average [claimed second reward estimate representing a second measure of the target user engaging in the one or more predetermined activities at the future time period based on at least a second-action representation of not sending the particular notification to the target user in the particular time period] of the values for the state table records (within the block 353 subset) having an action in the block 354 target state…)
Additionally, Ant teaches the sequential learning process associated with Q-learning, (in Sec. 2: Background: In RL, the learner’s environment is modelled as a Markov Decision Process (MDP). An MDP specifies a set of states and a set of actions. At each step in time the process is in some state, and an action must be chosen; this action has the effect of changing the current state and producing a scalar reinforcement value. The reinforcement value, or reward, represents the extent to which we can consider the action to have had immediately desirable or undesirable consequences. Formally, an MDP is described by a 4-tuple; And including a first estimated reward
[Image: media_image7.png]
; And a second estimated reward, as r π, received at a future time step t and computing the discount criterion as the claimed threshold criterion, in pg. 2: Left Col. …The goal of RL methods is to arrive, by performing actions and observing their outcomes, at a policy which maximizes the rewards accumulated over time.
Q-learning [17] is the most widely used and studied method for RL and, like most, it uses discounted value as its criterion of optimality [claimed threshold criterion]. The discounted value, or discounted return, of a policy in a state is defined as the expected value
[Image: media_image8.png]
)
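
For illustration only, the discounted return referred to in the Ant quotation can be sketched for a single finite reward sequence. Ant's own formula appears in this action only as an image, so the sketch below assumes the standard textbook definition (the sum of gamma**t times r_t) and omits the expectation over trajectories; the function name discounted_return and the sample rewards are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Sum gamma**t * r_t over a finite reward sequence r_0, r_1, ..."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    total = 0.0
    # Accumulate from the last reward backwards: G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical sample: three unit rewards with a discount factor of 0.5.
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```

A discount factor below 1 weights rewards received at later time steps less heavily than immediate rewards.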

Alternatively, claims 1, 13, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Collison et al. (US Pub. No. 2012/0310961, hereinafter ‘Cal’) in view of Schlau et al. (US Pub. No. 2017/0140269, hereinafter ‘Tom’), in further view of Schwartz (NPL: “A reinforcement learning method for maximizing undiscounted rewards”, hereinafter ‘Ant’), in further view of Stenudd (NPL: “Using machine learning in the adaptive control of a smart environment”, hereinafter ‘Sten’).

Regarding independent claim 1 limitations, Cal teaches a method comprising:
by a computing system, accessing a data set comprising … for training a recurrent machine-learning model, each … for training the recurrent machine-learning model being associated with a first user and comprising a (1) first action, a (2) first state, a (3) first reward, a (4) second action, and a (5) second state, wherein a first time period is associated with the first state, the first action, and the first reward, wherein a second time period following the first time period is associated with the second state and the second action, (using reinforcement learning machine learning techniques using SARSA as the claimed tuples, in [0064]: In some embodiments, updating the values of the selected state table records in block 308B [i.e. accessing a data set comprising … for training a recurrent machine-learning model, each … for training the recurrent machine-learning model being associated with a first user and comprising a (1) first action, a (2) first state, a (3) first reward, a (4) second action, and a (5) second state] involves using the state table in a model of a finite Markov Decision Process (MDP) and using a reinforcement learning technique to approximate solutions for updating the values of the selected state table records. In some embodiments, a suitable approximation technique for the value function of the model involves temporal difference (TD) reinforcement learning [i.e. a recurrent machine-learning model, each … for training the recurrent machine-learning model being associated with a first user and comprising a (1) first action, a (2) first state, a (3) first reward, a (4) second action, and a (5) second state, wherein a first time period is associated with the first state, the first action, and the first reward, wherein a second time period following the first time period is associated with the second state and the second action] which may be formulated to take advantage of the so-called eligibility trace … In some embodiments, the block 308 process of updating the values of the selected state table records involves application of a reinforcement learning technique known as Q-Learning [i.e. comprising a (1) first action, a (2) first state, a (3) first reward, a (4) second action, and a (5) second state]. In some embodiments, the block 308 process of updating the values of the selected state table records involves application of a reinforcement learning process known as State-Action-Reward-State-Action (or SARSA) [claimed tuples: comprising a (1) first action, a (2) first state, a (3) first reward, a (4) second action, and a (5) second state]…; Cal teaches the model table data as the accessed data set comprising tuples using machine learning techniques and reinforcement learning techniques by the computing system depicted in Fig. 1,
[Image: media_image2.png]
in [0064]: In some embodiments, updating the values of the selected state table records in block 308B involves using the state table in a model of a finite Markov Decision Process (MDP) and using a reinforcement learning technique to approximate solutions for updating the values of the selected state table records [for training the recurrent machine-learning model being associated with a first user and comprising a (1) first action, a (2) first state, a (3) first reward, a (4) second action, and a (5) second state, wherein a first time period is associated with the first state, the first action, and the first reward, wherein a second time period following the first time period is associated with the second state and the second action] In some embodiments, a suitable approximation technique for the value function of the model involves temporal difference (TD) [wherein a first and second time period associated with the respective first and second states are used to compute the temporal difference as depicted in Fig. 3A] reinforcement learning which may be formulated to take advantage of the so-called eligibility trace…; captured as tuple table log elements of observation data records using a state table Q-learning technique, in [0064]…reinforcement learning technique known as Q-Learning. In some embodiments, the block 308 process of updating the values of the selected state table records involves application of a reinforcement learning process known as State-Action-Reward-State-Action (or SARSA) Learning [each tuple being associated with a first user]…; wherein a first and second time period associated with the respective first and second states are used to compute the temporal difference as depicted in Fig. 3A:
[Image: media_image3.png]
where a first and second time period associated with the respective first and second states are captured by the time stamp fields (start and end) as the recited first and second time periods as depicted in Fig. 3A, in [0047]: Action interface 120 may track the actions of users 142 in relation to the resources in repositories 150 using an action log which may be stored in action database 124… In some embodiments, time stamp field(s) could comprise a single time stamp field indicating that the user accessed the information resource at a particular time or for a particular duration…)
wherein the (1) first action represents the computing system sending or not sending a first notification to the first user in the first time period, (Cal teaches the action representing that a user notification was sent, as captured user interaction actions with the online resources associated with an assessment, or user actions interacting with a generated recommendation, as the first action that represents whether a first notification was sent to the first user in the first time period as depicted in Figs. 3A and 5B, in [0028]: The set of accessible information resources may be referred as a state space and information about current position of a user in the state-space (e.g. a history of the information resources with which the user has interacted) may be referred to as the user's state. To move from one state to another within the state-space, a user interacts with an information resource. Such interaction of the user with an information resource may be referred to as an action… In particular embodiments, reinforcement learning techniques may use feedback to ascribe, or otherwise determine, one or more values for state-action pairs. State-action pairs together with their corresponding values may be maintained in a state table. Such a state table may be used as a basis for providing information about recommended actions to a variety of users. Information provided (possibly including recommended actions) may be personalized for individual users and/or groups of users [i.e. wherein the (1) first action represents the computing system sending or not sending a first notification to the first user in the first time period, as the personalized recommended action captured in the user table]….
[0029] … In the illustrated embodiment, information resource repositories include the internet 150A, one or more general purpose information resource databases 150B and information resources which may be accessed from a learning management system 150C…; where the information in the repository on interactive user resources may be pre-organized, that is, displayed as sent notifications to the user as instructions, questions, directives, or interactive resources, in [0031]: Information resource repositories 150 may hold a wide variety of information resources having a corresponding wide variety of forms. By way of non-limiting example, information resources can comprise textual resources, audio resources, image-based resources, video resources, interactive resources, questions, assessments [wherein the first action represents the computing system sending or not sending a first notification to the first user in the first time period], executable applications, instructions or directives on how to access and/or use other resources, discussion posts or forums, instructor notes, hints, blogs, any combinations or sub-combinations of these types of resources and/or the like. In general, learning system 100 can accommodate any form of informational resource. In some types of repositories 150 (such as database 150B or learning management system 150C), information resources may be pre-organized or otherwise mapped or classified in some manner within the repository prior to being made accessible to learning system 100 [pre-organized, that is, displayed as sent notifications to the user as instructions, questions, directives, interactive resources]…; including assessment and initial recommendations associated with the first user’s monitored first action as depicted in Fig. 5B:
[Image: media_image4.png]
; Cal teaches the first state represents the movement in the state space associated with the user's first action as depicted in Fig. 3A, in [0028]: The set of accessible information resources may be referred as a state space and information about current position of a user in the state-space (e.g. a history of the information resources with which the user has interacted) may be referred to as the user's state [the user state that represents a characteristic associated with the first user as indicated by the user ID (including first and second action data captured in the user’s action log)]. To move from one state to another within the state-space, a user interacts with an information resource [wherein the user log table captures the state representing the move characteristics of the user, which include the activity sequences in the first state that represents characteristics associated with the first user in the first time period, as the characterized sequence of user activities associated with the state log of the first user, i.e. wherein the (1) first action represents the computing system sending or not sending a first notification to the first user in the first time period]. Such interaction of the user with an information resource may be referred to as an action… In particular embodiments, reinforcement learning techniques may use feedback to ascribe, or otherwise determine, one or more values for state-action pairs. State-action pairs together with their corresponding values may be maintained in a state table. Such a state table may be used as a basis for providing information about recommended actions to a variety of users. Information provided (possibly including recommended actions) may be personalized for individual users and/or groups of users.; Examiner notes that the table documents the actions of sending and not sending a notification based on the noted resource or assessment notification ID for an action state, as shown in the exemplar table entry depicted in Fig. 4A)
wherein the (2) first state represents characteristics associated with the first user in the first time period, wherein the (3) first reward represents a measure of activity performed by the first user in the first time period, (Cal teaches the reinforcement learning having the first reward that represents a measure of activity performed by the first user in the claimed first time period in the selected user state table records as depicted in Fig. 3, in [0064]: In some embodiments, updating the values of the selected state table records in block 308B involves using the state table in a model of a finite Markov Decision Process (MDP) and using a reinforcement learning technique to approximate solutions for updating the values of the selected state table records [capturing the measure of activity performed by the first user associated with the data record as depicted in Fig. 3A; i.e. wherein the (2) first state represents characteristics associated with the first user in the first time period, wherein the (3) first reward represents a measure of activity performed by the first user in the first time period] In some embodiments, a suitable approximation technique for the value function of the model involves temporal difference (TD) [activity in the first time period as the temporal differences in the user activity time stamps in the user action log table as depicted in Fig. 3A; i.e. wherein the (2) first state represents characteristics associated with the first user in the first time period, wherein the (3) first reward represents a measure of activity performed by the first user in the first time period]…the block 308 process of updating the values of the selected state table records involves application of a reinforcement learning technique known as Q-Learning [wherein the (2) first state represents characteristics associated with the first user in the first time period, wherein the (3) first reward represents a measure of activity performed by the first user in the first time period]. In some embodiments, the block 308 process of updating the values of the selected state table records involves application of a reinforcement learning process known as State-Action-Reward-State-Action (or SARSA) Learning [each tuple being associated with a first user and comprising a first action, a first state, a first reward that represents the measure of activity performed by the first user in the first time period]; where the reward represents the measure of activity as the values associated with state-action pairs of the first user as disclosed in [0027]: …Feedback may comprise one or more feedback metrics which may be related to a user's interaction with the information resources. In particular embodiments, feedback comprises an assessment that comprises a feedback metric (or metrics). In particular embodiments, reinforcement learning techniques use these feedback metrics to ascribe, or otherwise determine, one or more values for an action or a series of actions taken by a user in connection with the information resources. Such values may be used as estimates of the value of the same action or series of actions for other users… [0057] As discussed above, in the illustrated embodiment, state table 275 includes at least one value field 281 which may represent the value that system 100 associates with performing a corresponding action, given a corresponding state. For example, the third record 285 of state table 275 has a value field 281 which indicates if a user had interacted with resource items with resource IDs={1, 4, 7} (corresponding to the state field 277 of record 285), the next action of interacting with assessment A4 (corresponding to the action field 279 of record 285) has a value of 0.63. Value field 281 may comprise a numerical metric, such that value fields 281 of particular state table records may be easily compared to one another. In the case of the FIG. 4A example, given a state 277 corresponding to a user having interacted with resource IDs={1, 7}, system 100 considers there to be relatively more value in the next action being interacting with resource ID=11 (value=0.99) than interacting with resource ID=3 (value=0.72).)
wherein the (4) second action represents the computing system sending or not sending a second notification to the first user in the second time period following the first time period, (Cal teaches the second action representing for a second state-action pair in the user log that represents sending a second notification us sent as part of the activity log associated with a resource recommendation, personalized prompts, or assessments captured in the user activity log depicted in Fig. 3A, in [0028]: The set of accessible information resources may be referred as a state space and information about current posi­tion of a user in the state-space (e.g. a history of the informa­tion resources with which the user has interacted) may be referred to as the user's state [the user state that represents a characteristic associated with the first user as indicated by the user ID (including a first and second action data capturing the users sequential activity as stat-action pairs in a respective time period captured by the time stamp data in the user’s action log), i.e. wherein the (4) second action represents the computing system sending or not sending a second notification to the first user in the second time period following the first time period,]. To move from one state to another within the state-space, a user interacts with an infor­mation resource. Such interaction of the user with an infor­mation resource may be referred to as an action… In particular, embodiments, reinforcement learning techniques may use feedback to ascribe, or otherwise determine, one or more values for state­action pairs. State-action pairs together with their corre­sponding values may be maintained in a state table. Such a state table may be used as a basis for providing information about recommended actions to a variety of users. 
Information provided (possibly including recommended actions) may be personalized for individual users and/or groups of users [the user state that represents a characteristic associated with the first user as indicated by the user ID (including a first and second action data capturing the users sequential activity as stat-action pairs in a respective time period captured by the time stamp data in the user’s action log)]…” capturing the plurality of user activities including a first and second actions associated with a notification sent to the user as disclosed in [0057] As discussed above, in the illustrated embodiment, state table 275 includes at least one value field 281 which may represent the value that system 100 associates with perform­ing a corresponding action, given a corresponding state. For example, the third record 285 of state table 275 has a value field 281 which indicates if a user had interacted with resource items with resource IDs={ 1, 4, 7} ( corresponding to the state field 277 of record 285), the next action of interacting with assessment A4 ( corresponding to the action field 279 of record 285) has a value of0.63 [capturing the plurality of user activities including a first and second actions associated with a notification sent to the user as part of the respective sequence of action log entries]. Value field 281 may comprise a numerical metric, such that value fields 281 of particular state table records may be easily compared to one another. In the case of the FIG. 4A example, given a state 277 corre­sponding to a user having interacted with resource IDs={ 1, 7}, system 100 considers there to be relatively more value in the next action being interacting with resource ID=ll (value=0.99) than interacting with resource ID=3 (value=0. 72). 
Examiner notes that the table documents the actions of sending and not sending a notification based on the noted resource or assessment notification ID for an action state, as shown in the exemplary table entry depicted in Fig. 4A [i.e. wherein the (4) second action represents the computing system sending or not sending a second notification to the first user in the second time period following the first time period,])
and wherein the (5) second state represents characteristics associated with the first user in the second time period following the first time period; (Cal teaches the second state represents the movement in the state space associated with the user associated second action log entry captured in the user action log depicted in Fig. 3A, in [0028]: The set of accessible information resources may be referred as a state space and information about current posi­tion of a user in the state-space (e.g. a history of the informa­tion resources with which the user has interacted) may be referred to as the user's state [the user state that represents a characteristic associated with the first user as indicated by the user ID (including a first and second action data captured in the user’s action log);i.e. and wherein the (5) second state represents characteristics associated with the first user in the second time period following the first time period; ]. To move from one state to another within the state-space, a user interacts with an infor­mation resource. Such interaction of the user with an infor­mation resource may be referred to as an action… In particular, embodiments, reinforcement learning techniques may use feedback to ascribe, or otherwise determine, one or more values for state­action pairs. State-action pairs together with their corre­sponding values may be maintained in a state table [i.e. and wherein the (5) second state represents characteristics associated with the first user in the second time period following the first time period]. Such a state table may be used as a basis for providing information about recommended actions to a variety of users. Information provided (possibly including recommended actions) may be personalized for individual users and/or groups of users.; Examiner notes that claimed time periods is captured as part of the reinforcement learning techniques, in [0064])
by the computing system, training the recurrent machine-learning model using the data set, wherein the recurrent machine-learning model is configured to take as inputs a state and an action and predict a cumulative reward over a plurality of time periods, wherein the state and the action are associated with a user and a time period, (Cal teaches training a model configured to take inputs as depicted in Figs. 2 & 3 using the learning management system depicted in Fig. 1, in [0064] In some embodiments, updating the values of the selected state table records in block 308B involves using the state table in a model of a finite Markov Decision Process (MDP) and using a reinforcement learning technique to approximate solutions for updating the values of the selected state table records [computing system training a recurrent machine learning model corresponding to the state table model of a finite Markov Decision Process using reinforcement learning techniques]. In some embodiments, a suitable approximation technique for the value function of the model involves temporal difference (TD) reinforcement learning [i.e. by the computing system, training the recurrent machine-learning model using the data set, wherein the recurrent machine-learning model is configured to take as inputs a state and an action and predict a cumulative reward over a plurality of time periods, wherein the state and the action are associated with a user and a time period] which may be formulated to take advantage of the so-called eligibility trace λ. The eligibility trace λ may be a number between [0, 1] which may be used to weight the relevance of past steps (e.g. past states) to a current outcome (e.g. feedback), [the cumulative reward values depicted in Fig. 4A as a current outcome]. 
Where λ=0, only the most recent state-action pair in the user action log would be updated based on a combination of its existing value and the new value determined by the feedback metric of the block 306 feedback-generating action [the predicted cumulative reward values depicted in Fig. 4A as an updated outcome from the trained prediction model over the plurality of time periods associated with the existing values]. Where λ=1, all of the preceding state-action pairs in the user action log would be updated based on corresponding combinations of their existing values [wherein the recurrent machine learning model takes as inputs the state, the action, and the reward in the user activity data table in Fig. 3A to predict the cumulative reward values depicted in Fig. 4A as a current outcome] and the new value determined by the feedback metric of the block 306 feedback-generating action…; where the rewards are captured as the numerical values associated with the sequence of user actions as cumulative rewards as disclosed, in [0057] As discussed above, in the illustrated embodiment, state table 275 includes at least one value field 281 which may represent the value that system 100 associates with performing a corresponding action, given a corresponding state [i.e. by the computing system, training the recurrent machine-learning model using the data set, wherein the recurrent machine-learning model is configured to take as inputs a state and an action and predict a cumulative reward over a plurality of time periods, wherein the state and the action are associated with a user and a time period]. For example, the third record 285 of state table 275 has a value field 281 which indicates if a user had interacted with resource items with resource IDs={1, 4, 7} (corresponding to the state field 277 of record 285), the next action of interacting with assessment A4 (corresponding to the action field 279 of record 285) has a value of 0.63. 
Value field 281 may comprise a numerical metric, such that value fields 281 of particular state table records may be easily compared to one another. In the case of the FIG. 4A example, given a state 277 corresponding to a user having interacted with resource IDs={1, 7}, system 100 considers there to be relatively more value in the next action being interacting with resource ID=11 (value=0.99) than interacting with resource ID=3 (value=0.72) [capturing the cumulative rewards associated with the sequence of actions].)
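For illustration only, the effect of the eligibility trace quoted from [0064] can be sketched as follows. This is a simplified sketch under stated assumptions and is not Cal's implementation: the function names, the step_size parameter, and the choice to weight the pair k steps back in the log by lam**k are hypothetical and introduced here. With lam=0 only the most recent logged pair is updated, and with lam=1 every preceding pair is updated in full.

```python
def trace_weights(num_steps, lam):
    """Return one weight per logged state-action pair, most recent first.

    Assumption: the pair k steps back is weighted by lam**k.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    # Python evaluates 0.0 ** 0 as 1.0, so lam=0 keeps only the newest pair.
    return [lam ** k for k in range(num_steps)]


def update_values(values, feedback, lam, step_size=0.5):
    """Blend each existing value with the new feedback metric.

    `values` holds the existing values of the logged pairs, most recent first.
    `step_size` is a hypothetical blending rate, not a parameter from Cal.
    """
    weights = trace_weights(len(values), lam)
    return [v + step_size * w * (feedback - v) for v, w in zip(values, weights)]


if __name__ == "__main__":
    logged_values = [0.5, 0.5, 0.5]
    print(update_values(logged_values, feedback=1.0, lam=0.0))
    print(update_values(logged_values, feedback=1.0, lam=1.0))
```

The first call changes only the newest entry; the second moves every entry toward the feedback value by the same amount.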
wherein the cumulative reward is recurrently defined based on at least a reward associated with the user and the time period and an application of the recurrent machine-learning model to a subsequent state and a subsequent action associated with the user and a subsequent time period relative to the time period, (Cal teaches the cumulative reward as the updated feedback metric that is defined based on the reward associated with the user and time period as the state and action value function for estimating state-action values, as recited in [0056]-[0057]: State field 277 represents a state of a user prior to the action 279 of the current record. State field 277 may comprise references to the resource IDs of particular resource items with which users may interact… Action field 279 represents a next action. As with the action field of user log 250 (FIG. 3A), action field 279 of state table 275 includes possible actions which correspond to the information resource types (e.g. general information resources, assessments and questions) being used by resource interface 112 of learning system 100 together with a resource ID reference… As discussed above, in the illustrated embodiment, state table 275 includes at least one value field 281 which may represent the value that system 100 associates with performing a corresponding action, given a corresponding state [the cumulative reward is recurrently defined based on at least a reward associated with the user]…, used to update predicted feedback outcome metric values by the learning model as recited in [0064] In some embodiments, updating the values of the selected state table records in block 308B involves using the state table in a model of a finite Markov Decision Process (MDP) and using a reinforcement learning technique [the cumulative reward is defined based on at least an application of the recurrent machine-learning model to a subsequent state and a subsequent action associated with the user] to approximate 
solutions for updating the values of the selected state table records. In some embodiments, a suitable approximation technique for the value function of the model involves temporal difference (TD) reinforcement learning [wherein the time period is captured as the temporal difference associated with the reward associated with the user and an application of the recurrent machine-learning model to a subsequent state and a subsequent action associated with the user and a subsequent time period relative to the time period using the reinforcement learning technique; i.e. wherein the cumulative reward is recurrently defined based on at least a reward associated with the user and the time period and an application of the recurrent machine-learning model to a subsequent state and a subsequent action associated with the user and a subsequent time period relative to the time period] which may be formulated to take advantage of the so-called eligibility trace λ… Where λ=1, all of the preceding state-action pairs in the user action log would be updated based on corresponding combinations of their existing values and the new value determined by the feedback metric of the block 306 feedback-generating action. In some embodiments, the block 308 process of updating the values of the selected state table records involves application of a reinforcement learning technique known as Q-Learning [i.e. wherein the cumulative reward is recurrently defined based on at least a reward associated with the user and the time period and an application of the recurrent machine-learning model to a subsequent state and a subsequent action associated with the user and a subsequent time period relative to the time period]. In some embodiments, the block 308 process of updating the values of the selected state table records involves application of a reinforcement learning process known as State-Action-Reward-State-Action (or SARSA) Learning [i.e. 
wherein the cumulative reward is recurrently defined based on at least a reward associated with the user and the time period and an application of the recurrent machine-learning model to a subsequent state and a subsequent action associated with the user and a subsequent time period relative to the time period using the application of the reinforcement learning process known as State-Action-Reward-State-Action]. In other embodiments, a Monte Carlo method may be used in the block 308 process of updating the values of the selected state table records.)
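For reference, the Q-Learning and SARSA techniques named in [0064] are commonly written as the tabular update rules sketched below. These are the general textbook forms and are not taken from Cal, Tom, or Ant; the function names, the learning rate alpha, and the discount factor gamma are assumptions introduced here. The sketch shows that the SARSA target uses the value of the next state together with the next action actually taken, whereas the Q-Learning target uses the highest value available in the next state.

```python
from collections import defaultdict


def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """Textbook SARSA step on the tuple (s, a, r, s_next, a_next)."""
    # The target applies the current estimate to the next state-action pair.
    target = r + gamma * q[(s_next, a_next)]
    q[(s, a)] += alpha * (target - q[(s, a)])
    return q[(s, a)]


def q_learning_update(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Textbook Q-Learning step on the tuple (s, a, r, s_next)."""
    # The target uses the best available action in the next state.
    target = r + gamma * max(q[(s_next, b)] for b in actions)
    q[(s, a)] += alpha * (target - q[(s, a)])
    return q[(s, a)]


if __name__ == "__main__":
    # Hypothetical states and actions, for illustration only.
    q = defaultdict(float)
    q[("s1", "send")] = 1.0
    q[("s1", "skip")] = 2.0
    print(sarsa_update(q, "s0", "send", 1.0, "s1", "send", alpha=0.5, gamma=0.5))
    print(q_learning_update(q, "s0", "skip", 1.0, "s1", ["send", "skip"], alpha=0.5, gamma=0.5))
```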
by the computing system, determining a particular notification for prompting a recipient of the particular notification to engage in one or more predetermined activities, (claimed particular notification as resources to engage in claimed predetermined activities as learning actives, in 0027: Aspects of the invention provide methods and systems for providing information based on feedback. Feedback may be incorporated into the information provided using reinforcement learning. Information provided by the methods and systems of particular embodiments can comprise information about feedback-driven recommendations for actions in connection with accessible information resources [claimed particular notification for prompting a recipient of the particular notification to engage in one or more predetermined activities]…; information resources as content/messages to the user associated with predetermined activities via a resource repository, in 0029-0033: In the FIG. 1 embodiment, system 100 comprises a learning system and the feedback-driven information retrieved by learning system 100 comprises feedback-driven recommen­dations for user actions in relation to information resources. Such information resources may comprise educational infor­mation or content…. For example, an information repository 150 could be a topical repository 150 on the topic of astronomy, in which case it may accept contribution of information resources from a number of astronomy experts. Information resource repositories 150 described herein are merely representative examples of suitable types of informa­tion repositories 150 and, unless specifically claimed, are not meant to be limiting... Information resource repositories 150 may hold a wide variety of information resources having a corresponding wide variety of forms. 
By way of non-limiting example, information resources can comprise textual resources, audio resources, image-based resources, video resources, interactive resources, questions, assessments, executable applications, instructions or directives on how to access and/or use other resources, discussion posts or forums, instructor notes, hints, blogs, any combinations or sub-combinations of these types of resources and/or the like… Resource interface 112 may pull information resources from repositories 150 and/or repositories 150 may push information resources to resource interface 112.)
by the computing system, generating, using the trained recurrent machine-learning model, a first reward estimate representing a first measure of a target user engaging in the one or more predetermined activities at a future time period occurring after a particular time period based on action representation of sending the particular notification to the target user in the particular time period; (claimed trained model as the model of the user current state space using claimed recurrent machine learning model as the recurrent learning techniques using feedback, in 0028: The set of accessible information resources may be referred as a state space and information about current position of a user in the state-space (e.g. a history of the information resources with which the user has interacted) may be referred to as the user's state. To move from one state to another within the state-space, a user interacts with an information resource. Such interaction of the user with an information resource may be referred to as an action. A current state of a user coupled with an action which will transition the user to a new state may be referred to as a state-action pair. The interaction of states and actions and how an action taken by a user transitions the user from one state to another state may be referred to as a model. In particular embodiments, reinforcement learning techniques may use feedback to ascribe, or otherwise determine, one or more values for state-action pairs….; where the current state space is used to generate subsequent action-state pairs at claimed future time from the current time associated with a captured action-state pair, as the claimed particular period of time based on the represented action depicted in Fig. 
5D for a target user as the personalized user in the modeled group of user, in 0028: … In particular embodiments, reinforcement learning techniques may use feedback to ascribe, or otherwise determine, one or more values for state­action pairs. State-action pairs together with their corre­sponding values may be maintained in a state table. Such a state table may be used as a basis for providing information about recommended actions to a variety of users. Information provided (possibly including recommended actions) may be personalized for individual users and/or groups of users.; and as depicted in Fig. 5D and captured in the state action table for the target user as depicted in Fig. 3A

[Image: media_image5.png]

[Image: media_image6.png]

in 0109: … One technique for providing personalized action recommendations involves the use of the user's current state [claimed particular time period based on action representation of sending the particular notification to the target user in the particular time period], which is reflective of the history of actions of that user in relation to accessible information resources. By way of non-limiting example, the user's current state may be used by learning system 100 in some embodiments to personalize the recommendation blocks (356 and 358) of method 350 (FIG. 5B) and more particularly in connection with the illustrated embodiments of recommendation procedures 400 (FIG. 5D), 450 (FIG. 5E) and 500 (FIG. 5F). Each of these exemplary embodiments of methods for recommending actions may personalize recommended actions by taking into account the user's current state (or action history) when making recommendations.; and where updating the state-action values involves estimating a first reward in the SARSA reinforcement learning technique, in 0064: … In some embodiments, the block 308 process of updating the values of the selected state table records involves application of a reinforcement learning technique known as Q-Learning. In some embodiments, the block 308 process of updating the values of the selected state table records involves application of a reinforcement learning process known as State-Action-Reward-State-Action (or SARSA) Learning…)
by the computing system, generating, using the trained machine-learning model, a second reward estimate representing a second measure of the target user engaging in the one or more predetermined activities at the future time period based on at least a second-action representation of not sending the particular notification to the target user in the particular time period; (claimed second reward estimate as the action value associated with a set of filtered actions for the user to engage with that does not include the claimed particular notification in a second-action that part of the user’s activity path for performing the state-action update associated a second-action representation that takes the user another activity path captured by the filtered list of values if for the set of claimed second reward estimate of the claimed second-action in the list where the score dose not including sending the claimed notification sent to the target user, as depicted in Fig. 5E and Fig. 5F, in 0083: Method 450 then proceeds to block 458 which involves an inquiry into whether any of the block 454 set of path records have corresponding actions that are in the block 354 target state (see FIG. 5B). As discussed above, the block 354 target state comprises one or more action entries. If any of these action entries of the block 354 target state correspond to the action field of the block 454 set of path records, then the block 458 inquiry is positive. Otherwise, the block 458 inquiry is negative… If the block 458 inquiry is negative (i.e. there are no block 454 path records having action entries among the actions of the block 354 target state) [a second-action representation of not sending the particular notification to the target user in the particular time period], then method 450 pro­ceeds to block 460. 
Block 460 involves setting aside the block 454 set of path records and generating a weighted average [claimed second reward estimate representing a second reward estimate representing a second measure of the target user engaging in the one or more predetermined activities at the future time period based on at least a second-action representation of not sending the particular notification to the target user in the particular time period] of the values for the state table records (within the block 353 subset) having an action in the block 354 target state… Once the weighted averages are calculated in block 460, method 450 proceeds to block 462 which involves selecting the action corresponding to the highest block 460 weighted average  [claimed second measure of the target user engaging in the one or more predetermined activities at the future time period based on at least a second-action representation of not sending the particular notification to the target user in the particular time period] to be the next recommended action [i.e. based on at least a second-action representation of not sending the particular notification to the target user in the particular time period]. In the case of the illustrative example set out above, block 462 would involve selecting action=7 (i.e. interact with resource ID=7) to be the next recommended action, since the weighted average for action=7 is greater than the weighted average for action=1...)
by the computing device, comparing a difference between the first reward estimate of the target user engaging in the one or more predetermined activities upon sending of the particular notification and the second reward estimate of the target user engaging in the one or more predetermined activities without sending of the particular notification to a predetermined threshold; (Cal teaches engaging the target user that is associated with the selected predetermined activities upon sending and not sending a particular notification as noted by the resource state ID in the sequence of state and action pairs captured with their respective reward values at the respective observation sequence time interval as depicted in Figs. 4A and 5A-D and the use of the threshold in 0073: Once the highest-value filtered (feedback-generating) record is ascertained in block 374, method 376 proceeds to block 376. Block 376 involves procuring all of the block 372 filtered state table records which have values within a threshold range of the block 374 highest value state table record. In the case of example filtered state table 372A, method 376 involves procuring all of the records having values within a threshold range of the value of record 374A. The particular threshold used in block 376 may be a configurable (e.g. user configurable or system configurable) parameter of learning system 100. FIG. 5C shows a set of filtered and thresholded records 376A [including comparing a difference between the first reward estimate of the target user engaging in the one or more predetermined activities upon sending of the particular notification and the second reward estimate of the target user engaging in the one or more predetermined activities without sending of the particular notification to a predetermined threshold] corresponding to a value threshold of 0.40 from the highest-value record 374A. 
In this exemplary case, the value of highest-value record is 0.99 and the threshold is 0.40, so only records having values greater than 0.99-0.40=0.59 or greater are admitted into the set of filtered and thresholded records 376A [i.e. by the computing device, comparing a difference between the first reward estimate of the target user engaging in the one or more predetermined activities upon sending of the particular notification and the second reward estimate of the target user engaging in the one or more predetermined activities without sending of the particular notification to a predetermined threshold;].)
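The block 376 thresholding quoted from 0073 can be illustrated with the short sketch below. It is offered only as an aid to reading the quoted passage and is not Cal's implementation: the function name, the (action, value) record layout, and the sample records are hypothetical, and the inclusive comparison at the cutoff is an assumption made because the quoted text reads "0.59 or greater".

```python
def within_threshold(records, threshold):
    """Keep records whose value lies within `threshold` of the highest value.

    `records` is a list of (action, value) pairs (a hypothetical layout).
    """
    if not records:
        return []
    highest = max(value for _, value in records)
    cutoff = highest - threshold
    # Assumption: a record exactly at the cutoff is admitted.
    return [(action, value) for action, value in records if value >= cutoff]


if __name__ == "__main__":
    # Illustrative values only; the 0.50 record is invented to show an exclusion.
    sample = [("resource 11", 0.99), ("resource 3", 0.72),
              ("assessment A4", 0.63), ("resource 9", 0.50)]
    for action, value in within_threshold(sample, threshold=0.40):
        print(action, value)
```

With a highest value of 0.99 and a threshold of 0.40, the cutoff is approximately 0.59, so the hypothetical 0.50 record is not admitted.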
and by the computing device, sending the particular notification to the target user based on the difference between the first reward estimate of the target user engaging in the one or more predetermined activities upon sending of the particular notification and the second reward estimate of the target user engaging in the one or more predetermined activities without sending of the particular notification being greater than the predetermined threshold. (Cal teaches engaging the target user that is associated with the selected predetermined activities upon sending and not sending a particular notification as noted by the resource state ID in the sequence of state and action pairs captured with their respective reward values at the respective observation sequence time interval as depicted in Figs. 4A and 5A-D and the use of the threshold in 0073: Once the highest-value filtered (feedback-generating) record is ascertained in block 374, method 376 proceeds to block 376. Block 376 involves procuring all of the block 372 filtered state table records which have values within a threshold range of the block 374 highest value state table record. In the case of example filtered state table 372A, method 376 involves procuring all of the records having values within a threshold range of the value of record 374A. The particular threshold used in block 376 may be a configurable (e.g. user configurable or system configurable) parameter of learning system 100. FIG. 5C shows a set of filtered and thresholded records 376A [including i.e. and by the computing device, sending the particular notification to the target user based on the difference between the first reward estimate of the target user engaging in the one or more predetermined activities upon sending of the particular notification and the second reward estimate of the target user engaging in the one or more predetermined activities without sending of the particular notification ...] 
corresponding to a value threshold of 0.40 from the highest-value record 374A. In this exemplary case, the value of highest-value record is 0.99 and the threshold is 0.40, so only records having values greater than 0.99-0.40=0.59 or greater are admitted into the set of filtered and thresholded records 376A [i.e. and by the computing device, sending the particular notification to the target user based on the difference between the first reward estimate of the target user engaging in the one or more predetermined activities upon sending of the particular notification and the second reward estimate of the target user engaging in the one or more predetermined activities without sending of the particular notification being greater than the predetermined threshold.].)
Cal teaches the learning system that trains a model using reinforcement learning techniques based on State, Action, Reward, State, Action sequences, in [0064], which are usually captured as tuples in computing applications, and as depicted in Figs. 5A-E for generating notifications to the user using feedback information, in [0076]-[0081]. Cal does not expressly disclose the data set as a set of tuples.
Tom does expressly disclose the data set as a set of tuples as recited in the claim 1 limitation:
accessing a data set comprising tuples, (Tom teaches accessing data as observation tuples characterizing a state corresponding to a performed action by an agent under observation and reward in response to a performed action, in [0011] In some implementations, each piece of experience data is an experience tuple [i.e. accessing a data set comprising tuples] that includes a respective current observation characterizing a respective current state of the environment, a respective current action performed by the agent in response to the current observation, a respective next state characterizing a respective next state of the environment, and a reward received in response to the agent performing the current action)
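The four elements of the experience tuple quoted from Tom [0011] can be laid out as a simple data structure, sketched below for illustration. The class name, field names, and sample values are hypothetical and are not taken from Tom; only the four listed elements come from the quoted passage.

```python
from typing import Any, NamedTuple


class ExperienceTuple(NamedTuple):
    """The four elements listed in the quoted passage of Tom [0011]."""
    current_observation: Any  # characterizes the current state of the environment
    current_action: Any       # action performed in response to the current observation
    next_state: Any           # characterizes the next state of the environment
    reward: float             # reward received for performing the current action


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    example = ExperienceTuple("state_0", "action_0", "state_1", 1.0)
    print(example._fields)
```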
training a recurrent machine-learning model using the data set, (Tom teaches where the observations are used to train the reinforcement learning model using the training engine, in [0030]-[0032] The reinforcement learning system 100 selects actions to be performed by a reinforcement learning agent 102 interacting with an environment 104. That is, the reinforcement learning system 100 receives observations, with each observation characterizing a respective state of the environment 104, and, in response to each observation, selects an action from a predetermined set of actions to be performed by the reinforcement learning agent 102 in response to the observation [training a recurrent machine-learning model using the data set]. In response to some or all of the actions performed by the agent 102, the reinforcement learning system 100 receives a reward. Each reward is a numeric value received from the environment 104 as a consequence of the agent performing an action, i.e., the reward will be different depending on the state that the environment 104 transitions into as a result of the agent 102 performing the action. In particular, the reinforcement learn­ing system 100 selects actions to be performed by the agent 102 using an action selection neural network 110 and a training engine 120. 
The action selection neural network 110 is a neural network that receives as an input an observation about the state of the environment 104 and generates as an output a respective Q value for each action, i.e., a prediction of expected return resulting from the agent 102 performing the action in response to the observation [a cumulative reward over a plurality of time periods in the observation] To allow the agent 102 to effectively interact with the environment 104, the reinforcement learning system 100 includes a training engine 120 that trains the action selection neural network 110 to determine trained values of the parameters of the action selection neural network 110.; where the reinforcement learning model training the model to make the prediction over a plurality of time periods as the temporal difference learning error, in [0039]-[0041]: … The system also maintains (in the replay memory or in a separate storage component) an expected learning progress measure for some or all of the pieces of experience data. An expected learning progress measure associated with a piece of experience data is a measure of the expected amount of progress made in the training of the neural network if the neural network is trained using the piece of experience data… the system determines an expected learning progress measure associated with an experience tuple based on a previously calculated temporal difference error for the experience tuple, i.e., the temporal difference error from the preceding time the experience tuple was used in training the neural network… the expected learning progress measure is an absolute value of the temporal difference learning error determined for the experience tuple the preceding time the experience tuple was used in training the neural network. 
In some implementations, the expected learning progress measure is derivative of an absolute value of a temporal difference learning error determined for the experience tuple the preceding time the experience tuple was used in training the neural network.)
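
For illustration only, the cited mechanism of selecting experience tuples according to the absolute value of their previously computed temporal difference error can be sketched as follows. This is a minimal sketch under stated assumptions and is not the implementation of Tom: a small look-up table stands in for the action selection neural network, and the names `Experience`, `td_error`, `sample_by_priority`, and `train_step`, as well as the initial priority of 1.0 and the values of `alpha` and `gamma`, are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Experience:
    # Experience tuple: current state, current action, reward, next state.
    state: int
    action: int
    reward: float
    next_state: int
    # TD error from the preceding time this tuple was used (assumed initial value).
    td_error: float = 1.0


def td_error(q, exp, gamma):
    """One-step TD error: r + gamma * max_a' Q(s', a') - Q(s, a)."""
    target = exp.reward + gamma * max(q[exp.next_state])
    return target - q[exp.state][exp.action]


def sample_by_priority(memory, rng):
    """Pick one tuple with probability proportional to |previous TD error|."""
    weights = [abs(e.td_error) + 1e-6 for e in memory]
    return rng.choices(memory, weights=weights, k=1)[0]


def train_step(q, memory, rng, alpha=0.5, gamma=0.9):
    """Sample a tuple, update the Q table (stand-in for the network), store the new TD error."""
    exp = sample_by_priority(memory, rng)
    delta = td_error(q, exp, gamma)
    q[exp.state][exp.action] += alpha * delta
    exp.td_error = delta
    return exp, delta
```

In this sketch a tuple whose last temporal difference error was large is more likely to be selected again, which corresponds to treating the absolute error as the expected learning progress measure.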

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to integrate the method for observing interaction learning prediction using reinforcement learning techniques as disclosed by Tom with the method of information processing using machine learning techniques as disclosed by Cal.
One of ordinary skill in the art would have been motivated to integrate the methods disclosed by Cal and Tom in order to enable reinforcement learning systems to respond to received observations using neural networks to predict output from a received input (Tom, [0002]-[0005]); doing so will realize the following advantages: “training data from a replay memory [that] can be selected in a way that increases the value of the selected data for training a neural network. This can, in turn, increase the speed of training of neural networks used in selecting actions to be performed by agents and reduce the amount of training data needed to effectively train those neural networks. Thus, the amount of computing resources necessary for the training of the neural networks can be reduced” (Tom, [0019]).
	While Cal and Tom disclose the reinforcement learning techniques for modeling and learning sequential patterns using a Q-learning method by accounting for the temporal differences to determine a threshold, the references do not expressly describe this inherent feature of the Q-learning process, i.e., accounting for the claimed temporal differences.
	Ant discloses the use of temporal differences as part of the Q-learning process (in Sec. 2: Background: In RL, the learner’s environment is modelled as a Markov Decision Process (MDP). An MDP specifies a set of states and a set of actions. At each step in time the process is in some state, and an action must be chosen; this action has the effect of changing the current state and producing a scalar reinforcement value. The reinforcement value, or reward, represents the extent to which we can consider the action to have had immediately desirable or undesirable consequences. Formally, an MDP is described by a 4-tuple; and including a first estimated reward
[image: media_image7.png]
; and a second estimated reward, as r π, received at a future time step t, and computing the discount criterion as the claimed threshold criterion, in pg. 2: Left Col. …The goal of RL methods is to arrive, by performing actions and observing their outcomes, at a policy which maximizes the rewards accumulated over time.
Q-learning [17] is the most widely used and studied method for RL and, like most, it uses discounted value as its criterion of optimality [claimed threshold criterion]. The discounted value, or discounted return, of a policy in a state is defined as the expected value
[image: media_image8.png]


[image: media_image9.png]

)
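
For illustration only, the discounted return referred to in the Ant passage can be computed for a finite reward sequence as sketched below. The sketch assumes the standard discounted-sum definition (each reward at step t weighted by gamma raised to the power t); the function name `discounted_return` and the parameter name `gamma` are hypothetical and are not taken from Ant.

```python
def discounted_return(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t] for a finite reward sequence."""
    total = 0.0
    # Accumulate from the last reward backwards: G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total
```

For example, three rewards of 1.0 with a discount factor of 0.5 give 1.0 + 0.5 + 0.25 = 1.75, so later rewards contribute less to the criterion than earlier ones.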

It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to integrate the method of using a Q-learning optimization criterion for reinforcement learning as disclosed by Ant with the method of information processing using machine learning techniques as collectively disclosed by Cal and Tom.
One of ordinary skill in the art would have been motivated to integrate the methods disclosed by Ant, Cal, and Tom in order to enable reinforcement learning techniques for accomplishing a process of learning to estimate the rewards that performing an action may reap over the course of time, so as to choose actions maximizing that measure (Ant, Sec. I: Introduction); doing so will help boost performance of the reinforcement learning algorithm (Ant, Sec. I: Introduction).
Cal, in combination with Tom and Ant, teaches the process for using reinforcement learning to compare differences in observations based on Q-learning state-action reward sequences for generating messages/content/notifications to users in a computing environment as elements used in reinforcement learning techniques.
	Sten teaches reinforcement learning techniques as learning using a look-up table of state-action pairs, as disclosed by the Cal reference, in Pg. 31: Sec. 3.6: In reinforcement learning (RL), a learning agent doesn’t have a training set of correct actions but must determine them using a reward that it gets from the environment. The agent can observe the state of the environment and has a set of actions that alter the state. After every action the agent gets a reward which can be negative, positive or zero. The agent must learn a policy to achieve its goal, which can be for example to maximize cumulative rewards… In a Markov Decision Process (MDP) an agent has a set A of actions and can perceive a set S of states. At each point of time t the agent perceives the current state s_t and performs the action … The agent should

The Cal, Tom, Ant, and Sten references would have been recognized by those of ordinary skill in the art as useful for applicant’s purpose in developing information processing and retrieval methods/systems using reinforcement machine learning techniques.
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to integrate the method for observing interaction learning prediction using reinforcement learning techniques as disclosed by Sten with the method of information processing using machine learning techniques as disclosed by Cal, Tom, and Ant.
One of ordinary skill in the art would have been motivated to integrate the methods disclosed by Sten, Cal, Tom, and Ant in order to enable a data analysis agent to use observations, including the state of the environment and a set of actions that alter the state, to learn a policy to achieve its goal, which can be for example to maximize cumulative reward; doing so provides an intelligent recommendation method for using reinforcement learning algorithms that are guaranteed to find the optimal value when choosing a strategy based on observations of state-action pairs (Sten, pg. 33: 1st para).
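
For illustration only, learning with a look-up table of state-action pairs, as characterized above with respect to Sten, can be sketched with tabular Q-learning. This is a minimal sketch under stated assumptions and is not the method of any cited reference: the three-state chain environment, the uniformly random behavior policy, and the names `step`, `q_learning`, and `greedy_policy` are hypothetical, as are the values of `alpha` and `gamma`.

```python
import random

N_STATES, LEFT, RIGHT, GOAL = 3, 0, 1, 2


def step(state, action):
    """Hypothetical toy chain: move left or right; reward 1.0 on reaching GOAL."""
    next_state = min(state + 1, GOAL) if action == RIGHT else max(state - 1, 0)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward


def q_learning(episodes=200, alpha=0.5, gamma=0.9, seed=0):
    """Fill a look-up table Q[state][action] from observed transitions."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        while state != GOAL:
            # Random exploration; Q-learning is off-policy, so it still learns greedy values.
            action = rng.choice([LEFT, RIGHT])
            next_state, reward = step(state, action)
            target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


def greedy_policy(q):
    """Best action per non-terminal state according to the table."""
    return [max(range(len(row)), key=row.__getitem__) for row in q[:GOAL]]
```

In this sketch the agent never sees a set of correct actions; it only observes states and rewards, and the table entries converge toward the discounted cumulative reward of moving toward the goal state.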
Regarding claims 13 and 17, the claim limitations are similar to claim 1 limitations and are therefore rejected under the same rationale.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure as listed below:
Hausknecht et al. (NPL: “Deep Recurrent Q-Learning for Partially Observable MDPs”): teaches the combination as a recurrent machine-learning model (in Pg. 30: Sec. Partial Observability: Partially Observable Markov Decision Process (POMDP) better captures the dynamics of many real-world environments by explicitly acknowledging that the sensations received by the agent are only partial glimpses of the underlying system state. Formally a POMDP can be described as a 6-tuple (S, A, P, R, Ω, O). S, A, P, R are the states, actions, transitions, and rewards as before, except… This observation is generated from the underlying system state according to the probability distribution o ∼ O(s). Vanilla Deep Q-Learning has no explicit mechanisms for deciphering the underlying state of the POMDP and is only effective if the observations are reflective of underlying system states. In the general case, estimating a Q-value from an observation can be arbitrarily bad since Q(o, a|θ) ≠ Q(s, a|θ). Our experiments show that adding recurrency to Deep Q-Learning allows the Q-network to better estimate the underlying system state, narrowing the gap between Q(o, a|θ) and Q(s, a|θ). Stated differently, recurrent deep Q-networks can better approximate actual Q-values from sequences of observations, leading to better policies in partially observed environments). It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention
Volodymyr et al. (US Pub. No. 2017/0140270): teaches the use of Q-learning and reinforcement learning in which observations and selected actions are based on predetermined criteria using a predetermined threshold. For example, the criteria can be the same as the criteria described above with reference to the worker determining updated parameter values using the accumulated gradients (step 216) and writes the updated parameter values to the shared memory, but with the specified threshold value or the specified number being greater than the value or number used for the determination of whether to update the parameters stored in the shared memory. Thus, the worker synchronizes the values of the parameters of the target network less frequently than the worker updates the parameter values stored in the shared memory, so that the target network and the Q network will often have different parameter values.
Gilad-Barach et al. (US Pub. No. 2015/0140527): teaches the recurrent machine learning model that is trained (Fig. 1: 108) to generate notifications to the user as interventions for the target user to interact with, as depicted in Fig. 1, in [0036]: To begin with, the computer system 102 includes a model generating module 104 that is configured to generate a model 106 based on training data maintained in a data store 108. The model generating module 104 may use any machine learning technique to generate the model 106, such as a technique selected from the domain of reinforcement learning. In a yet more particular implementation, the model 106 that is produced represents the task of selecting interventions as a contextual multi-arm bandit problem (to be described in greater detail in Subsection A.3
Chen et al. (US Pub. No. 2017/0286860): teaches the computing system for processing information using reinforcement learning in networked environments.

Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Michael Huntley, can be reached at (303) 297-4307.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/O.O.A./Examiner, Art Unit 2126    
                                                                                                                                                                                                      
                                                                                                                                                                                                                                                                                                                                                                                                       
/MICHAEL J HUNTLEY/Supervisory Patent Examiner, Art Unit 2129