DETAILED ACTION
This action is in response to the arguments filed 24 October 2022 and the applicant-initiated interview held on 6 December 2022 for application 16/933361, filed on 20 July 2020. Claims 1-20 are currently pending. It is noted that proposed claim amendments were filed with the arguments and discussed at the interview; these amendments are not examined in the current Office action because they were not entered.

Notice of Pre-AIA  or AIA  Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments
Applicant’s arguments with respect to claims 1-20 have been considered but are moot because they are directed only to the proposed amendments discussed in the 6 December 2022 interview, which have not been entered and therefore are not being examined herein.

Claim Objections
Claim 17 is objected to because of the following informalities:  Claim 17 does not end in a period;  each claim must begin with a capital letter and end with a period. 
Appropriate correction is required.




Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claim 8 is rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA  35 U.S.C. 112, the applicant), regards as the invention.
Claim 8 recites the limitation "the embedding layer" in line 3.  There is insufficient antecedent basis for this limitation in the claim.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1-20 are rejected under 35 U.S.C. 101 because the claims are directed to an abstract idea and because the claims as a whole, considering all claim elements both individually and in combination, do not amount to significantly more than the abstract idea. See Alice Corporation Pty. Ltd. v. CLS Bank International, et al., 573 U.S. (2014). In determining whether the claims are subject matter eligible, the Examiner applies the 2019 USPTO Patent Eligibility Guidelines (2019 Revised Patent Subject Matter Eligibility Guidance, 84 Fed. Reg. 50, Jan. 7, 2019).
Step 1: Is the claim to a process, machine, manufacture, or composition of matter? Yes: claim 1 recites a method, which is a process. Claims 13 and 19 recite a machine/device/system and a product, respectively.
Step 2A, prong one: Does claim 1 recite an abstract idea, law of nature or natural phenomenon? Yes: the limitations of “generating, …, for each event of the sequence of events, a vector representation for the event, the vector representation for the event including a representation of the event type for the event and a representation of the time information for the event,” and “generating, … and based upon the vector representations for the sequence of events, a prediction of a next event to occur after the sequence of events …, wherein the prediction of the next event includes a predicted event type for the next event and a predicted time indicative of when the next event will occur,” as drafted, are mathematical steps for forming a vector representation of information contained in a temporal sequence of events for computing a prediction. In addition, the limitations of “generating, … a clustering result, … wherein the predicted event type is selected from the set of event types,” and “wherein the clustering result comprises information resulting from clustering the sequence of events into a plurality of clusters,” as drafted, are mental steps for selecting an event type (an act of judgement, observation, or evaluation) and for clustering/assigning an event to a cluster (an act of judgement or evaluation).
Step 2A, prong two: Does the claim recite additional elements that integrate the judicial exception into a practical application? No: the judicial exception is not integrated into a practical application. Although the claim recites elements that include “computer-implemented method” and “implemented using instructions executed by one or more computer systems,” each of the computer, systems, and instructions is recited at a high level of generality such that it amounts to no more than mere instructions to apply the exception using a generic computer component. Although the claim recites the additional functionality “providing information for a sequence of events as input to a …, the information for the sequence of events including, for each event, information identifying an event type for the event and time information for the event indicative of when the event occurred, wherein the event type is selected from a set of event types,” this functionality consists of mere data gathering steps of receiving a dataset and functions performed by generic computing resources that are recited at a high level of generality; it does not impose a meaningful limit on the judicial exception and does not integrate the mental steps into a practical application (see MPEP 2106.05(g)). The “neural network” is also recited at a high level of generality and merely generally links the exception to a technological environment (neural networks); it therefore likewise amounts to no more than mere instructions to apply the exception using generic computer components and is insufficient to integrate the mental and mathematical steps into a practical application.
Step 2B: Does the claim recite additional elements that amount to significantly more than the judicial exception? No: the recitation in the preamble is insufficient to transform a judicial exception into a patentable invention because the preamble elements are recited at a high level of generality that simply links to a field of use, see MPEP 2106.05(h). Likewise, the “neural network” is recited at a high level of generality that simply links to a field of use, and the claimed extra-solution activity of data gathering (i.e., providing sequential event, event type, and time information) is acknowledged to be well-understood, routine, conventional activity (see, e.g., court-recognized WURC examples in MPEP 2106.05(d)(II)(i)). The claim thus recites computing components only at a high level of generality such that it amounts to no more than mere instructions to apply the exception using generic computer components. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept.
Taken alone, the additional elements do not amount to significantly more than the above-identified judicial exception (the abstract idea). Looking at the limitations as an ordered combination adds nothing that is not already present when looking at the elements taken individually. There is no indication that the combination of elements improves the functioning of a computer or improves any other technology. Their collective functions merely provide conventional computer implementation.
For the reasons above, claim 1 is rejected as being directed to non-patentable subject matter under §101. This rejection applies equally to independent claims 13 and 19, which recite a system and a computer product, respectively, to perform the mental and mathematical steps. It is noted that claims 13 and 19 recite the following additional elements, which are recited at a high level of generality such that they amount to no more than mere instructions to apply the exception using a generic computer component (Step 2A, Prong 2) and which are insufficient to transform a judicial exception into a patentable invention because the recited elements are considered insignificant extra-solution activity, see MPEP 2106.05(g): processors/processing devices (claims 13 and 19), computer-readable medium (claims 13 and 19), and program code with instructions (claims 13 and 19).
As to dependent claims (2-12), (14-18), and 20, which depend from claims 1, 13, and 19, respectively, claims 2, 4, 6-10, 14, 16, 18, and 20 recite additional limitations that fall under Step 2A, prong one as mathematical steps as follows:
Claims 2, 14, and 20: “training … using a plurality of loss functions, the plurality of loss functions including: at least one loss function directed to predicting an event type for an event to occur after a sequence of events; at least one loss function directed to predicting a time of occurrence for the event to occur after the sequence of events; and at least one loss function directed to clustering the sequence of events.” (training a model (parameter computation) using loss functions associated with prediction of events or time and with clustering)
Claims 4 and 16: “for at least one event in the sequence of events, converting the time information for the at least one event to a time gap information indicating a length of time between occurrence of the at least one event and occurrence of an event adjacent to the at least one event in the sequence of events and occurring before the at least one event” (mathematical step of converting a sequence of time points to a sequence of time-gaps)
Claims 6 and 18: “wherein generating the prediction of the next event and the clustering result comprises: using, …, a hidden state ….” (mathematical step of using a hidden state of a model to predict an event and perform clustering)
Claim 7: “wherein the clustering result comprises, for each event in the sequence of events, a cluster affinity distribution to one or more of a set of clusters.” (mathematical step of determining a distribution over a set of outcomes/clusters)
Claim 8: “wherein generating, for each event of the sequence, the vector representation comprises: encoding, using an embedding matrix of the embedding layer, each of the sequence of events into a first dimensional space to generate the vector representation; and embedding, using the embedding layer, the representation of time information into a second dimensional space using an embedding weight matrix and a logarithmic transformation function.” (mathematical step of encoding event sequence information using embedding matrices and a logarithmic transformation function)
Claim 9: “wherein generating, for each event of the sequence, the vector representation comprises: generating, …, a first representation based upon the event type for the event; generating, …, a second representation based upon the time information for the event; and generating, …, the vector representation based upon the first representation and the second representation.” (mathematical step of forming vector representation for temporal point sequence using separate temporal and event vector representations)
Claim 10: “wherein generating, for each event of the sequence of events, the vector representation comprises generating the vector representation using a first set of one or more layers …; generating the prediction of the next event and the clustering result comprises using a second set of one or more layers …, wherein the second set of layers is different from the first set of layers.” (mathematical step of forming vector representation for temporal point sequence and clustering using distinct layers in a model)
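For illustration only, the time-gap conversion characterized above for claims 4 and 16 can be sketched as simple arithmetic on a list of timestamps. The function name `to_time_gaps` and the choice of a 0.0 gap for the first event (which has no preceding event) are assumptions of this sketch and are not taken from the claims or the cited art.

```python
from typing import List


def to_time_gaps(timestamps: List[float]) -> List[float]:
    """Convert non-decreasing event timestamps into gaps from the preceding event.

    Assumption for this sketch: the first event has no predecessor,
    so its gap is reported as 0.0.
    """
    if not timestamps:
        return []
    # Reject sequences that are not ordered in time.
    if any(later < earlier for earlier, later in zip(timestamps, timestamps[1:])):
        raise ValueError("timestamps must be non-decreasing")
    # Gap for event j is t_j - t_{j-1}.
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    return [0.0] + gaps


print(to_time_gaps([1.0, 4.0, 9.0]))  # [0.0, 3.0, 5.0]
```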
Moreover, dependent claims (2, 3, 5, 6, 9, 10, 11, 12), (14-18), and 20, which depend from claims 1, 13, and 19, respectively, recite additional elements to be addressed at Step 2A, Prong 2 and at Step 2B as follows: the claim elements “wherein the time information for each event in the sequence of events comprises a time stamp indicative of a time of occurrence of the event” (claims 3 and 15) and “for at least one event in the sequence of events, the time information for the at least one event specifies a time gap information indicating a length of time between occurrence of the at least one event and occurrence of an event adjacent to the at least one event in the sequence of events and occurring before the at least one event” (claims 5 and 17) recite more details on the data in the data gathering steps (text elements, documents, voice dictation, email, feeds that form received information) that amount to no more than mere instructions to apply the exception using a generic computer component, do not integrate the mental and mathematical steps into a practical application (Step 2A, Prong 2), and also do not impose a meaningful limit on the judicial exception (see MPEP 2106.05(d)(II)(i)). In addition, the claim elements “program code” (claims 14, 16, 20), “processors” (claims 14, 16, 20), and “instructions” (claims 14, 16) recite generic computer and processing resources that are also considered insignificant extra-solution activity (MPEP 2106.05(g)) at Step 2B.
Moreover, the claim elements “neural network” (claims 2, 6, 9, 10, 14, 18, 20), “hidden state of the neural network” (claims 6, 18), “long term short term memory ("LSTM") network” (claim 11, relative to the second set of layers), and “wherein the neural network comprises one or more of a long term short term memory ("LSTM") network, a gated recurrent unit ("GRU") network, a variational recurrent neural network ("VRNN"), or a mixture density network ("MDN")” (claim 12) recite a neural network and various specific neural network variants that perform the mathematical and mental steps of prediction and clustering; these elements are recited at a high level of generality that simply links to a field of use (machine learning using neural networks) and merely links the judicial exception to a particular technological environment (machine learning) using generic computing components, and therefore also do not impose a meaningful limitation on the judicial exception. Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. It is further noted that the training of an RNN using multiple loss functions for prediction of an event and the time to the event, with event sequence clustering using a hidden state of that neural network that represents, over multiple layers of the neural network, event and time embeddings as well as clustering, is well-known and understood (for example, see Wu et al. (“Reinforcement Learning with Policy Mixture Model for Temporal Point Processes Clustering”, https://arxiv.org/abs/1905.12345, arXiv:1905.12345v3 [cs.LG] 28 Jun 2019, pp. 1-15) with respect to ([Abstract, p. 2, Section 1, p. 3, Section 3.2, pp. 3-4, Section 3.1, p. 5, Section 3.3.1] The flexibility of our model lies in: i) all the components are networks including the policy network for modeling the intensity function of temporal point process; ii) to handle varying-length event sequences, we resort to inverse reinforcement learning by decomposing the observed sequence into states (RNN hidden embedding of history) and actions (time interval to next event) in order to learn the reward function, thus achieving better performance or increasing efficiency compared to existing methods using rewards over the entire sequence such as log-likelihood or Wasserstein distance., 1) We present a network based EM framework for TPP clustering, differing from previous work using parametric clustering models [34]. Under the EM scheme, clustering of the entire dataset and model fitting of each cluster are jointly performed rather than separated in two steps [9]., In practical application, h_q is a 3-layer classifier including sequence embedding layer, RNN layer and classification layer as used in [4, 33]., Suppose the latents Y are sampled from an arbitrary valid probability distribution q(Y), then a lower bound F(q, θ) of the marginal log likelihood L(θ; X) can be obtained by Jensen’s inequality as: F(q, θ) = L(θ; X) − D_KL(q||p) (1), which means we have ∀q(Y): L(θ; X) ≥ F(q, θ). Given randomly initialized parameter θ^(0) and arbitrary distribution q^(0), we iteratively update q^(k) and parameter θ^(k) by the following Expectation Maximization procedure: E-step: given model parameter θ^(k), update q^(k) to q^(k+1) by matching q to posterior p(Y | X, θ^(k)); M-step: given q^(k+1), update θ: θ^(k+1) ← θ^(k) + ∇F(q^(k+1), θ) by maximizing F(q^(k+1), θ)., We adopt RNN with stochastic neurons [2] as the policy network. Here action refers to the time to next event from current event timestamp and state refers to the hidden embedding of RNN for the history.)).
In summary, as shown in the analysis above, claims 1-20 do not provide any additional elements that, when considered individually or as an ordered combination, amount to significantly more than the abstract idea identified. Therefore, as a whole, claims 1-20 do not recite what the courts have identified as “significantly more”. In particular, there is no indication that the combination of elements improves the functioning of a computer or improves another technology when the claims are considered individually or as an ordered combination.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Du et al. (“Recurrent Marked Temporal Point Processes: Embedding Event History to Vector”, KDD’16, 2016, pp. 1555-1564), hereinafter referred to as Du, in view of Wu et al. (“Reinforcement Learning with Policy Mixture Model for Temporal Point Processes Clustering”, https://arxiv.org/abs/1905.12345, arXiv:1905.12345v3 [cs.LG] 28 Jun 2019, pp. 1-15), hereinafter referred to as Wu.

In regards to claim 1, Du teaches A computer-implemented method comprising: providing information for a sequence of events as input to a neural network, the information for the sequence of events including, for each event, information identifying an event type for the event and time information for the event indicative of when the event occurred, wherein the event type is selected from a set of event types, and wherein the neural network is implemented using instructions executed by one or more computer systems; (See at [p. 1556, Section 2] The input data is a set of sequences C = {S^1, S^2, ...}. Each S^i = ((t_1^i, y_1^i), (t_2^i, y_2^i), ...) is a sequence of pairs (t_j^i, y_j^i) where t_j^i is the time when the event of type (or marker) y_j^i has occurred to the entity i, and t_j^i < t_{j+1}^i. Depending on specific applications, the entity and the event type can have different meanings. For example, in transportation, S^i can be a trace of time and location pairs for a taxi i where t_j^i is the time when the taxi picks up or drops off customers in the neighborhood y_j^i., Further at [p. 1559, Section 5.2, Figure 8] In our algorithm framework, we need both sparse (the marker y_j) and dense features at time t_j. Meanwhile, the output is also mixed of discrete markers and real-value time, which is then fed into different loss functions including the cross-entropy of the next predicted marker and the negative log-likelihood of the next predicted event timing. Therefore, we build an efficient and flexible platform particularly optimized for training general directed acyclic structured computational graph (DAG). The backend is supported via CUDA and MKL for GPU and CPU platform, respectively., wherein a neural network event prediction framework processes a sequence of events y_k^i which characterize the event type (e.g., that corresponds to an entity or a scenario attribute such as a taxi/entity dropping off a customer at a particular neighborhood), wherein this information also includes temporal information (t_k^i) corresponding to the occurrence of the event and wherein this neural network-based framework is implemented using algorithms/code/instructions to generate testing results on specific datasets (e.g., Figure 8).) generating, by the neural network, for each event of the sequence of events, a vector representation for the event, the vector representation for the event including a representation of the event type for the event and a representation of the time information for the event; (See at [p. 1558, Section 5.1, Figure 2, Figure 3], Our key idea is to let the RNN (or its modern variant LSTM [23], GRU [5], etc.) model the nonlinear dependency over both of the markers and the timings from past events. As shown in Figure 2, for the event occurring at the time t_j of type y_j, the pair (t_j, y_j) is fed as the input into a recurrent neural network unfolded up to the j + 1-th event. The embedding h_{j−1} represents the memory of the influence from the timings and the markers of past events. 
… At the j-th event, the input layer first projects the sparse one-hot vector representation of the marker y_j into a latent space…, wherein embeddings (hidden states) are generated by a (RNN-based) neural network for each event in the temporal sequence of events such that the hidden states are formed from the resultant projection of both the vector representation of the events y into a latent/embedding space (Figure 3) and the projection of the timing also onto a latent/hidden space (of the RNN, with a concomitant additional transformation shown in Figure 3 that projects the embedding space for the event representation into V^y and the time representation into v^t) such that the representation of each respective term in the temporal sequence input is a vector in a respective latent space.) and generating, by the neural network and based upon the vector representations for the sequence of events, a prediction of a next event to occur after the sequence of events …, wherein the prediction of the next event includes a predicted event type for the next event and a predicted time indicative of when the next event will occur, wherein the predicted event type is selected from the set of event types, …. (See at [p. 1557, Section 4], Given the history of past events, we can explicitly specify the conditional density function that the next event will happen at time t with type y as f*(t, y) = f(t, y | H_t) where f*(t, y) emphasizes that this density is conditional on the history., Further at [p. 1558, Section 5.1], Since now h_j represents the influence of the history up to the j-th event, the conditional density for the next event timing can be naturally represented as <equation 8> … As a consequence, we can depend on h_j to make predictions to the timing t̂_{j+1} and the type ŷ_{j+1} of the next event., wherein the (RNN-based) neural network framework predicts both a predicted marker/event (type) as well as a predicted time for that event (based on a conditional distribution derived from the latent space/embedding representation on both the event and time of the event).)
However, Du does not explicitly teach and a clustering result, … wherein the clustering result comprises information resulting from clustering the sequence of events into a plurality of clusters. In other words, Du does not disclose that a clustering operation (such as may be applied to the latent space representation of event and time) occurs as a predicate to the determination of the predicted time and event.
However, Wu, in the analogous art of using neural networks to model temporal point processes, teaches generating, by the neural network, for each event of the sequence of events, a vector representation for the event, … and a clustering result, wherein the prediction of the next event includes a predicted event type for the next event and a predicted time indicative of when the next event will occur, wherein the predicted event type is selected from the set of event types, and wherein the clustering result comprises information resulting from clustering the sequence of events into a plurality of clusters (See at [Abstract], The purpose is to cluster the sequences with different temporal patterns into the underlying policies while learning each of the policy model. …we resort to inverse reinforcement learning by decomposing the observed sequence into states (RNN hidden embedding of history) and actions (time interval to next event) in order to learn the reward function…., Further at [p. 3, Section 3.1], Given a temporal event set X with M observed event sequences: X = {x_1, x_2, ..., x_M} and the discrete latents i.e. cluster labels Y = {y_1, y_2, ..., y_M} for y_i ∈ {1, 2, ..., N}, we suppose that X are generated by a mixture of N experts with a latent policy for each expert, parameterized by θ as a whole., Further at [p. 5, Section 3.2], Therefore, the E-step involves line 4, 5, 6 in Alg. 1. In practical application, h_q is a 3-layer classifier including sequence embedding layer, RNN layer and classification layer as used in [4, 33]., Further at [p. 5, Section 3.3], Given the hidden variable Y estimated by the classifier in E-step, each event sequence x in the training dataset D is classified to a specific policy with discrete hidden variable y_i., Further at [p. 6, Section 3.3.1], So far the RNN policy network with stochastic neurons is able to mimic the event generating mechanism of stochastic temporal point process by Eq. 4. Given a sequence of past events s_t = {t_i}_{t_i<t}, the next event time is generated as t_{i+1} = t_i + a, with the inter-event time a sampled from stochastic policy π_θ(a|s_t) as the action., wherein, for temporal point process sequences, a (RNN-based) neural network framework generates embedding representations (noted as including a generation based on the architecture of Du, which is reference 4 in Wu) for each event in those sequences and clusters the embedding representations over a specified set of clusters (each corresponding to a different time-gap prediction policy learned based on the learned representation of the sequence data) and wherein both this clustering result (which is used to predict a policy) and the embedding representations are used to predict the time of a next event.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Du to incorporate the teachings of Wu to generate, by the neural network, for each event of the sequence of events, a vector representation for the event, the vector representation for the event including a representation of the event type for the event and a representation of the time information for the event; and a clustering result, wherein the prediction of the next event includes a predicted event type for the next event and a predicted time indicative of when the next event will occur, wherein the predicted event type is selected from the set of event types, and wherein the clustering result comprises information resulting from clustering the sequence of events into a plurality of clusters. The modification would have been obvious because one of ordinary skill would have been motivated to improve the efficacy and accuracy of neural network-based temporal point process learning for time-to-event prediction in temporal sequences by jointly learning the clustering of the hidden state representation of those sequences to more effectively associate particular sequences with particular time-gap prediction policies (Wu, [Abstract, pp. 1-2, Section 1, Table 2, Table 3]).

In regards to claim 2, the rejection of claim 1 is incorporated and Du further teaches, prior to providing information for a sequence of events as input to a neural network: training the neural network using a plurality of loss functions, the plurality of loss functions including: at least one loss function directed to predicting an event type for an event to occur after a sequence of events; at least one loss function directed to predicting a time of occurrence for the event to occur after the sequence of events; …. (See at [p. 1559, Section 5.2, Figure 2, Figure 3] Given a collection of sequences C = {S^i}, where S^i = ((t_j^i, y_j^i))_{j=1}^{n_i}, we can learn the model by maximizing the joint log-likelihood of observing C, 
[Equation 14 of Du; equation image (media_image1.png) not reproduced], wherein the RNN-based temporal point process learning model is trained (i.e., undergoes parameter learning prior to evaluation/testing) using the objective function shown in equation 14, which includes a first component associated with the event learning/prediction (first term on RHS of equation 14) and a second component associated with the time (gap) learning/prediction (second term on RHS of equation 14, where d_j corresponds to time gap information as shown in Figure 2).)
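For illustration only, the two-term structure described above (a marker term plus a timing term summed over the events of a sequence) can be sketched as follows. This is a minimal sketch under stated assumptions and is not Du's implementation: the function names are hypothetical, the marker term is a log-softmax over assumed per-event scores, and the timing term uses a plain exponential density as a stand-in for the conditional density that Du derives from the hidden state.

```python
import math
from typing import Iterable, Sequence, Tuple


def marker_log_prob(logits: Sequence[float], marker: int) -> float:
    """Log-softmax probability of the observed marker (event type)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return logits[marker] - log_norm


def gap_log_density(rate: float, gap: float) -> float:
    """Log density of a time gap under an exponential distribution.

    Stand-in assumption: the cited reference uses a different density.
    """
    return math.log(rate) - rate * gap


def joint_log_likelihood(
    steps: Iterable[Tuple[Sequence[float], int, float, float]]
) -> float:
    """Sum the marker term and the timing term over all events.

    Each step is (marker logits, observed marker, rate, observed gap).
    """
    return sum(
        marker_log_prob(logits, marker) + gap_log_density(rate, gap)
        for logits, marker, rate, gap in steps
    )


# One event, two equally likely markers, unit rate, gap of 2.0:
# log(0.5) + (log(1.0) - 2.0)
print(joint_log_likelihood([([0.0, 0.0], 0, 1.0, 2.0)]))
```

Maximizing this sum is equivalent to minimizing a cross-entropy loss on the marker plus a negative log-likelihood loss on the timing, which matches the two loss terms named in the quoted Section 5.2 passage.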
However, Du does not explicitly teach and at least one loss function directed to clustering the sequence of events. In other words, Du does not disclose that a clustering operation (such as may be applied to the latent space representation of event and time) occurs as a predicate to the determination of the predicted time and event (and which is learned jointly with the latent space representation).
However, Wu, in the analogous art of using neural networks to model temporal point processes, teaches prior to providing information for a sequence of events as input to a neural network: training the neural network using a plurality of loss functions, the plurality of loss functions including: at least one loss function directed to predicting an event type for an event to occur after a sequence of events; at least one loss function directed to predicting a time of occurrence for the event to occur after the sequence of events; and at least one loss function directed to clustering the sequence of events. (See at [p. 2, Section 1] Under the EM scheme, clustering of the entire dataset and model fitting of each cluster are jointly performed rather than separated in two steps [9]., Further at [pp. 3-4, Section 3.1] Given a temporal event set X with M observed event sequences: X = {x_1, x_2, ..., x_M} and the discrete latents i.e. cluster labels Y = {y_1, y_2, ..., y_M} for y_i ∈ {1, 2, ..., N}, we suppose that X are generated by a mixture of N experts with a latent policy for each expert, parameterized by θ as a whole. The log likelihood is: L(θ; X, Y) = log p(X, Y | θ),… Suppose the latents Y are sampled from an arbitrary valid probability distribution q(Y), then a lower bound F(q, θ) of the marginal log likelihood L(θ; X) can be obtained by Jensen’s inequality as: F(q, θ) = L(θ; X) − D_KL(q||p) (1), which means we have ∀q(Y): L(θ; X) ≥ F(q, θ). Given randomly initialized parameter θ^(0) and arbitrary distribution q^(0), we iteratively update q^(k) and parameter θ^(k) by the following Expectation Maximization procedure:…, wherein both the neural network embeddings and clustering across those embeddings are jointly learned using an EM optimization technique based on a composite loss function which includes a loss function for observing an output embedding (Y) given input (X, interpreted as including time and event information) but also which includes the KL metric corresponding to a cluster-specific loss function which optimizes the predicted distribution (over hypothesis space/clusters) of generated (event/time) embedding representations relative to the corresponding posterior distribution computed based on model parameters updated in a previous EM step.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Du to incorporate the teachings of Wu to, prior to providing information for a sequence of events as input to a neural network, train the neural network using a plurality of loss functions, the plurality of loss functions including: at least one loss function directed to predicting an event type for an event to occur after a sequence of events; at least one loss function directed to predicting a time of occurrence for the event to occur after the sequence of events; and at least one loss function directed to clustering the sequence of events. The modification would have been obvious because one of ordinary skill would have been motivated to improve the efficacy and accuracy of neural network-based temporal point process learning for time-to-event prediction in temporal sequences by jointly learning the clustering of the hidden state representation and the hidden state representations themselves of those sequences to more effectively associate particular sequences with particular time-gap prediction policies (Wu, [Abstract, pp. 1-2, Section 1, Table 2, Table 3]).
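
For illustration only, the three-part training objective discussed above (an event-type term, a time-of-occurrence term, and a clustering term) can be sketched in Python as follows. This is a sketch under stated assumptions and is not the method of Du, Wu, or the application: the function names, the unweighted sum, and the use of a discrete KL divergence for the clustering term are hypothetical, and the time term simply takes a precomputed log-density value rather than implementing equation 14 of Du.

```python
import math
from typing import Sequence


def kl_divergence(q: Sequence[float], p: Sequence[float]) -> float:
    """Discrete KL(q || p); assumes p_i > 0 wherever q_i > 0."""
    if len(q) != len(p):
        raise ValueError("q and p must have the same length")
    # Terms with q_i == 0 contribute zero and are skipped.
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)


def composite_loss(
    event_probs: Sequence[float],
    true_event: int,
    time_log_density: float,
    q_cluster: Sequence[float],
    p_cluster: Sequence[float],
) -> float:
    """Sum of three illustrative loss terms (unweighted, by assumption)."""
    # Event-type term: negative log-probability of the observed event type.
    event_loss = -math.log(event_probs[true_event])
    # Time term: negative log-density of the observed time (supplied by caller).
    time_loss = -time_log_density
    # Clustering term: divergence between predicted and posterior cluster
    # distributions.
    cluster_loss = kl_divergence(q_cluster, p_cluster)
    return event_loss + time_loss + cluster_loss
```

When the predicted cluster distribution matches the posterior, the clustering term vanishes and only the event and time terms remain.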

In regards to claim 3, the rejection of claim 1 is incorporated and Du further teaches wherein the time information for each event in the sequence of events comprises a time stamp indicative of a time of occurrence of the event.  (See at [p. 1556, Section 2, Figure 2, Figure 3] The input data is a set of sequences C = {S^1, S^2, ...}. Each S^i = ((t^i_1, y^i_1), (t^i_2, y^i_2), ...) is a sequence of pairs (t^i_j, y^i_j) where t^i_j is the time when the event of type (or marker) y^i_j has occurred to the entity i, and t^i_j < t^i_{j+1}. Depending on specific applications, the entity and the event type can have different meanings. For example, in transportation, S^i can be a trace of time and location pairs for a taxi i where t^i_j is the time when the taxi picks up or drops off customers in the neighborhood y^i_j., wherein the time, which is input into the neural network, specifies/corresponds to a particular time for an event (i.e., it is a time stamp).)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Du to incorporate the teachings of Wu for the same reasons as pointed out for claim 1.

In regards to claim 4, the rejection of claim 3 is incorporated and Du further teaches, for at least one event in the sequence of events, converting the time information for the at least one event to a time gap information indicating a length of time between occurrence of the at least one event and occurrence of an event adjacent to the at least one event in the sequence of events and occurring before the at least one event.  (See at [pp. 1558-1559, Section 5.1, Figure 2, Figure 3] In addition, for the timing input t_j, we can extract the associated temporal features t_j (e.g., like the inter-event duration d_j = t_j − t_{j−1}), wherein the time, which is input into the neural network, is converted (at input) to an inter-event time (i.e., a time gap) for an event (as shown, for example, in Figure 2).)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Du to incorporate the teachings of Wu for the same reasons as pointed out for claim 1.
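
As a minimal illustration of the time-stamp-to-time-gap conversion discussed for claims 3 and 4 (the inter-event duration d_j = t_j − t_{j−1} quoted from Du), the following Python sketch may be helpful. The function name `to_time_gaps` and the choice to assign the first event a gap of 0.0 are assumptions made for illustration; neither the claims nor Du specify them.

```python
from typing import List, Optional, Sequence


def to_time_gaps(timestamps: Sequence[float]) -> List[float]:
    """Convert event time stamps t_j into inter-event gaps d_j = t_j - t_{j-1}.

    Assumption (illustrative only): the first event has no predecessor,
    so its gap is set to 0.0.
    """
    gaps: List[float] = []
    previous: Optional[float] = None
    for t in timestamps:
        if previous is None:
            gaps.append(0.0)
        else:
            if t < previous:
                raise ValueError("time stamps must be non-decreasing")
            gaps.append(t - previous)
        previous = t
    return gaps
```

For example, time stamps 1.0, 3.5, and 4.0 yield gaps 0.0, 2.5, and 0.5.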

In regards to claim 5, the rejection of claim 1 is incorporated and Du further teaches, further comprising: for at least one event in the sequence of events, the time information for the at least one event specifies a time gap information indicating a length of time between occurrence of the at least one event and occurrence of an event adjacent to the at least one event in the sequence of events and occurring before the at least one event.  (See at [pp. 1558-1559, Section 5.1, Figure 2, Figure 3] In addition, for the timing input t_j, we can extract the associated temporal features t_j (e.g., like the inter-event duration d_j = t_j − t_{j−1}), wherein the time, which is input into the neural network, includes an inter-event time (i.e., a time gap) that characterizes the time between successive/adjacent events (as shown, for example, in Figure 2).)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Du to incorporate the teachings of Wu for the same reasons as pointed out for claim 1.

In regards to claim 6, the rejection of claim 1 is incorporated and Du further teaches wherein generating the prediction of the next event and the clustering result comprises: using, by the neural network, a hidden state of the neural network. (See at [p. 1558, Section 5.1, Figure 2, Figure 3] Our key idea is to let the RNN (or its modern variant LSTM [23], GRU [5], etc.) model the nonlinear dependency over both of the markers and the timings from past events. As shown in Figure 2, for the event occurring at the time t_j of type y_j, the pair (t_j, y_j) is fed as the input into a recurrent neural network unfolded up to the j + 1-th event. The embedding h_{j−1} represents the memory of the influence from the timings and the markers of past events. … At the j-th event, the input layer first projects the sparse one-hot vector representation of the marker y_j into a latent space…, wherein, as previously pointed out, the time-to-event prediction is based upon a representation of event and time in the embedding space of an RNN corresponding to the states of hidden layers in that neural network framework (see Figures 2, 3).)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Du to incorporate the teachings of Wu for the same reasons as pointed out for claim 1.

In regards to claim 7, the rejection of claim 1 is incorporated and Du does not further teach wherein the clustering result comprises, for each event in the sequence of events, a cluster affinity distribution to one or more of a set of clusters. Du does not disclose a clustering operation (such as may be applied to the latent space representation of event and time).
However, Wu, in the analogous art of using neural networks to model temporal point processes, teaches wherein the clustering result comprises, for each event in the sequence of events, a cluster affinity distribution to one or more of a set of clusters. (See at [p. 5, Section 3.2] As mentioned above, in E-step, we match the hidden variable distribution q to posterior distribution p(Y|X, θ^(k)), and fill in values of latent variables Y for samples in observed data X according q(Y), so that we can re-recompute the expectation of X given θ^(k), i.e., the likelihood function L(θ^(k)). We compute the hidden variable distribution q as q^(k) = arg min_{q∈H_k} KL(q||p^(k)), (2) where we restrict the distribution of hidden variable q^(k) is in a bounded hypothesis space H_k.  For mixture of policies, given parameter θ^(k) and observed data X, the posterior distribution p^(k) is: p(y_ij|x_j, θ^(k)) = p(x_j, y_ij|θ^(k)) / p(x_j|θ^(k)), (3) where y_ij = 1 if and only if x_j is generated by the i-th policy. Inspired by Eq. 3, to find q^(k) in Eq. 2, we train a classifier to fit the current guess of the discrete hidden variable distribution q^(k) to p^(k), i.e., holding the policies parameter θ^(k) fixed, train a classifier h_q by data generated by learned policies.
… In practical application, h_q is a 3-layer classifier including sequence embedding layer, RNN layer and classification layer as used in [4, 33]., wherein the neural network time-to-event prediction framework computes a posterior distribution p^(k) (k denoting the kth update/iteration) that indicates a probability of the association/affinity of a (predicted) event y_ij (corresponding to a jth event) with the ith cluster (time gap prediction policy) such that this is computed for each (jth) event in the sequence of events and wherein, alternatively, the KL divergence in equation 2 also forms an affinity distribution that associates a predicted latent space state (q) with the given posterior distribution for an event over the set of clusters/policies.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Du to incorporate the teachings of Wu for the clustering result to comprise, for each event in the sequence of events, a cluster affinity distribution to one or more of a set of clusters. The modification would have been obvious because one of ordinary skill would have been motivated to improve the efficacy and accuracy of neural network-based temporal point process learning for time-to-event prediction in temporal sequences by jointly learning the clustering of the hidden state representation and the hidden state representations themselves of those sequences to more effectively associate particular sequences with the most likely clustered time-gap prediction policies (Wu, [Abstract, pp. 1-2, Section 1, Table 2, Table 3]).
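
The posterior of Eq. 3 of Wu, as characterized above, amounts to normalizing per-cluster joint probabilities p(x_j, y_ij|θ^(k)) by their sum p(x_j|θ^(k)) to obtain a distribution over clusters. The Python sketch below illustrates only that normalization step. The function name `cluster_affinity` and the use of log-joint values as input are assumptions for illustration and are not taken from Wu or from the claims.

```python
import math
from typing import List, Sequence


def cluster_affinity(log_joint: Sequence[float]) -> List[float]:
    """Normalize log p(x_j, y_ij | theta) over clusters i into p(y_ij | x_j, theta).

    The denominator p(x_j | theta) is the sum of the joint over all clusters.
    """
    if not log_joint:
        raise ValueError("at least one cluster is required")
    # Subtract the maximum before exponentiating for numerical stability;
    # the shift cancels in the ratio.
    peak = max(log_joint)
    weights = [math.exp(v - peak) for v in log_joint]
    total = sum(weights)
    return [w / total for w in weights]
```

The returned values are non-negative and sum to one, so they can be read as an affinity distribution over the set of clusters.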
 
In regards to claim 8, the rejection of claim 1 is incorporated and Du further teaches, for each event of the sequence, the vector representation comprises: encoding, using an embedding matrix of the embedding layer, each of the sequence of events into a first dimensional space to generate the vector representation; and embedding, using the embedding layer, the representation of time information into a second dimensional space using an embedding weight matrix and a logarithmic transformation function.  (See at [p. 1558, Section 5.1, Figure 2, Figure 3, Equations 10 and 11] The embedding h_{j−1} represents the memory of the influence from the timings and the markers of past events. … At the j-th event, the input layer first projects the sparse one-hot vector representation of the marker y_j into a latent space. We add an embedding layer with the weight matrix W_em to achieve a more compact and efficient representation y_j = W_em^T y_j + b_em, where b_em is the bias. We learn W_em and b_em while we train the network., wherein embeddings (hidden states) are generated by a (RNN-based) neural network in which (as shown in Figure 3) each event (marker) in the sequence of event information is embedded using embedding matrix W_em, wherein the timing is likewise transformed into an embedding/latent space using a matrix (W^t, but also W^h) such that time information is represented (ultimately) in a latent space v^t according to the logarithmic function logF*(t_j+1) which is applied to the time of the event and such that the event information is represented both by the embedding y_j and (ultimately) by the latent space representation V^y.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Du to incorporate the teachings of Wu for the same reasons as pointed out for claim 1.

In regards to claim 9, the rejection of claim 1 is incorporated and Du further teaches wherein generating, for each event of the sequence, the vector representation comprises: generating, by the neural network, a first representation based upon the event type for the event; generating, by the neural network, a second representation based upon the time information for the event; and generating, by the neural network, the vector representation based upon the first representation and the second representation.  (See at [p. 1558, Section 5.1, Figure 2, Figure 3, Equations 10 and 11] The embedding h_{j−1} represents the memory of the influence from the timings and the markers of past events. … At the j-th event, the input layer first projects the sparse one-hot vector representation of the marker y_j into a latent space. We add an embedding layer with the weight matrix W_em to achieve a more compact and efficient representation y_j = W_em^T y_j + b_em, where b_em is the bias. We learn W_em and b_em while we train the network., wherein embeddings (hidden states) are generated by a (RNN-based) neural network in which (as shown in Figure 3) each event (marker) in the sequence of event information is embedded (y_j, first representation) using embedding matrix W_em, wherein the timing is likewise transformed into an embedding/latent space using a matrix (W^t, but also W^h) such that both the time information and the event information are (ultimately) represented by the latent space vectors v^t and V^y (second representations) by applying RNN model parameters (weights) to the temporal information.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Du to incorporate the teachings of Wu for the same reasons as pointed out for claim 1.

In regards to claim 10, the rejection of claim 1 is incorporated and Du further teaches wherein: generating, for each event of the sequence of events, the vector representation comprises generating the vector representation using a first set of one or more layers of the neural network; generating the prediction of the next event … comprises using a second set of one or more layers of the neural network, wherein the second set of layers is different from the first set of layers.  (See at [p. 1558, Section 5.1, Figure 2, Figure 3, Equations 10 and 11] The embedding h_{j−1} represents the memory of the influence from the timings and the markers of past events. … At the j-th event, the input layer first projects the sparse one-hot vector representation of the marker y_j into a latent space. We add an embedding layer with the weight matrix W_em to achieve a more compact and efficient representation y_j = W_em^T y_j + b_em, where b_em is the bias. We learn W_em and b_em while we train the network., wherein embeddings (hidden states) are generated by a (RNN-based) neural network in which (as shown in Figure 3) each event (marker) in the sequence of event information is embedded (y_j, first representation) using embedding matrix W_em, wherein the timing is likewise transformed into an embedding/latent space using a matrix (W^t, but also W^h) such that both the time information and the event information are (ultimately) represented by the latent space vectors v^t and V^y (second representations) by applying RNN model parameters (weights) to the temporal information and such that (as shown in Figure 3) the layers in the neural network which generate these disparate latent space/embedding representations are distinct.)
However, Du does not explicitly teach … and the clustering result …. Du does not disclose a clustering operation (such as may be applied to the latent space representation of event and time).
However, Wu, in the analogous art of using neural networks to model temporal point processes, teaches wherein: generating, for each event of the sequence of events, the vector representation comprises generating the vector representation using a first set of one or more layers of the neural network; generating the prediction of the next event and the clustering result comprises using a second set of one or more layers of the neural network, wherein the second set of layers is different from the first set of layers. (See at [p. 5, Section 3.2] Therefore, the E-step involves line 4, 5, 6 in Alg. 1. In practical application, h_q is a 3-layer classifier including sequence embedding layer, RNN layer and classification layer as used in [4, 33]., wherein the neural network time-to-event prediction framework includes multiple layers, including a sequence embedding layer (interpreted as being used to generate an embedding representation of the input event/time information) as well as distinct other layers corresponding to the RNN and classification layers, in which the latter layers associate particular event latent space representations (hidden state outputs of the RNN) with particular policy clusters such that this framework, as previously pointed out, specifically uses this architecture for time-to-event prediction (according to the event-specific classified cluster/policy).)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Du to incorporate the teachings of Wu to generate, for each event of the sequence of events, the vector representation using a first set of one or more layers of the neural network, and to generate the prediction of the next event and the clustering result using a second set of one or more layers of the neural network, wherein the second set of layers is different from the first set of layers. The modification would have been obvious because one of ordinary skill would have been motivated to improve the efficacy and accuracy of neural network-based temporal point process learning for time-to-event prediction in temporal sequences by jointly learning the clustering of the hidden state representation and the hidden state representations themselves of those sequences to more effectively associate particular sequences with the most likely classified clustered time-gap prediction policies (Wu, [Abstract, pp. 1-2, Section 1, Table 2, Table 3]).

In regards to claim 11, the rejection of claim 10 is incorporated and Du further teaches wherein the second set of layers correspond to a long term short term memory ("LSTM") network. (See at [p. 1558, Section 5.1, Figure 2, Figure 3] Our key idea is to let the RNN (or its modern variant LSTM [23], GRU [5], etc.) model the nonlinear dependency over both of the markers and the timings from past events. … The embedding h_{j−1} represents the memory of the influence from the timings and the markers of past events. … At the j-th event, the input layer first projects the sparse one-hot vector representation of the marker y_j into a latent space. We add an embedding layer with the weight matrix W_em to achieve a more compact and efficient representation y_j = W_em^T y_j + b_em, where b_em is the bias. We learn W_em and b_em while we train the network., wherein an RNN (LSTM or GRU) is used to (ultimately) represent the time information and the event information by the latent space vectors v^t and V^y (second representations) by applying RNN/LSTM/GRU model parameters (weights) to the temporal information (as shown in Figure 3).)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Du to incorporate the teachings of Wu for the same reasons as pointed out for claim 10.

In regards to claim 12, the rejection of claim 11 is incorporated and Du further teaches wherein the neural network comprises one or more of a long term short term memory ("LSTM") network, a gated recurrent unit ("GRU") network, a variational recurrent neural network ("VRNN"), or a mixture density network ("MDN").  (See at [p. 1558, Section 5.1, Figure 2, Figure 3] Our key idea is to let the RNN (or its modern variant LSTM [23], GRU [5], etc.) model the nonlinear dependency over both of the markers and the timings from past events. … The embedding h_{j−1} represents the memory of the influence from the timings and the markers of past events. … At the j-th event, the input layer first projects the sparse one-hot vector representation of the marker y_j into a latent space. We add an embedding layer with the weight matrix W_em to achieve a more compact and efficient representation y_j = W_em^T y_j + b_em, where b_em is the bias. We learn W_em and b_em while we train the network., wherein an RNN (LSTM or GRU) is used to (ultimately) represent the time information and the event information by the latent space vectors v^t and V^y (second representations) by applying RNN/LSTM/GRU model parameters (weights) to the temporal information (as shown in Figure 3).)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Du to incorporate the teachings of Wu for the same reasons as pointed out for claim 10.

Claim 13 is also rejected because it is just a system implementation of the same subject matter of claim 1 which can be found in Du and Wu. It is noted that claim 13 also recites a processor with memory and executable instructions which are found in Du (e.g., [p. 1559, Section 5.2, Figure 8] In our algorithm framework, we need both sparse (the marker y_j) and dense features at time t_j. … Therefore, we build an efficient and flexible platform particularly optimized for training general directed acyclic structured computational graph (DAG). The backend is supported via CUDA and MKL for GPU and CPU platform, respective.)

Claim 14/13 is also rejected because it is just a system implementation of the same subject matter of claim 2/1 which can be found in Du and Wu.

Claim 15/13 is also rejected because it is just a system implementation of the same subject matter of claim 3/1 which can be found in Du and Wu.

Claim 16/13 is also rejected because it is just a system implementation of the same subject matter of claim 4/1 which can be found in Du and Wu.

Claim 17/13 is also rejected because it is just a system implementation of the same subject matter of claim 5/1 which can be found in Du and Wu.

Claim 18/13 is also rejected because it is just a system implementation of the same subject matter of claim 6/1 which can be found in Du and Wu.

Claim 19 is also rejected because it is just a CRM implementation of the same subject matter of claim 1 which can be found in Du and Wu. It is noted that claim 19 also recites a processor with memory and executable instructions which are found in Du (e.g., [p. 1559, Section 5.2, Figure 8] In our algorithm framework, we need both sparse (the marker y_j) and dense features at time t_j. … Therefore, we build an efficient and flexible platform particularly optimized for training general directed acyclic structured computational graph (DAG). The backend is supported via CUDA and MKL for GPU and CPU platform, respective.)

Claim 20/19 is also rejected because it is just a CRM implementation of the same subject matter of claim 2/1 which can be found in Du and Wu.
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Chung et al. (“Unsupervised classification of multi-omics data during cardiac remodeling using deep learning”, Methods 166, 2018, pp. 66-73) teach the use of an LSTM-VAE (a variational recurrent neural network) for learning sequence trends, in which that neural network framework performs deep convolutional embedding with joint optimization and clustering.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ROBERT LEWIS KULP whose telephone number is (571)272-7983. The examiner can normally be reached M, Th, F 8-5:30; Tu 8-3.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang, can be reached on 571-270-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/ROBERT LEWIS KULP/Examiner, Art Unit 2124                                                                                                                                                                                                        
/MIRANDA M HUANG/Supervisory Patent Examiner, Art Unit 2124