DETAILED ACTION 
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 2022-04-01 has been entered.  The status of the claims is as follows:
Claims 1-15 remain pending in the application.
Claim 16 is cancelled.
Claims 1, 10, and 15 are amended.
Response to Arguments
Applicant argues on Remarks Page 9 that Begleiter does not teach applying CTW to an alphabet, because Begleiter decomposes the problem into a set of binary problems.  Examiner respectfully disagrees, as Begleiter’s algorithm is still applied to a multi-alphabet, regardless of how the result is ultimately achieved.  Nevertheless, the argument is moot, as Examiner has found another reference, Tjalkens et al., that describes CTW directly applied to a multi-alphabet.  Examiner also notes that, by Applicant’s own admission, CTW is well known in the art for sequence prediction, and a reference describing CTW was enclosed by Applicant in the IDS.
Applicant argues on Remarks Pages 9-10 that “the number of words in any window would change the number of word embeddings in that window. Thus, the number of word embeddings in each window, especially those of exponentially increasing size, as instantly claimed, would result in a greater number of word embeddings in each window.”  Examiner respectfully disagrees.  Each window, regardless of the size of words actually in it, contains words that are drawn from the same fixed set of discrete classes; that is, embeddings of words in the English language.  Furthermore, Applicant states:  “Applicant finds confusion in the Examiner's annotated Figure 4 of Wang (Office Action, page 7), where the Examiner annotates, "Embedding to Abstract Alphabet ek". ek, however, are simply the words (original alphabet), not an abstract alphabet.”  Examiner clarifies that it is clear from Wang that the words are input to the CNN as vector embeddings, as stated by Wang in Page 7 “Optimization”:  “For initialization, we use Word2Vec (Mikolov et al., 2013) for the starting state of the word-embeddings (trained on the same dataset as the main task).”  Thus, the words going into Wang’s CNN are the product of Word2Vec.  Examiner also points out that it is not clear how the CNN would “understand” the semantics of a Latin script word if this were not the case.  Examiner also points out Wang Bottom of Page 6:  “The parameters of a genCNN θ consists of the parameters for CNN θnn, word-embedding θembed, and the parameters for soft-max θsoftmax.”
Applicant, in the same argument above, discloses:  “However, claim 1 includes the feature that ‘the abstract alphabet is smaller in size than the original alphabet’. In Wang, the ‘abstract alphabet’ must be the same size as the original alphabet.”  Examiner acknowledges that the argument regarding this newly amended matter is persuasive.  However, the argument is now moot, as Examiner has brought in a new reference to teach this limitation, Zaheer et al., which also replaces Campbell et al. which was previously used for Claims 5 and 6.
Applicant argues on Remarks Pages 10-11 that “However, the Examiner proposes that, because of Agashe, one would be motivated to use exponentially increasing size for the history windows. This combination, however, is not compatible with the combined teachings of Zhang and Wang and, accordingly, there can be no motivation to make such a combination.  Further, the recited motivation of "saving on storage resources" would not be the case in the CNN-based system of Wang, as a larger history window would increase the number of feature maps and thus the complexity of the architecture.”  Examiner respectfully disagrees.  Wang and Agashe both propose handling histories of arbitrary length.  Wang’s teaching of identical CNNs (and thus identically sized history windows) does not mean that one of ordinary skill in the art would be dissuaded from making the improvement of Agashe of using exponentially increasing history windows, as there is no limit to inputs of a CNN, and so the CNN sizes can increase as well.  Examiner also responds that saving on resources is a valid motivation, as there may be an advantage, as history gets larger, to a small number of increasingly large CNNs, rather than a large number of constant-sized CNNs.  For example, each individual CNN may have some significant constant storage and performance overhead, in which case the former configuration is preferable.  Finally, after the CNNs complete operations and finish producing an output for each historical window, there will be less storage of these outputs in the former than in the latter configuration, because there are fewer historical windows.
Applicant argues on Remarks Pages 11-12 that the combination of Zhong, Wang, Agashe, Begleiter, and Thorhallsson does not teach “choose at least one hyperparameter for each of the plurality of history windows to allow the system to trade off bias-variance”, because “While the reference may describe the use of hyperparameters, there would be no reason why one skilled in the art would use these hyperparameters for each of the plurality of history window sizes, as instantly claimed.”  Examiner respectfully disagrees, as the combination of Zhong, Wang, Agashe, and Begleiter teaches the plurality of history windows, with each window comprising a machine learning model.  Thorhallsson teaches “choosing at least one hyperparameter to trade off bias-variance”, wherein bias-variance tradeoff is a well-known issue in the machine learning art.  Therefore, one of ordinary skill in the art would be motivated to apply Thorhallsson to each machine learning model, wherein Zhong, Wang, Agashe, and Begleiter have established machine learning models in history windows.  Thus, Thorhallsson would be applied to a machine learning model in each history window.  One would be motivated to do so, as stated in the previous office action, in order to “capture sophisticated relationships in the data while keeping it simple to prevent noise from affecting the outcome”, as per Thorhallsson Page 1 Intro Para 2, which describes trading off bias and variance.
Applicant argues, regarding independent Claim 10, similar arguments to Claim 1, but also on Remarks Page 14:  “Further, Applicant disagrees with the overall motivation to combine six discrete references to arrive at the instant invention…impermissible hindsight reconstruction is required to arrive at the instant invention by the picking and choosing of various features over such a large number of references.”  In response to applicant's argument that the examiner's conclusion of obviousness is based upon improper hindsight reasoning, it must be recognized that any judgment on obviousness is in a sense necessarily a reconstruction based upon hindsight reasoning.  But so long as it takes into account only knowledge which was within the level of ordinary skill at the time the claimed invention was made, and does not include knowledge gleaned only from the applicant's disclosure, such a reconstruction is proper.  See In re McLaughlin, 443 F.2d 1392, 170 USPQ 209 (CCPA 1971).
Applicant argues, regarding Claim 14, similar arguments to Claim 1, but also on Remarks Page 15:  “Further, Applicant disagrees with the overall motivation to combine seven discrete references to arrive at the instant invention…impermissible hindsight reconstruction is required to arrive at the instant invention by the picking and choosing of various features over such a large number of references.”  In response to applicant's argument that the examiner's conclusion of obviousness is based upon improper hindsight reasoning, it must be recognized that any judgment on obviousness is in a sense necessarily a reconstruction based upon hindsight reasoning.  But so long as it takes into account only knowledge which was within the level of ordinary skill at the time the claimed invention was made, and does not include knowledge gleaned only from the applicant's disclosure, such a reconstruction is proper.  See In re McLaughlin, 443 F.2d 1392, 170 USPQ 209 (CCPA 1971).
Applicant argues, regarding Claim 15, on Remarks Page 16, that “includes the feature that "the set of the convolutions includes a predetermined number of convolution layers, wherein the temporal convolution determines a maximum of the resulting events and ensures the fixed set of discreet classes is independent from the size of each of the plurality of history windows". Applicant submits that the art of record does not teach or fairly suggest such a treatment of the convolutions.”  Regarding this newly amended matter, Examiner respectfully disagrees, as Pan discloses a predetermined number of convolution layers in [0060]:  “The input layer inputs each sample (n*m feature matrix) to a convolutional layer for convolution” and [0062]:  “The previous step 2 and step 3 can be repeated multiple times. The previous step 2 and step 3 can be repeated multiple times. After a combination of the convolutional layer”.  Here, Pan discloses a predetermined number (1) of convolution layers.  Pan also discloses “wherein the temporal convolution determines a maximum of the resulting events”, as Pan discloses “maxpooling” in [0061]:  “After the obtained 100s feature graphs are processed by using an activation function (such as the RELU function), processed feature graphs are transferred to a pooling layer for pooling (for example, the maxpooling method can be used for pooling).”  Finally, Examiner points out that “and ensures the fixed set of discreet classes is independent from the size of each of the plurality of history windows” is non-limiting language that carries no patentable weight.  See MPEP 2111.04 (I):  “whereby clause in a method claim is not given weight when it simply expresses the intended result of a process step positively recited." Id. (quoting Minton v. Nat’l Ass’n of Securities Dealers, Inc., 336 F.3d 1373, 1381, 67 USPQ2d 1614, 1620 (Fed. Cir. 2003)).

Claim Objections
Claim 15 is objected to because of the following informalities: the claim recites “ensures the fixed set of discreet classes is independent”.  This contains a typo, as this spelling of “discreet” refers to being tactful or prudent, and is not the mathematical term “discrete”.  Examiner is interpreting this limitation as “ensures the fixed set of discrete classes is independent”.  Appropriate correction is required.
Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1 and 3-7 are rejected under 35 U.S.C. 103 as being unpatentable over Zhong et al. (“Toward a self-organizing pre-symbolic neural model representing sensorimotor primitives”; hereinafter “Zhong”) in view of Wang et al. (“genCNN: A Convolutional Architecture for Word Sequence Prediction”; hereinafter “Wang”), Agashe et al. (US 2014/0095412 A1; hereinafter “Agashe”), Tjalkens et al. (“On Prediction Using Variable Order Markov Models”; hereinafter “Tjalkens”), and Zaheer et al. (“Latent LSTM Allocation: Joint Clustering and Non-Linear Dynamic Modeling of Sequential Data”; hereinafter “Zaheer”).
As per claim 1, Zhong teaches an artificial intelligence system, comprising: a computing device including at least one processor, one or more data storage devices, and a non-transitory data storage medium interfaced with the at least one processor, the non-transitory data storage medium containing instructions that, when executed cause the at least one processor to (Zhong, Page 1 Abstract, discloses:  “This model employs neural network architecture incorporating a predictive sensory module based on an RNNPB (Recurrent Neural Network with Parametric Biases) and a horizontal product model.”  Here, Zhong discloses an artificial intelligence system.  Zhong, last sentence of Page 2 Section 1, discloses:  “Therefore it is sufficient to serve as a neural model that is running in a robot CPU to explain the language development in the cortical areas.” Here, Zhong discloses a computing device including at least one processor (a robot with a CPU), which also implies one or more data storage devices, and a non-transitory data storage medium interfaced with the at least one processor, the non-transitory data storage medium containing instructions that cause the at least one processor to execute.)
save observed sensory sequence information (Zhong, Page 3 Lines 4-10, discloses:  “A ball was attached at the end of the arm so that another robot could obtain the movement by tracking the ball. Our neural network was trained and run on the other NAO, which was called the “observer”. In this way, the observer robot perceived the object movement from its vision passively, so that its network took the object’s visual features and the movements into account”.  Here, Zhong discloses observed sensory sequence information, as the “observer robot” “perceived the object movement from its vision”, wherein “perceived” is another word for observed, and “vision” is sensory information, and “movement” implies this is over time, and thus a sequence of observed images.  Zhong, Top Left Page 5 Section 2.2 also discloses “error back-propagated from a certain sensory information sequence to the PB units”.  While it is likely implied that the information is saved, Zhong explicitly recites saving it (“storing”) in Page 9 Conclusion Para 2 Lines 4-5: “These PB units allow for storing multiple sensory sequences.”)
Zhong fails to teach that the observed sensory sequence information is saved, as an original alphabet in a plurality of history windows, the plurality of history windows being reverse chronological history windows, wherein a size of the plurality of history windows increase exponentially, wherein the plurality of history windows do not overlap; apply a function to the observed sensory sequence information in each history window, wherein the function maps the observed sensory sequence information into a fixed set of discrete classes, wherein the fixed set of discrete classes is the same fixed set of discrete classes across each of the plurality of history windows; apply a context tree weighting algorithm to an abstract alphabet resulting from the fixed set of discrete classes for each of the plurality of history windows, wherein the abstract alphabet is smaller in size than the original alphabet; and make a prediction in the original alphabet based on patterns in the abstract alphabet.
Wang teaches that the [observed sensory] sequence information is saved as an original alphabet, in a plurality of history windows, the plurality of history windows being reverse chronological history windows. (Recall above Zhong discloses the sequence information is observed sensory sequence information.  Wang, Page 2 Section 2 Sentence 1, discloses “As shown in Figure 1, genCNN is overall recursive, consisting of CNN-based processing units of two types:  alphaCNN as the ‘front-end’, dealing with the history that is closest to the prediction; betaCNNs (which can repeat), in charge of more ‘ancient’ history.”  Wang, Page 2 Section 2 Last Sentence, discloses “Also distinct from RNN, genCNN gains most of its processing power from the heavy-duty processing units (i.e., CNN and CNNs), which follow a bottom-up information flow and yet can adequately capture the temporal structure in word sequence with its convolutional-gating architecture.”  Wang discloses saved as an original alphabet in a plurality of history windows in Page 2 Figure 1:

[Image: media_image1.png, Wang, Page 2, Figure 1]

Here, one can see in the “history” row, the English words, which are part of the “original alphabet” (the finite set of words in the English language as written in Latin characters).
Examiner’s Note:  Here, alphaCNN and betaCNN are the plurality of history windows, and they are saving word sequence information. These windows go back in time progressively further, and are thus reverse chronological history windows.)
apply a function to the observed [sensory] sequence information in each history window, wherein the function maps the observed [sensory] sequence information into a fixed set of discrete classes, wherein the fixed set of discrete classes is the same fixed set of discrete classes across each of the plurality of history windows and an abstract alphabet resulting from the fixed set of discrete classes for each of the plurality of history windows (Recall above Zhong discloses the sequence information is observed sensory sequence information.  For the following, see Wang Page 2 Figure 1 and Page 5 Figure 4:

[Image: media_image2.png, Wang, Page 2 Figure 1 and Page 5 Figure 4]

Wang, Page 5 Section 3.3, discloses:  “As suggested early on in Section 2 and Figure 1, we use extra CNNs with conventional weight-sharing, named βCNN, to summarize the history out of scope of αCNN. More specifically, the output of βCNN (with the same dimension of word-embedding) is put before the first word as the input to the αCNN, as illustrated in Figure 4. Different from αCNN, βCNN is designed just to summarize the history, with weight shared across its convolution units.”  Here, Wang discloses to apply a function (“αCNN” and “βCNN”) to the observed sequence information in each history window.  The function maps the sequence information into a fixed set of discrete classes (“word embedding”, as in “the output of βCNN (with the same dimension of word-embedding)”).  Note that the “fixed set of discrete classes” here is the words of the English language as represented by “word embedding”, in which words are represented by numeric values in a vector.  Wang Page 6 Section 4 discloses:  “The parameters of a genCNN θ consists of the parameters for CNN θnn, word-embedding θembed, and the parameters for soft-max θsoftmax.”, and on Page 7 Paragraph 1, discloses:  “We use word embedding with dimension 100” and in Para 3:  “For initialization, we use Word2Vec (Mikolov et al., 2013) for the starting state of the word-embeddings”.  There is a finite number of words in the English language, and thus English words represented by word embeddings are a fixed set of discrete classes that do not depend on the size of the βCNN windows. The fixed set of discrete classes is the same fixed set of discrete classes across each of the plurality of history windows, as each window contains discrete classes derived from the same set of discrete classes, that is, words in the English language.
These embeddings are an abstract representation of English words (an “abstract alphabet”), as opposed to the more common original representation with Latin script, and these words ek (shown in Figure 4 above), in embedding form, are input to each βCNN (“The parameters of a genCNN θ consists of…word-embedding θembed”).
Wang also teaches make a prediction in the original alphabet based on patterns in the abstract alphabet (Wang, as shown above in the figure showing Fig. 1 and Fig. 4, makes a prediction in the original alphabet (“prediction:  ‘sandwich’”) based on patterns in the abstract alphabet (based on the outputs of the βCNNs being input into the αCNN)).
Zhong and Wang are analogous art because they are both in the field of machine learning.  
Zhong teaches a method to make a prediction based on historical sequence information, as stated in Zhong Page 6 Section 3.3:  “In this simulation, the obtained PB units from the previous recognition experiment were used to generate the predicted movements using the prior knowledge of a specific object.”  Wang also teaches a method of making a prediction based on historical sequence information as stated in Wang Abstract:  “Instead, we use a convolutional neural network to predict the next word with the history of words of variable length”, however Wang adds a plurality of history windows.  Zhong discloses using an RNN as stated in Zhong, Page 1 Abstract:  “This model employs neural network architecture incorporating a predictive sensory module based on an RNNPB (Recurrent Neural Network with Parametric Biases)”, and Wang discloses replacing RNN with their genCNN model as stated in Wang, Page 1 Abstract:  “Different from previous work on neural network-based language modeling and generation (e.g., RNN or LSTM), we choose not to greedily summarize the history of words as a fixed length vector. Instead, we use a convolutional neural network”.  To apply Wang’s replacement of RNN with genCNN to Zhong would result in being able to make a sensory sequence prediction based on a plurality of sensory sequence information windows.  Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Zhong and Wang.  One would have been motivated to do so in order to make more accurate predictions by utilizing more historical data (Wang, Abstract:  “We argue that our model can give adequate representation of the history, and therefore can naturally exploit both the short and long range dependencies”… “Our extensive experiments on text generation and n-best re-ranking in machine translation show that genCNN outperforms the state-of-the-arts with big margins”.  
Wang, Page 6 Section 3.4.2 also discloses the advantages of the convolutional architecture:  “genCNN takes the ‘uncompressed’ history, therefore avoids • the difficulty in finding the representation for history (i.e., unfinished sentences), especially those end in the middle of a chunk (e.g.,“the cat sat on the”) • the damping effort in RNN when the history-summarizing hidden states are updated at each time, which renders the long term memory rather difficult.”)
The combination of Zhong and Wang fails to teach wherein a size of the plurality of history windows increase exponentially, wherein the plurality of history windows do not overlap; apply a context tree weighting algorithm to an alphabet resulting from the fixed set of discrete classes for each of the plurality of history windows, wherein the abstract alphabet is smaller in size than the original alphabet;
Agashe teaches wherein a size of the plurality of history windows increase exponentially, wherein the plurality of history windows do not overlap. (Agashe Para [0076] discloses “FIGS. 5A and 5B illustrate an example linear buffer 500 representing a time window comprising exponentially-decaying time intervals and a series of associated counters in accordance with an embodiment of the invention. The time window and counters may track the occurrence of a particular type of event involving a particular asset. In FIG. 5A, element 501 represents the most recent 1-minute time interval beginning at 12:44 and ending at 12:45, element 502 represents a previous 2-minute time interval beginning at 12:42 and ending at 12:44, element 503 represents a previous 4-minute time interval beginning at 12:38 and ending at 12:42, and element 504 represents a previous 8-minute time interval beginning at 12:30 and ending at 12:38. Each element stores a counter associated with the time interval representing the element. At 12:30, the time window was created and the linear buffer was initialized. In FIG. 5A, approximately 15 minutes have elapsed between the creation of the time window and the current time, 12:44:23. A counter associated with the element 501 has been incremented 7 times, indicating that 7 events have occurred in the time interval represented by the element 501. A counter associated with the element 502 has been incremented 34 times, indicating that 34 events have occurred in the time interval represented by the element 502. A counter associated with the element 503 has been incremented 50 times, indicating that 50 events have occurred in the time interval representing the element 503. A counter associated with the element 504 has been incremented 72 times, indicating that 72 events have occurred in the time interval representing the element 504.”)

[Image: media_image3.png, Agashe figure of the linear buffer 500 described in Para [0076]]


Agashe and the combination of Zhong and Wang are analogous art because they are both in the field of machine learning. 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Agashe with the combination of Zhong and Wang to include saving sequence information in exponentially increasing history windows.  One would have been motivated to do so in order to be able to make more accurate predictions based on more data, while saving on storage resources (Agashe, Para [0075]:  “Exponentially decaying time intervals may allow precise tracking of recent events while conserving storage capacity by passing a record of the occurrence of events to progressively larger time intervals as time progresses and the events become less recent”). 
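For illustration only, the window layout discussed above (non-overlapping, reverse chronological, each window twice the size of the previous one) can be sketched as follows. This is a minimal sketch and not code from Agashe or any other cited reference; the function name `exponential_windows` and the parameter `base` are hypothetical, and index 0 is assumed to denote the most recent observation.

```python
def exponential_windows(history_len, base=2):
    """Split indices 0..history_len-1 (0 = most recent) into
    non-overlapping windows of sizes 1, base, base**2, ...

    Returns (start, end) pairs with end exclusive, ordered from the
    most recent window to the oldest. The oldest window is truncated
    if the history runs out. (Hypothetical helper, for illustration.)
    """
    windows = []
    start, size = 0, 1
    while start < history_len:
        end = min(start + size, history_len)
        windows.append((start, end))
        start, size = end, size * base
    return windows


# The passage quoted from Agashe covers 15 minutes with intervals
# of 1, 2, 4 and 8 minutes, counted back from the current time.
minutes_back = exponential_windows(15)
sizes = [end - start for start, end in minutes_back]
```

Under these assumptions, a history of 1000 observations is covered by 10 windows, whereas constant-size windows of length 8 would require 125, which is the kind of reduction in the number of windows referred to in the motivation above.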
The combination of Zhong, Wang, and Agashe fails to teach apply a context tree weighting algorithm to an abstract alphabet resulting from the fixed set of discrete classes for each of the plurality of history windows to predict a future discrete sequence.
Tjalkens teaches apply a context tree weighting algorithm to an [abstract] alphabet [resulting from the fixed set of discrete classes for each of the plurality of history windows] (Recall above that Wang teaches an abstract alphabet resulting from the fixed set of discrete classes for each of the plurality of history windows.  Tjalkens, Page 128 Introduction, discloses:  “This context tree weighting (CTW) algorithm achieves the asymptotically optimal redundancy behavior for the class of FSMX sources…We will discuss here the compression of multi-alphabet sources.  First we extend the CTW algorithm…to the non-binary case.”  Furthermore, on Page 129, Section 2 entitled “The multi-alphabet context tree weighting algorithm”, it begins:  “Let us consider the case where the source alphabet, A, is non-binary, i.e. |A| > 2.”  Finally, on Page 133 Section 5, Tjalkens states:  “We have seen that it is possible to use the context tree weighting algorithm for multi-alphabet sources and it still shows the optimal redundancy behavior.”  Thus, Tjalkens discloses applying CTW to an alphabet (A) that is a fixed set of discrete classes (|A| classes).)
Tjalkens and the combination of Zhong, Wang, and Agashe are analogous art because they are both in the field of machine learning.  
The combination of Zhong, Wang, and Agashe is a sequence prediction algorithm (recall in the motivation statement for the combination of Wang and Zhong that both of these works are concerned with sequence prediction), and Tjalkens’ multi-alphabet CTW algorithm is concerned with compression, which is known in the art as being an equivalent problem to sequence prediction.  Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the multi-alphabet CTW algorithm of Tjalkens with the combination of Zhong, Wang, and Agashe.  One would have been motivated to do so to improve performance of the sequence prediction (Tjalkens, Pg. 128 Abstract:  “We report several results on this problem and describe some algorithms that realize an improved and even asymptotically optimal redundancy behavior.”)
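For illustration only, the relationship between a context tree weighting probability assignment and next-symbol prediction over a non-binary alphabet can be sketched as follows. This is a simplified sketch under stated assumptions and is not the algorithm as specified in Tjalkens: each node uses a generic Dirichlet(1/2) estimator, the first `depth` symbols are treated as given context, and probabilities are recomputed from scratch rather than updated incrementally. The names `ctw_log_prob` and `predict_next` are hypothetical.

```python
import math
from collections import defaultdict


def ctw_log_prob(seq, alphabet_size, depth):
    """Log of the weighted probability of seq[depth:], conditioned on
    the first `depth` symbols. Symbols are ints in range(alphabet_size)."""
    counts = defaultdict(lambda: [0] * alphabet_size)
    log_pe = defaultdict(float)  # per-node log estimator probability
    for t in range(depth, len(seq)):
        x = seq[t]
        for d in range(depth + 1):
            # Context of length d, most recent symbol first.
            ctx = tuple(reversed(seq[t - d:t]))
            c = counts[ctx]
            # Assumed Dirichlet(1/2) estimator for the next symbol.
            log_pe[ctx] += math.log((c[x] + 0.5) / (sum(c) + alphabet_size / 2))
            c[x] += 1

    def log_pw(ctx):
        if len(ctx) == depth:  # leaf: estimator only
            return log_pe[ctx]
        # Unseen children contribute probability 1 (log 0), so skip them.
        children = sum(log_pw(ctx + (a,))
                       for a in range(alphabet_size) if ctx + (a,) in log_pe)
        own = log_pe[ctx]
        m = max(own, children)
        # log(0.5 * exp(own) + 0.5 * exp(children)), computed stably.
        return math.log(0.5) + m + math.log(math.exp(own - m) + math.exp(children - m))

    return log_pw(())


def predict_next(seq, alphabet_size, depth):
    """Predictive distribution over the next symbol as a probability ratio."""
    base = ctw_log_prob(seq, alphabet_size, depth)
    return [math.exp(ctw_log_prob(seq + [a], alphabet_size, depth) - base)
            for a in range(alphabet_size)]


# Alternating sequence over a 3-symbol alphabet; the last symbol is 1.
probs = predict_next([0, 1] * 10, alphabet_size=3, depth=2)
```

In this sketch the predictive probabilities sum to one, and the symbol 0 receives most of the probability mass after the alternating history, which reflects the sense in which a compression-oriented probability assignment doubles as a sequence predictor.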
However, the combination of Zhong, Wang, Agashe, and Tjalkens fails to teach wherein the abstract alphabet is smaller in size than the original alphabet.
Zaheer teaches wherein the abstract alphabet is smaller in size than the original alphabet.  (Zaheer, Page 1 Abstract, discloses:  “In this paper, we introduce Latent LSTM Allocation (LLA) for user modeling combining hierarchical Bayesian models with LSTMs. In LLA, each user is modeled as a sequence of actions, and the model jointly groups actions into topics and learns the temporal dynamics over the topic sequence, instead of action space directly.”  Zaheer, Page 8 Section 5, discloses:  “We achieve this by shifting from modeling the temporal dynamics at the observed word level to modeling the dynamics at a higher level of abstraction: topics. As the number of topics K is much smaller than the number of words V , it can act as a knob that can trade-off accuracy vs model size.”  Here, Zaheer discloses an abstract alphabet (“topics K”) that is smaller in size than the original alphabet (“words V”)).
Zaheer and the combination of Zhong, Wang, Agashe, and Tjalkens are analogous art because they are all in the field of machine learning.  
The combination of Zhong, Wang, Agashe, and Tjalkens is a sequence prediction algorithm, and Zaheer is also concerned with sequence prediction (Zaheer Page 2 Section 3:  “Lastly and most importantly, the model should be accurate in terms of predicting future events. We show how LLA satisfies all of these requirements.”)  Therefore, it would have been obvious before the effective filing date of the claimed invention to combine Zaheer’s smaller abstract alphabet with the sequence prediction algorithm of Zhong, Wang, Agashe, and Tjalkens.  One of ordinary skill in the art would be motivated to do so in order to build a smaller more efficient model that can still capture important temporal properties (Zaheer, Page 1 Intro:  “The increase in complexity and parameters arises due to a large action space in which many of the actions have similar intent or topic… learns the temporal dynamics over the topic sequence, instead of action space directly. This leads to a model that is highly interpretable, concise, and can capture intricate dynamics” and Zaheer Page 8 Section 5:  “As the number of topics K is much smaller than the number of words V , it can act as a knob that can trade-off accuracy vs model size…Furthermore, the topics provide an informative embedding that can reveal interesting temporal relationship as shown in Figure 6 and 7 – which is a novel contribution to the best of our knowledge.”)
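For illustration only, the notion of an abstract alphabet that is smaller than the original alphabet can be sketched with a many-to-one lookup table. This is a minimal sketch and not Zaheer's method: Zaheer learns the grouping of words into topics jointly with the temporal model, whereas here a fixed, hypothetical table `WORD_TO_TOPIC` stands in for that learned mapping, and the words and topic names are invented for the example.

```python
# Hypothetical original alphabet (V = 6 words) mapped onto a
# hypothetical abstract alphabet (K = 2 topics).
WORD_TO_TOPIC = {
    "sandwich": "food", "pizza": "food", "apple": "food",
    "cat": "animal", "dog": "animal", "bird": "animal",
}


def to_abstract(sequence, mapping=WORD_TO_TOPIC):
    """Map a sequence in the original alphabet to the abstract alphabet."""
    return [mapping[word] for word in sequence]


original_alphabet = set(WORD_TO_TOPIC)            # V = 6
abstract_alphabet = set(WORD_TO_TOPIC.values())   # K = 2
abstract_seq = to_abstract(["cat", "pizza", "dog", "apple"])
```

Under these assumptions, a sequence model operating on `abstract_seq` only needs to distinguish K symbols rather than V, which is the accuracy versus model size trade-off that the quoted passage of Zaheer describes.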

As per Claim 3, the combination of Zhong, Wang, Agashe, Tjalkens, and Zaheer teaches the artificial intelligence system of claim 1. Wang teaches wherein the function is a feature-wise maximum over time steps in one or more of the plurality of history windows. (Wang, Page 2 Section 2 Sentence 1, discloses “As shown in Figure 1, genCNN is overall recursive, consisting of CNN-based processing units of two types:  alphaCNN as the ‘front-end’, dealing with the history that is closest to the prediction; betaCNNs (which can repeat), in charge of more ‘ancient’ history.”  Wang, Page 4 Section 3.2 Line 1, discloses another embodiment in which “Previous CNNs, including those for NLP tasks (Hu et al., 2014; Kalchbrenner et al., 2014), take a straightforward convolution-pooling strategy, in which the ‘fusion’ decisions (e.g., selecting the largest one in max-pooling) are made based on the values of feature-maps”. Here, Wang is describing a plurality of history windows alphaCNN and betaCNN that comprise time steps.  Each of these history windows comprises a convolutional neural network (CNN).  These CNNs comprise a “convolution-pooling strategy” (i.e., function), such as “selecting the largest one” (i.e., maximum) “based on the values of feature-maps” (i.e. feature-wise). This function is done in each CNN, and is thus applied over time steps in the plurality of history windows). 
Wang and the combination of Zhong, Agashe, Tjalkens, and Zaheer are analogous art because they are both in the field of machine learning.  
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine this teaching of Wang with the existing combination of Zhong, Agashe, Tjalkens, Zaheer, and the primary teaching of Wang to include max-pooling (selecting the largest one) based on the values of feature maps.  One would have been motivated to do so to save time by relying on known work, as it is a “straightforward” strategy (Wang, Page 4 Section 3.2 Line 1:  “Previous CNNs, including those for NLP tasks (Hu et al., 2014; Kalchbrenner et al., 2014), take a straightforward convolution-pooling strategy, in which the ‘fusion’ decisions (e.g., selecting the largest one in max-pooling) are made based on the values of feature-maps”).
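For illustration only, the claimed "feature-wise maximum over time steps" can be sketched as follows. This is a minimal sketch and not code from Wang or from the instant application; the function name feature_wise_max and the sample values are hypothetical, and the window is assumed to be a list of time steps, each holding one value per feature.

```python
def feature_wise_max(window):
    """Return, for each feature, its maximum over all time steps in the window.

    `window` is a list of time steps; each time step is a list of feature values
    (an assumed layout, chosen only for this illustration).
    """
    if not window:
        raise ValueError("window must contain at least one time step")
    # zip(*window) groups the values of each feature across the time steps.
    return [max(values) for values in zip(*window)]


# Hypothetical history window: 3 time steps, 2 features per time step.
window = [[0.1, 0.9], [0.7, 0.2], [0.4, 0.5]]
pooled = feature_wise_max(window)  # one value per feature
```

The result has one entry per feature regardless of how many time steps the window holds, which is why the same function can be applied to history windows of different sizes.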

As per Claim 4, the combination of Zhong, Wang, Agashe, Tjalkens, and Zaheer teaches the artificial intelligence system of claim 3 and sensory sequence information (see Rejection to Claim 1).  Tjalkens suggests wherein the observed [sensory] sequence information is a binary event (Tjalkens, Page 128 Abstract, discloses:  “The redundancy term is linear in the number of free parameters, which for binary sources equals the number of states in the source.” Here, Tjalkens suggests that the original CTW algorithm is for binary sources, before extending the algorithm to non-binary sources.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Tjalkens with Zhong, Wang, Agashe, and Zaheer for at least the reasons recited in Claim 1.

As per Claim 5, the combination of Zhong, Wang, Agashe, Tjalkens, and Zaheer teaches the artificial intelligence system of claim 1 as well as at least one processor, arbitrary length histories, and context tree weighting algorithm (see Rejection to Claim 1).  Zaheer teaches wherein [the instructions cause the at least one processor] to use a deep neural network classifier to map [arbitrary length] histories to a second alphabet having smaller length than the alphabet of the [arbitrary length] histories [as an input sequence for the context tree weighting algorithm]. (Zaheer, Page 1 Abstract, discloses:  “In this paper, we introduce Latent LSTM Allocation (LLA) for user modeling combining hierarchical Bayesian models with LSTMs. In LLA, each user is modeled as a sequence of actions, and the model jointly groups actions into topics and learns the temporal dynamics over the topic sequence, instead of action space directly.”  Zaheer, Page 8 Section 5, discloses:  “We achieve this by shifting from modeling the temporal dynamics at the observed word level to modeling the dynamics at a higher level of abstraction: topics. As the number of topics K is much smaller than the number of words V , it can act as a knob that can trade-off accuracy vs model size.”  Zaheer also discloses that this is used for sequence prediction in Page 2 Section 3:  “Lastly and most importantly, the model should be accurate in terms of predicting future events. We show how LLA satisfies all of these requirements.”  Thus, Zaheer discloses a system comprising a deep neural network (“LSTM”) as part of a classifier (makes a prediction of the next value of a sequence, which is a discrete set of classes), that maps histories to a second alphabet (“topics K”) having smaller length than the alphabet of the histories (“words V”).  
Zaheer Page 2 Section 2.2 also explicitly refers to classification:  “LSTM, a type of RNN, is well suited for the task as it can learn from experience to classify, process, and predict time series when there are very long time lags of unknown size between important events.”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Zaheer with the combination of Zhong, Wang, Agashe, and Tjalkens for at least the reasons recited in Claim 1.

As per Claim 6, the combination of Zhong, Wang, Agashe, Tjalkens, and Zaheer teaches the artificial intelligence system of claim 5 as well as mapping the histories to a second alphabet having smaller length than the alphabet of the histories and long short-term memory (LSTM) (see Rejection to Claim 5 above).  Zaheer teaches wherein a long short-term memory-based sequence to symbol method is used to map the [arbitrary length] histories to a second alphabet having smaller length than the alphabet of the [arbitrary length] histories. (Zaheer, as shown above in the rejection to Claim 5, maps histories to a smaller alphabet, and performs sequence prediction, using an LSTM.  Thus, Zaheer discloses using a long short-term memory based (“LSTM”) sequence to symbol (“sequence prediction”) method to map the histories to a second alphabet having smaller length than the alphabet of the histories. Zaheer, Page 3 end of Section 3.1 also states:  “The LSTM output represents topic proportions for document/user at time t. The LSTM input over topics can capture semantic notion of topics.”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Zaheer with the combination of Zhong, Wang, Agashe, and Tjalkens for at least the reasons recited in Claim 1.

As per Claim 7, the combination of Zhong, Wang, Agashe, Tjalkens, and Zaheer teaches the artificial intelligence system of claim 1 as well as at least one processor and sensory information (see Rejection to Claim 1).  Wang teaches wherein the instructions cause [the at least one processor] to perform a temporal convolution in a deep neural network to map observed [sensory] sequence information from the plurality of history windows to symbols. (Wang, Page 2 Section 2 Sentence 1, discloses “As shown in Figure 1, genCNN is overall recursive, consisting of CNN-based processing units of two types:  alphaCNN as the ‘front-end’, dealing with the history that is closest to the prediction; betaCNNs (which can repeat), in charge of more ‘ancient’ history.”  Wang, Section 2, defines each of the history windows as Convolutional Neural Networks (alpha-CNN and beta-CNN), which “capture the temporal structure” of the sequence.  Wang Page 3 Figure 2 and Page 5 Figure 4 illustrate that each CNN makes a prediction of the next word (i.e., symbol)).  
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Wang with the combination of Zhong, Agashe, Tjalkens, and Zaheer for at least the reasons recited in Claim 1.

Claim 2 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Zhong, Wang, Agashe, Tjalkens, and Zaheer, further in view of Thorhallsson et al. (“Visualizing the Bias-Variance Tradeoff”; hereinafter, “Thorhallsson”).
As per Claim 2, the combination of Zhong, Wang, Agashe, Tjalkens, and Zaheer teaches the artificial intelligence system of claim 1 as well as at least one processor, and plurality of history windows (see Rejection to Claim 1).  However, the combination of Zhong, Wang, Agashe, Tjalkens, and Zaheer fails to teach wherein the instructions cause the at least one processor to choose at least one hyperparameter for each of the plurality of history windows to allow the system to trade off bias-variance.
Thorhallsson teaches wherein the instructions cause [the at least one processor] to choose at least one hyperparameter [for each of the plurality of history windows] to allow the system to trade off bias-variance.  (Thorhallsson, Page 1 Intro Para 2, discloses that “This leads to a fundamental tradeoff known as the bias-variance tradeoff which is of paramount importance for optimal choice of the hyperparameters for the learning algorithms”). 
Thorhallsson and the combination of Zhong, Wang, Agashe, Tjalkens, and Zaheer are analogous art because they are both in the field of machine learning.  
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Thorhallsson with the combination of Zhong, Wang, Agashe, Tjalkens, and Zaheer to include selecting the optimal choice of hyperparameters, for which the bias-variance tradeoff is of paramount importance.  One would have been motivated to do so to “capture sophisticated relationships in the data while keeping it simple to prevent noise from affecting the outcome” (Thorhallsson, Page 1 Intro, Paragraph 2).

Claims 8 and 9 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Zhong, Wang, Agashe, Tjalkens, and Zaheer, further in view of Pan et al. (US 2020/0211106 A1; hereinafter, “Pan”).
As per Claim 8, the combination of Zhong, Wang, Agashe, Tjalkens, and Zaheer teaches the artificial intelligence system of claim 7 as well as plurality of history windows and temporal convolution (see Rejection to Claim 7).  Wang teaches time steps (Wang, Page 2 Section 2 Sentence 1, discloses “As shown in Figure 1, genCNN is overall recursive, consisting of CNN-based processing units of two types:  alphaCNN as the ‘front-end’, dealing with the history that is closest to the prediction; betaCNNs (which can repeat), in charge of more ‘ancient’ history.”  Wang, Page 4 Section 3.2 Line 1, discloses another embodiment in which “Previous CNNs, including those for NLP tasks (Hu et al., 2014; Kalchbrenner et al., 2014), take a straightforward convolution-pooling strategy, in which the ‘fusion’ decisions (e.g., selecting the largest one in max-pooling) are made based on the values of feature-maps”. Here, Wang is describing a plurality of history windows alphaCNN and betaCNN that comprise time steps.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Wang with the combination of Zhong, Agashe, Tjalkens, and Zaheer for at least the reasons recited in Claim 1.
However, the combination of Zhong, Wang, Agashe, Tjalkens, and Zaheer fails to teach wherein the temporal convolution includes defining each of the plurality of history windows of events as a (2^k)-by-n matrix, where 2^k is a number of time steps in each of the plurality of history windows and n is a number of events at each of the time steps, and applying a convolution that is an l-by-n matrix, where l is less than 2^k, wherein a set of the convolutions produces a new set of events.
Pan teaches wherein the [temporal] convolution includes defining each [of the plurality of history windows] of events as a (2^k)-by-n matrix, where 2^k is a number of [time steps in each of the plurality of history windows] and n is a number of events [at each of the time steps], and applying a convolution that is an l-by-n matrix, where l is less than 2^k, wherein a set of the convolutions produces a new set of events. (Pan Para [0054] discloses an “n*m feature matrix” in which one dimension “m is the quantity of sub time periods” (i.e., time steps) and the other dimension “n is the quantity of feature types” (i.e., events).  Pan Para [0060] discloses that  “The input layer inputs each sample (n*m feature matrix) to a convolutional layer for convolution.”  Pan further discloses “convolution kernel quantity and size can be specified as needed” (Examiner’s note:  a convolution kernel is a matrix) wherein “at least one of a row quantity or a column quantity of the convolution kernel can be a predetermined quantity of feature types”, which is analogous to the instant application, where “n” is the column dimension for both the 2^k-by-n matrix and the l-by-n matrix (Pan describes an “n-by-m feature matrix”, and “a size of each convolution kernel can be n*j”).  Pan further discloses that the remaining dimension of the convolutional kernel must be less than the corresponding dimension of the feature matrix (“size of each convolution kernel can be n*j, where j is a positive integer less than m”).  This is analogous to the instant application, where l must be less than 2^k.  Pan Para [0060] further states that “convolutional layer can output 100s feature graphs.” (i.e. the output is a new set of events)).
Pan and the combination of Zhong, Wang, Agashe, Tjalkens, and Zaheer are analogous art because they are both in the field of machine learning.  
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Pan with the combination of Zhong, Wang, Agashe, Tjalkens, and Zaheer to include a convolution kernel wherein at least one of a row quantity or a column quantity of the convolution kernel can be a predetermined quantity of feature types (i.e., events).  One would have been motivated to do so “because feature types (i.e. events) are not continuous, the convolution kernel does not need to be scanned in a distribution direction of each feature type” (Pan, Para [0060], Sentence 3).
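For illustration only, the dimensional relationship recited in Claim 8 (a (2^k)-by-n window and an l-by-n kernel with l less than 2^k) can be sketched as follows. This is a minimal sketch under stated assumptions and is not code from Pan or from the instant application: the function name temporal_convolution and the sample matrices are hypothetical, rows are assumed to be time steps and columns events (Pan's n*m feature matrix uses the transposed orientation), and, as is conventional in CNNs, the kernel is applied without flipping. Because the kernel spans all n columns, it is slid only along the time axis.

```python
def temporal_convolution(window, kernel):
    """Slide an l-by-n kernel along the time axis of a T-by-n window.

    Returns T - l + 1 values, one per kernel position (no padding).
    """
    num_steps, l = len(window), len(kernel)
    n = len(window[0])
    if any(len(row) != n for row in window) or any(len(row) != n for row in kernel):
        raise ValueError("window and kernel must both have n columns")
    if not l < num_steps:
        raise ValueError("kernel height l must be less than the number of time steps")
    outputs = []
    for start in range(num_steps - l + 1):
        # Element-wise product of the kernel with the l time steps under it.
        total = 0.0
        for i in range(l):
            for j in range(n):
                total += window[start + i][j] * kernel[i][j]
        outputs.append(total)
    return outputs


# Hypothetical window with k = 2 (2^k = 4 time steps) and n = 2 events per step.
window = [[1, 0], [0, 1], [1, 1], [0, 0]]
kernel = [[1, 0], [0, 1]]  # l = 2, which is less than 2^k = 4
new_events = temporal_convolution(window, kernel)
```

Applying several such kernels to the same window yields several output sequences, which corresponds to a set of convolutions producing a new set of events.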

As per Claim 9, the combination of Zhong, Wang, Agashe, Tjalkens, Zaheer, and Pan teaches the artificial intelligence system of claim 8 as shown above.  Wang teaches wherein the convolution is applied to each of the plurality of history windows. (Wang, Section 2 Sentence 1, discloses “As shown in Figure 1, genCNN is overall recursive, consisting of CNN-based processing units of two types:  alphaCNN as the ‘front-end’, dealing with the history that is closest to the prediction; betaCNNs (which can repeat), in charge of more ‘ancient’ history.”  Wang Section 2 describes each of the history windows as Convolutional Neural Networks (CNNs)).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Wang with the combination of Zhong, Agashe, Tjalkens, Zaheer, and Pan for at least the reasons recited in Claim 1.
 
Claims 10, 11, 12, and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Zhong in view of Wang, Agashe, Tjalkens, Zaheer, and Thorhallsson.  
As per Claim 10, Zhong teaches an artificial intelligence system, comprising: a computing device including at least one processor, one or more data storage devices, and a non-transitory data storage medium interfaced with the at least one processor, the non-transitory data storage medium containing instructions that, when executed cause the at least one processor to (Zhong, Page 1 Abstract, discloses:  “This model employs neural network architecture incorporating a predictive sensory module based on an RNNPB (Recurrent Neural Network with Parametric Biases) and a horizontal product model.”  Here, Zhong discloses an artificial intelligence system.  Zhong, Page 3 last sentence of Section 1, discloses:  “Therefore it is sufficient to serve as a neural model that is running in a robot CPU to explain the language development in the cortical areas.” Here, Zhong discloses a computing device including at least one processor (a robot with a CPU), which also implies one or more data storage devices, and a non-transitory data storage medium interfaced with the at least one processor, the non-transitory data storage medium containing instructions that cause the at least one processor to execute.)
save observed sensory sequence information (Zhong, Page 3 Lines 4-10, discloses:  “A ball was attached at the end of the arm so that another robot could obtain the movement by tracking the ball. Our neural network was trained and run on the other NAO, which was called the “observer”. In this way, the observer robot perceived the object movement from its vision passively, so that its network took the object’s visual features and the movements into account”.  Here, Zhong discloses observed sensory sequence information, as the “observer robot” “perceived the object movement from its vision”, wherein “perceived” is another word for observed, and “vision” is sensory information, and “movement” implies this is over time, and thus a sequence of observed images.  Zhong, Section 2.2 also discloses “error back-propagated from a certain sensory information sequence to the PB units”.  While it is likely implied that the information is saved, Zhong explicitly recites saving it (“storing”) in Conclusion Para 2 Lines 4-5: “These PB units allow for storing multiple sensory sequences.”)
Zhong fails to teach that the observed sensory sequence information is saved, as an original alphabet in a plurality of history windows, the plurality of history windows being reverse chronological history windows, wherein a size of the plurality of history windows increase exponentially, wherein the plurality of history windows do not overlap; apply a function to the observed sensory sequence information in each history window, wherein the function maps the observed sensory sequence information into a fixed set of discrete classes, wherein the fixed set of discrete classes is the same fixed set of discrete classes across each of the plurality of history windows; choose at least one hyperparameter for each of the plurality of history window to allow the system to trade off bias-variance; use a deep neural network classifier to map arbitrary length histories to an abstract alphabet having smaller length than the original alphabet of the arbitrary length history windows as an input sequence for the context tree weighting algorithm; apply a context tree weighting algorithm to the abstract alphabet resulting from the fixed set of discrete classes for each of the plurality of history windows; and make a prediction in the original alphabet based on patterns in the abstract alphabet.
Wang teaches that the [observed sensory] sequence information is saved as an original alphabet, in a plurality of history windows, the plurality of history windows being reverse chronological history windows. (Recall above Zhong discloses the sequence information is observed sensory sequence information.  Wang, Page 2 Section 2 Sentence 1, discloses “As shown in Figure 1, genCNN is overall recursive, consisting of CNN-based processing units of two types:  alphaCNN as the ‘front-end’, dealing with the history that is closest to the prediction; betaCNNs (which can repeat), in charge of more ‘ancient’ history.”  Wang, Page 2 Section 2 Last Sentence, discloses “Also distinct from RNN, genCNN gains most of its processing power from the heavy-duty processing units (i.e., αCNN and βCNNs), which follow a bottom-up information flow and yet can adequately capture the temporal structure in word sequence with its convolutional-gating architecture.”  Wang discloses saved as an original alphabet in a plurality of history windows in Page 2 Figure 1:

[Figure: Wang, Page 2 Figure 1 (media_image1.png)]

Here, one can see in the “history” row, the English words, which are part of the “original alphabet” (the finite set of words in the English language as written in Latin characters).
Examiner’s Note:  Here, alphaCNN and betaCNN are the plurality of history windows, and they are saving word sequence information. These windows go back in time progressively further, and are thus reverse chronological history windows.)
apply a function to the observed [sensory] sequence information in each history window, wherein the function maps the observed [sensory] sequence information into a fixed set of discrete classes, wherein the fixed set of discrete classes is the same fixed set of discrete classes across each of the plurality of history windows and an abstract alphabet resulting from the fixed set of discrete classes for each of the plurality of history windows (Recall above Zhong discloses the sequence information is observed sensory sequence information.  For the following, see Wang Page 2 Figure 1 and Page 5 Figure 4:

[Figure: Wang, Page 2 Figure 1 and Page 5 Figure 4 (media_image2.png)]

Wang, Page 5 Section 3.3, discloses:  “As suggested early on in Section 2 and Figure 1, we use extra CNNs with conventional weight-sharing, named βCNN, to summarize the history out of scope of αCNN. More specifically, the output of βCNN (with the same dimension of word-embedding) is put before the first word as the input to the αCNN, as illustrated in Figure 4. Different from αCNN, βCNN is designed just to summarize the history, with weight shared across its convolution units.”  Here, Wang discloses to apply a function (“αCNN” and “βCNN”) to the observed sequence information in each history window.  The function maps the sequence information into a fixed set of discrete classes (“word embedding”, as in “the output of βCNN (with the same dimension of word-embedding)”).  Note that the “fixed set of discrete classes” here is the words of the English language as represented by “word embedding”, which is when words are represented by numeric values in a vector.  Wang Page 6 Section 4 discloses:  “The parameters of a genCNN θ consists of the parameters for CNN θnn, word-embedding θembed, and the parameters for soft-max θsoftmax.”, and on Page 7 Paragraph 1, discloses:  “We use word embedding with dimension 100” and in Para 3:  “For initialization, we use Word2Vec (Mikolov et al., 2013) for the starting state of the word-embeddings”.  There is a finite number of words in the English language, and thus English words represented by word embeddings are a fixed set of discrete classes that do not depend on the size of the βCNN windows. The fixed set of discrete classes is the same fixed set of discrete classes across each of the plurality of history windows, as each window contains discrete classes derived from the same set of discrete classes, that is, words in the English language.  
These embeddings are an abstract representation of English words (an “abstract alphabet”), as opposed to the more common original representation with Latin script, and these words ek (shown in Figure 4 above), in embedding form, are input to each βCNN (“The parameters of a genCNN θ consists of…word-embedding θembed”)).
Wang also teaches make a prediction in the original alphabet based on patterns in the abstract alphabet (Wang, as shown above in the figure showing Fig. 1 and Fig. 4, makes a prediction in the original alphabet (“prediction:  ‘sandwich’”) based on patterns in the abstract alphabet (based on the outputs of the βCNNs being input into the αCNN)).
Zhong and Wang are analogous art because they are both in the field of machine learning.  
Zhong teaches a method to make a prediction based on historical sequence information, as stated in Zhong Page 6 Section 3.3:  “In this simulation, the obtained PB units from the previous recognition experiment were used to generate the predicted movements using the prior knowledge of a specific object.”  Wang also teaches a method of making a prediction based on historical sequence information as stated in Wang Abstract:  “Instead, we use a convolutional neural network to predict the next word with the history of words of variable length”, however Wang adds a plurality of history windows.  Zhong discloses using an RNN as stated in Zhong, Page 1 Abstract:  “This model employs neural network architecture incorporating a predictive sensory module based on an RNNPB (Recurrent Neural Network with Parametric Biases)”, and Wang discloses replacing RNN with their genCNN model as stated in Wang, Page 1 Abstract:  “Different from previous work on neural network-based language modeling and generation (e.g., RNN or LSTM), we choose not to greedily summarize the history of words as a fixed length vector. Instead, we use a convolutional neural network”.  To apply Wang’s replacement of RNN with genCNN to Zhong would result in being able to make a sensory sequence prediction based on a plurality of sensory sequence information windows.  Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Zhong and Wang.  One would have been motivated to do so in order to make more accurate predictions by utilizing more historical data (Wang, Abstract:  “We argue that our model can give adequate representation of the history, and therefore can naturally exploit both the short and long range dependencies”… “Our extensive experiments on text generation and n-best re-ranking in machine translation show that genCNN outperforms the state-of-the-arts with big margins”.  
Wang, Page 6 Section 3.4.2 also discloses the advantages of the convolutional architecture:  “genCNN takes the ‘uncompressed’ history, therefore avoids • the difficulty in finding the representation for history (i.e., unfinished sentences), especially those end in the middle of a chunk (e.g.,“the cat sat on the”) • the damping effort in RNN when the history-summarizing hidden states are updated at each time, which renders the long term memory rather difficult.”)
The combination of Zhong and Wang fails to teach wherein a size of the plurality of history windows increase exponentially, wherein the plurality of history windows do not overlap; choose at least one hyperparameter for each of the plurality of history window to allow the system to trade off bias-variance; use a deep neural network classifier to map arbitrary length histories to an abstract alphabet having smaller length than the original alphabet of the arbitrary length history windows as an input sequence for the context tree weighting algorithm; apply a context tree weighting algorithm to the abstract alphabet resulting from the fixed set of discrete classes for each of the plurality of history windows.
Agashe teaches wherein a size of the plurality of history windows increase exponentially, wherein the plurality of history windows do not overlap. (Agashe Para [0076] discloses “FIGS. 5A and 5B illustrate an example linear buffer 500 representing a time window comprising exponentially-decaying time intervals and a series of associated counters in accordance with an embodiment of the invention. The time window and counters may track the occurrence of a particular type of event involving a particular asset. In FIG. 5A, element 501 represents the most recent 1-minute time interval beginning at 12:44 and ending at 12:45, element 502 represents a previous 2-minute time interval beginning at 12:42 and ending at 12:44, element 503 represents a previous 4-minute time interval beginning at 12:38 and ending at 12:42, and element 504 represents a previous 8-minute time interval beginning at 12:30 and ending at 12:38. Each element stores a counter associated with the time interval representing the element. At 12:30, the time window was created and the linear buffer was initialized. In FIG. 5A, approximately 15 minutes have elapsed between the creation of the time window and the current time, 12:44:23. A counter associated with the element 501 has been incremented 7 times, indicating that 7 events have occurred in the time interval represented by the element 501. A counter associated with the element 502 has been incremented 34 times, indicating that 34 events have occurred in the time interval represented by the element 502. A counter associated with the element 503 has been incremented 50 times, indicating that 50 events have occurred in the time interval representing the element 503. A counter associated with the element 504 has been incremented 72 times, indicating that 72 events have occurred in the time interval representing the element 504.”)

[Figure: Agashe, linear buffer of exponentially-decaying time intervals (media_image3.png)]


Agashe and the combination of Zhong and Wang are analogous art because they are both in the field of machine learning. 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Agashe with the combination of Zhong and Wang to include saving sequence information in exponentially increasing history windows.  One would have been motivated to do so in order to be able to make more accurate predictions based on more data, while saving on storage resources (Agashe, Para [0075]:  “Exponentially decaying time intervals may allow precise tracking of recent events while conserving storage capacity by passing a record of the occurrence of events to progressively larger time intervals as time progresses and the events become less recent”). 
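For illustration only, reverse chronological, non-overlapping history windows whose sizes increase exponentially can be sketched as follows. This is a minimal sketch and not code from Agashe or from the instant application; the function name exponential_windows is hypothetical, and the history is assumed to be a list ordered oldest first. The sample input stands in for the 15 one-minute slots between 12:30 and 12:45 in the Agashe passage quoted above.

```python
def exponential_windows(history, num_windows):
    """Split `history` (oldest first) into windows of sizes 1, 2, 4, ...

    The first window returned is the most recent one; each later window reaches
    further back in time and never overlaps the windows before it.
    """
    windows = []
    end = len(history)
    for k in range(num_windows):
        size = 2 ** k
        start = max(end - size, 0)
        windows.append(history[start:end])
        end = start  # the next window ends where this one began, so no overlap
        if end == 0:
            break  # the history is exhausted
    return windows


# Hypothetical history: minute offsets 0..14 after 12:30 (i.e., 12:30 through 12:44).
minutes = list(range(15))
windows = exponential_windows(minutes, num_windows=4)
# windows[0] covers the most recent minute; windows[3] covers the oldest 8 minutes.
```

With this input the four windows hold 1, 2, 4, and 8 entries, mirroring the 1-minute, 2-minute, 4-minute, and 8-minute intervals of elements 501 through 504.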
The combination of Zhong, Wang, and Agashe fails to teach choose at least one hyperparameter for each of the plurality of history window to allow the system to trade off bias-variance; use a deep neural network classifier to map arbitrary length histories to an abstract alphabet having smaller length than the original alphabet of the arbitrary length history windows as an input sequence for the context tree weighting algorithm; apply a context tree weighting algorithm to the abstract alphabet resulting from the fixed set of discrete classes for each of the plurality of history windows. 
Tjalkens teaches to apply a context tree weighting algorithm to an [abstract] alphabet [resulting from the fixed set of discrete classes for each of the plurality of history windows] (Recall above that Wang teaches an abstract alphabet resulting from the fixed set of discrete classes for each of the plurality of history windows.  Tjalkens, Page 128 Introduction, discloses:  “This context tree weighting (CTW) algorithm achieves the asymptotically optimal redundancy behavior for the class of FSMX sources…We will discuss here the compression of multi-alphabet sources.  First we extend the CTW algorithm…to the non-binary case.”  Furthermore, on Page 129, Section 2 entitled “The multi-alphabet context tree weighting algorithm”, it begins:  “Let us consider the case where the source alphabet, A, is non-binary, i.e. |A| > 2.”  Finally, on Page 133 Section 5, Tjalkens states:  “We have seen that it is possible to use the context tree weighting algorithm for multi-alphabet sources and it still shows the optimal redundancy behavior.”  Thus, Tjalkens discloses applying CTW to an alphabet (A) that is a fixed set of discrete classes (|A| classes)).
Tjalkens and the combination of Zhong, Wang, and Agashe are analogous art because they are both in the field of machine learning.  
The combination of Zhong, Wang, and Agashe is a sequence prediction algorithm (recall in the motivation statement for the combination of Wang and Zhong that both of these works are concerned with sequence prediction), and Tjalkens’ multi-alphabet CTW algorithm is concerned with compression, which is known in the art as being an equivalent problem to sequence prediction.  Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the multi-alphabet CTW algorithm of Tjalkens with the combination of Zhong, Wang, and Agashe.  One would have been motivated to do so to improve performance of the sequence prediction (Tjalkens, Pg. 128 Abstract:  “We report several results on this problem and describe some algorithms that realize an improved and even asymptotically optimal redundancy behavior.”)
The combination of Zhong, Wang, Agashe, and Tjalkens fails to teach choose at least one hyperparameter for each of the plurality of history window to allow the system to trade off bias-variance; use a deep neural network classifier to map arbitrary length histories to an abstract alphabet having smaller length than the original alphabet of the arbitrary length history windows as an input sequence for the context tree weighting algorithm.
Zaheer teaches to use a deep neural network classifier to map [arbitrary length] histories to a second alphabet having smaller length than the alphabet of the [arbitrary length] histories [as an input sequence for the context tree weighting algorithm]. (Zaheer, Page 1 Abstract, discloses:  “In this paper, we introduce Latent LSTM Allocation (LLA) for user modeling combining hierarchical Bayesian models with LSTMs. In LLA, each user is modeled as a sequence of actions, and the model jointly groups actions into topics and learns the temporal dynamics over the topic sequence, instead of action space directly.”  Zaheer, Page 8 Section 5, discloses:  “We achieve this by shifting from modeling the temporal dynamics at the observed word level to modeling the dynamics at a higher level of abstraction: topics. As the number of topics K is much smaller than the number of words V , it can act as a knob that can trade-off accuracy vs model size.”  Zaheer also discloses that this is used for sequence prediction in Page 2 Section 3:  “Lastly and most importantly, the model should be accurate in terms of predicting future events. We show how LLA satisfies all of these requirements.”  Thus, Zaheer discloses a system comprising a deep neural network (“LSTM”) as part of a classifier (makes a prediction of the next value of a sequence, which is a discrete set of classes), that maps histories to a second alphabet (“topics K”) having smaller length than the alphabet of the histories (“words V”).  Zaheer Page 2 Section 2.2 also explicitly refers to classification:  “LSTM, a type of RNN, is well suited for the task as it can learn from experience to classify, process, and predict time series when there are very long time lags of unknown size between important events.”)
Zaheer and the combination of Zhong, Wang, Agashe, and Tjalkens are analogous art because they are all in the field of machine learning.  
The combination of Zhong, Wang, Agashe, and Tjalkens is a sequence prediction algorithm, and Zaheer is also concerned with sequence prediction (Zaheer Page 2 Section 3:  “Lastly and most importantly, the model should be accurate in terms of predicting future events. We show how LLA satisfies all of these requirements.”)  Therefore, it would have been obvious before the effective filing date of the claimed invention to combine Zaheer’s smaller abstract alphabet with the sequence prediction algorithm of Zhong, Wang, Agashe, and Tjalkens.  One of ordinary skill in the art would be motivated to do so in order to build a smaller more efficient model that can still capture important temporal properties (Zaheer, Page 1 Intro:  “The increase in complexity and parameters arises due to a large action space in which many of the actions have similar intent or topic… learns the temporal dynamics over the topic sequence, instead of action space directly. This leads to a model that is highly interpretable, concise, and can capture intricate dynamics” and Zaheer Page 8 Section 5:  “As the number of topics K is much smaller than the number of words V , it can act as a knob that can trade-off accuracy vs model size…Furthermore, the topics provide an informative embedding that can reveal interesting temporal relationship as shown in Figure 6 and 7 – which is a novel contribution to the best of our knowledge.”)
The combination of Zhong, Wang, Agashe, Tjalkens, and Zaheer fails to teach choose at least one hyperparameter for each of the plurality of history windows to allow the system to trade off bias-variance.
Thorhallsson teaches choose at least one hyperparameter [for each of the plurality of history windows] to allow the system to trade off bias-variance.  (Thorhallsson, Intro Para 2, discloses that “This leads to a fundamental tradeoff known as the bias-variance tradeoff which is of paramount importance for optimal choice of the hyperparameters for the learning algorithms”).  *Wang, Section 2 Sentence 1, discloses “As shown in Figure 1, genCNN is overall recursive, consisting of CNN-based processing units of two types:  alphaCNN as the ‘front-end’, dealing with the history that is closest to the prediction; betaCNNs (which can repeat), in charge of more ‘ancient’ history” (i.e., a plurality of history windows comprising CNNs, and CNNs are a machine learning model and therefore have a bias-variance tradeoff).
Thorhallsson and the combination of Zhong, Wang, Agashe, Tjalkens, and Zaheer are analogous art because they are all in the field of machine learning.  
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Thorhallsson with the combination of Zhong, Wang, Agashe, Tjalkens, and Zaheer to include selecting the optimal choice of hyperparameters, for which the bias-variance tradeoff is of paramount importance.  One would have been motivated to do so to “capture sophisticated relationships in the data while keeping it simple to prevent noise from affecting the outcome” (Thorhallsson, Page 1 Intro, Paragraph 2).

As per Claim 11, the combination of Zhong, Wang, Agashe, Tjalkens, Zaheer, and Thorhallsson teaches the artificial intelligence system of claim 10. Zaheer teaches wherein a long short-term memory-based sequence to symbol method is used to map the [arbitrary length] histories to a second alphabet having smaller length than the alphabet of the [arbitrary length] histories. (Zaheer, as shown above in the rejection to Claim 5, maps histories to a smaller alphabet, and performs sequence prediction, using an LSTM.  Thus, Zaheer discloses using a long short-term memory based (“LSTM”) sequence to symbol (“sequence prediction”) method to map the histories to a second alphabet having smaller length than the alphabet of the histories. Zaheer, Page 3 end of Section 3.1 also states:  “The LSTM output represents topic proportions for document/user at time t. The LSTM input over topics can capture semantic notion of topics.”)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Zaheer with the combination of Zhong, Wang, Agashe, Tjalkens, and Thorhallsson for at least the reasons recited in Claim 10.

As per Claim 12, the combination of Zhong, Wang, Agashe, Tjalkens, Zaheer, and Thorhallsson teaches the artificial intelligence system of claim 10.  Tjalkens suggests wherein the observed [sensory] sequence information is a binary event (Tjalkens, Page 128 Abstract, discloses:  “The redundancy term is linear in the number of free parameters, which for binary sources equals the number of states in the source.” Here, Tjalkens suggests that the original CTW algorithm is for binary sources, before extending the algorithm to non-binary sources.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Tjalkens with Zhong, Wang, Agashe, Zaheer, and Thorhallsson for at least the reasons recited in Claim 10.

As per Claim 13, the combination of Zhong, Wang, Agashe, Tjalkens, Zaheer, and Thorhallsson teaches the artificial intelligence system of claim 10.  Wang teaches wherein the instructions cause the at least one processor to perform a temporal convolution in a deep neural network to map observed sensory sequence information from the plurality of history windows to symbols. (Wang, Page 2 Section 2 Sentence 1, discloses “As shown in Figure 1, genCNN is overall recursive, consisting of CNN-based processing units of two types:  alphaCNN as the ‘front-end’, dealing with the history that is closest to the prediction; betaCNNs (which can repeat), in charge of more ‘ancient’ history.”  Wang, Section 2, defines each of the history windows as Convolutional Neural Networks (alpha-CNN and beta-CNN), which “capture the temporal structure” of the sequence.  Wang Page 3 Figure 2 and Page 5 Figure 4 illustrate that each CNN makes a prediction of the next word (i.e., symbol)).  
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Wang with the combination of Zhong, Agashe, Tjalkens, Zaheer, and Thorhallsson for at least the reasons recited in Claim 10.

Claims 14 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Zhong, Wang, Agashe, Tjalkens, Zaheer, and Thorhallsson further in view of Pan.  
As per Claim 14, the combination of Zhong, Wang, Agashe, Tjalkens, Zaheer, and Thorhallsson teaches the artificial intelligence system of claim 10 as well as temporal convolution and time steps in each of the plurality of history windows (see Rejection to Claim 10).  The combination of Zhong, Wang, Agashe, Tjalkens, Zaheer, and Thorhallsson fails to teach wherein the temporal convolution includes defining each of the plurality of history windows of events as a 2^k-by-n matrix, where 2^k is a number of time steps in each of the plurality of history windows and n is a number of events at each of the time steps, and applying a convolution that is an l-by-n matrix, where l is less than 2^k, wherein a set of the convolutions produces a new set of events.
Pan teaches wherein the [temporal] convolution includes defining each [of the plurality of history windows] of events as a (2^k)-by-n matrix, where 2^k is a number of [time steps in each of the plurality of history windows] and n is a number of events [at each of the time steps], and applying a convolution that is an l-by-n matrix, where l is less than 2^k, wherein a set of the convolutions produces a new set of events. (Pan Para [0054] discloses an “n*m feature matrix” in which one dimension “m is the quantity of sub time periods” (i.e., time steps) and the other dimension “n is the quantity of feature types” (i.e., events).  Pan Para [0060] discloses that  “The input layer inputs each sample (n*m feature matrix) to a convolutional layer for convolution.”  Pan further discloses “convolution kernel quantity and size can be specified as needed” (Examiner’s note:  a convolution kernel is a matrix) wherein “at least one of a row quantity or a column quantity of the convolution kernel can be a predetermined quantity of feature types”, which is analogous to the instant application, where “n” is the column dimension for both the 2^k-by-n matrix and the l-by-n matrix (Pan describes an “n-by-m feature matrix”, and “a size of each convolution kernel can be n*j”).  Pan further discloses that the remaining dimension of the convolution kernel must be less than the corresponding dimension of the feature matrix (“size of each convolution kernel can be n*j, where j is a positive integer less than m”).  This is analogous to the instant application, where l must be less than 2^k.  Pan Para [0060] further states that “convolutional layer can output 100s feature graphs.” (i.e. the output is a new set of events)).
Pan and the combination of Zhong, Wang, Agashe, Tjalkens, Zaheer, and Thorhallsson are analogous art because they are all in the field of machine learning.  
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Pan with the combination of Zhong, Wang, Agashe, Tjalkens, Zaheer, and Thorhallsson to include a convolution kernel wherein at least one of a row quantity or a column quantity of the convolution kernel can be a predetermined quantity of feature types (i.e., events).  One would have been motivated to do so “because feature types (i.e. events) are not continuous, the convolution kernel does not need to be scanned in a distribution direction of each feature type” (Pan, Para [0060], Sentence 3).
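For illustration only, the dimensional relationship discussed above (a (2^k)-by-n window, an l-by-n kernel with l less than 2^k, and a kernel that slides along the time axis only) can be sketched in Python.  The sketch is not taken from Pan or from the instant application; the function name temporal_convolution, the example values of k, n, and l, and the difference-filter kernel are assumptions chosen for the example.

```python
def temporal_convolution(window, kernel):
    """Slide an l-by-n kernel along the time axis of a (2^k)-by-n window.

    The kernel spans every event column, so it is never shifted across
    event types; it only moves from one time step to the next.
    """
    steps, n = len(window), len(window[0])
    l = len(kernel)
    if any(len(row) != n for row in kernel):
        raise ValueError("kernel must span all n event columns")
    if l >= steps:
        raise ValueError("kernel length l must be less than the number of time steps")
    outputs = []
    for t in range(steps - l + 1):
        acc = 0.0
        for i in range(l):
            for j in range(n):
                acc += window[t + i][j] * kernel[i][j]
        outputs.append(acc)
    return outputs


# Hypothetical example values: 2^3 = 8 time steps, n = 2 events per step, l = 2.
k, n, l = 3, 2, 2
window = [[float(t), 1.0] for t in range(2 ** k)]
kernel = [[1.0, 0.0], [-1.0, 0.0]]  # assumed difference filter on the first event column
result = temporal_convolution(window, kernel)
print(len(result), result)
```

Each kernel yields 2^k - l + 1 output values, one per valid position along the time axis, so a set of kernels yields a new set of values per time position.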

As per Claim 15, Zhong teaches an artificial intelligence system, comprising: a computing device including at least one processor, one or more data storage devices, and a non-transitory data storage medium interfaced with the at least one processor, the non-transitory data storage medium containing instructions that, when executed cause the at least one processor to (Zhong, Abstract, discloses:  “This model employs neural network architecture incorporating a predictive sensory module based on an RNNPB (Recurrent Neural Network with Parametric Biases) and a horizontal product model.”  Here, Zhong discloses an artificial intelligence system.  Zhong, last sentence of Section 1, discloses:  “Therefore it is sufficient to serve as a neural model that is running in a robot CPU to explain the language development in the cortical areas.” Here, Zhong discloses a computing device including at least one processor (a robot with a CPU), which also implies one or more data storage devices, and a non-transitory data storage medium interfaced with the at least one processor, the non-transitory data storage medium containing instructions that cause the at least one processor to execute.)
save observed sensory sequence information (Zhong, Page 3 Lines 4-10, discloses:  “A ball was attached at the end of the arm so that another robot could obtain the movement by tracking the ball. Our neural network was trained and run on the other NAO, which was called the “observer”. In this way, the observer robot perceived the object movement from its vision passively, so that its network took the object’s visual features and the movements into account”.  Here, Zhong discloses observed sensory sequence information, as the “observer robot” “perceived the object movement from its vision”, wherein “perceived” is another word for observed, and “vision” is sensory information, and “movement” implies this is over time, and thus a sequence of observed images.  Zhong, Section 2.2 also discloses “error back-propagated from a certain sensory information sequence to the PB units”.  While it is likely implied that the information is saved, Zhong explicitly recites saving it (“storing”) in Conclusion Para 2 Lines 4-5: “These PB units allow for storing multiple sensory sequences.”)
Zhong fails to teach that the observed sensory sequence information is saved, as an original alphabet in a plurality of history windows, the plurality of history windows being reverse chronological history windows, wherein a size of the plurality of history windows increase exponentially, wherein the plurality of history windows do not overlap; apply a function to the observed sensory sequence information in each history window, wherein the function maps the observed sensory sequence information into a fixed set of discrete classes, wherein the fixed set of discrete classes is the same fixed set of discrete classes across each of the plurality of history windows; choose at least one hyperparameter for each of the plurality of history windows to allow the system to trade off bias-variance; perform a temporal convolution in a deep neural network to map observed sensory sequence information from the plurality of history windows to the abstract alphabet, wherein the temporal convolution includes defining each of the plurality of history windows of events as a (2^k)-by-n matrix, where 2^k is a number of time steps in each of the plurality of history windows and n is a number of events at each of the time steps, and applying a convolution that is an l-by-n matrix, where l is less than 2^k, wherein a set of the convolutions produces a new set of events, wherein the set of the convolutions includes a predetermined number of convolution layers, wherein the temporal convolution determines a maximum of the resulting events and ensures the fixed set of discrete classes is independent from the size of each of the plurality of history windows; apply a context tree weighting algorithm to the abstract alphabet resulting from the fixed set of discrete classes for each of the plurality of history windows to predict a future discrete sequence, wherein the abstract alphabet is smaller in size than the original alphabet; and make a prediction in the original alphabet based on patterns in the abstract alphabet.
Wang teaches that the [observed sensory] sequence information is saved as an original alphabet, in a plurality of history windows, the plurality of history windows being reverse chronological history windows. (Recall above Zhong discloses the sequence information is observed sensory sequence information.  Wang, Page 2 Section 2 Sentence 1, discloses “As shown in Figure 1, genCNN is overall recursive, consisting of CNN-based processing units of two types:  alphaCNN as the ‘front-end’, dealing with the history that is closest to the prediction; betaCNNs (which can repeat), in charge of more ‘ancient’ history.”  Wang, Page 2 Section 2 Last Sentence, discloses “Also distinct from RNN, genCNN gains most of its processing power from the heavy-duty processing units (i.e., CNN and CNNs), which follow a bottom-up information flow and yet can adequately capture the temporal structure in word sequence with its convolutional-gating architecture.”  Wang discloses saved as an original alphabet in a plurality of history windows in Page 2 Figure 1:

    [Image: Wang, Page 2, Figure 1]

Here, one can see in the “history” row, the English words, which are part of the “original alphabet” (the finite set of words in the English language as written in Latin characters).
Examiner’s Note:  Here, alphaCNN and betaCNN are the plurality of history windows, and they are saving word sequence information. These windows go back in time progressively further, and are thus reverse chronological history windows.)
apply a function to the observed [sensory] sequence information in each history window, wherein the function maps the observed [sensory] sequence information into a fixed set of discrete classes, wherein the fixed set of discrete classes is the same fixed set of discrete classes across each of the plurality of history windows and an abstract alphabet resulting from the fixed set of discrete classes for each of the plurality of history windows (Recall above Zhong discloses the sequence information is observed sensory sequence information.  For the following, see Wang Page 2 Figure 1 and Page 5 Figure 4:

    [Image: Wang, Page 2 Figure 1 and Page 5 Figure 4]

Wang, Page 5 Section 3.3, discloses:  “As suggested early on in Section 2 and Figure 1, we use extra CNNs with conventional weight-sharing, named βCNN, to summarize the history out of scope of αCNN. More specifically, the output of βCNN (with the same dimension of word-embedding) is put before the first word as the input to the αCNN, as illustrated in Figure 4. Different from αCNN, βCNN is designed just to summarize the history, with weight shared across its convolution units.”  Here, Wang discloses to apply a function (“αCNN” and “βCNN”) to the observed sequence information in each history window.  The function maps the sequence information into a fixed set of discrete classes (“word embedding”, as in “the output of βCNN (with the same dimension of word-embedding)”).  Note that the “fixed set of discrete classes” here is the words of the English language as represented by “word embedding”, which is when words are represented by numeric values in a vector.  Wang Page 6 Section 4 discloses:  “The parameters of a genCNN θ consists of the parameters for CNN θnn, word-embedding θembed, and the parameters for soft-max θsoftmax.”, and on Page 7 Paragraph 1, discloses:  “We use word embedding with dimension 100” and in Para 3:  “For initialization, we use Word2Vec (Mikolov et al., 2013) for the starting state of the word-embeddings”.  There is a finite number of words in the English language, and thus English words represented by word embeddings are a fixed set of discrete classes that do not depend on the size of the βCNN windows. The fixed set of discrete classes is the same fixed set of discrete classes across each of the plurality of history windows, as each window contains discrete classes derived from the same set of discrete classes, that is, words in the English language.  
These embeddings are an abstract representation of English words (an “abstract alphabet”), as opposed to the more common original representation with Latin script, and these words ek (shown in Figure 4 above), in embedding form, are input to each βCNN (“The parameters of a genCNN θ consists of…word-embedding θembed”).
Wang also teaches make a prediction in the original alphabet based on patterns in the abstract alphabet (Wang, as shown above in the figure showing Fig. 1 and Fig. 4, makes a prediction in the original alphabet (“prediction:  ‘sandwich’”) based on patterns in the abstract alphabet (based on the outputs of the βCNNs being input into the αCNN)).
Zhong and Wang are analogous art because they are both in the field of machine learning.  
Zhong teaches a method to make a prediction based on historical sequence information, as stated in Zhong Page 6 Section 3.3:  “In this simulation, the obtained PB units from the previous recognition experiment were used to generate the predicted movements using the prior knowledge of a specific object.”  Wang also teaches a method of making a prediction based on historical sequence information as stated in Wang Abstract:  “Instead, we use a convolutional neural network to predict the next word with the history of words of variable length”, however Wang adds a plurality of history windows.  Zhong discloses using an RNN as stated in Zhong, Page 1 Abstract:  “This model employs neural network architecture incorporating a predictive sensory module based on an RNNPB (Recurrent Neural Network with Parametric Biases)”, and Wang discloses replacing RNN with their genCNN model as stated in Wang, Page 1 Abstract:  “Different from previous work on neural network-based language modeling and generation (e.g., RNN or LSTM), we choose not to greedily summarize the history of words as a fixed length vector. Instead, we use a convolutional neural network”.  To apply Wang’s replacement of RNN with genCNN to Zhong would result in being able to make a sensory sequence prediction based on a plurality of sensory sequence information windows.  Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Zhong and Wang.  One would have been motivated to do so in order to make more accurate predictions by utilizing more historical data (Wang, Abstract:  “We argue that our model can give adequate representation of the history, and therefore can naturally exploit both the short and long range dependencies”… “Our extensive experiments on text generation and n-best re-ranking in machine translation show that genCNN outperforms the state-of-the-arts with big margins”.  
Wang, Page 6 Section 3.4.2 also discloses the advantages of the convolutional architecture:  “genCNN takes the ‘uncompressed’ history, therefore avoids • the difficulty in finding the representation for history (i.e., unfinished sentences), especially those end in the middle of a chunk (e.g.,“the cat sat on the”) • the damping effort in RNN when the history-summarizing hidden states are updated at each time, which renders the long term memory rather difficult.”)
The combination of Zhong and Wang fails to teach wherein a size of the plurality of history windows increase exponentially, wherein the plurality of history windows do not overlap; choose at least one hyperparameter for each of the plurality of history windows to allow the system to trade off bias-variance; wherein the temporal convolution includes defining each of the plurality of history windows of events as a (2^k)-by-n matrix, where 2^k is a number of time steps in each of the plurality of history windows and n is a number of events at each of the time steps, and applying a convolution that is an l-by-n matrix, where l is less than 2^k, wherein a set of the convolutions produces a new set of events, wherein the set of the convolutions includes a predetermined number of convolution layers, wherein the temporal convolution determines a maximum of the resulting events and ensures the fixed set of discrete classes is independent from the size of each of the plurality of history windows; apply a context tree weighting algorithm to the abstract alphabet resulting from the fixed set of discrete classes for each of the plurality of history windows to predict a future discrete sequence, wherein the abstract alphabet is smaller in size than the original alphabet.
Agashe teaches wherein a size of the plurality of history windows increase exponentially, wherein the plurality of history windows do not overlap. (Agashe Para [0076] discloses “FIGS. 5A and 5B illustrate an example linear buffer 500 representing a time window comprising exponentially-decaying time intervals and a series of associated counters in accordance with an embodiment of the invention. The time window and counters may track the occurrence of a particular type of event involving a particular asset. In FIG. 5A, element 501 represents the most recent 1-minute time interval beginning at 12:44 and ending at 12:45, element 502 represents a previous 2-minute time interval beginning at 12:42 and ending at 12:44, element 503 represents a previous 4-minute time interval beginning at 12:38 and ending at 12:42, and element 504 represents a previous 8-minute time interval beginning at 12:30 and ending at 12:38. Each element stores a counter associated with the time interval representing the element. At 12:30, the time window was created and the linear buffer was initialized. In FIG. 5A, approximately 15 minutes have elapsed between the creation of the time window and the current time, 12:44:23. A counter associated with the element 501 has been incremented 7 times, indicating that 7 events have occurred in the time interval represented by the element 501. A counter associated with the element 502 has been incremented 34 times, indicating that 34 events have occurred in the time interval represented by the element 502. A counter associated with the element 503 has been incremented 50 times, indicating that 50 events have occurred in the time interval representing the element 503. A counter associated with the element 504 has been incremented 72 times, indicating that 72 events have occurred in the time interval representing the element 504.”)

    [Image: Agashe, linear buffer 500 of Para [0076]]


Agashe and the combination of Zhong and Wang are analogous art because they are both in the field of machine learning. 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Agashe with the combination of Zhong and Wang to include saving sequence information in exponentially increasing history windows.  One would have been motivated to do so in order to be able to make more accurate predictions based on more data, while saving on storage resources (Agashe, Para [0075]:  “Exponentially decaying time intervals may allow precise tracking of recent events while conserving storage capacity by passing a record of the occurrence of events to progressively larger time intervals as time progresses and the events become less recent”). 
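For illustration only, the layout of non-overlapping, exponentially growing windows quoted from Agashe Para [0076] (intervals of 1, 2, 4, and 8 minutes going back from the current time) can be sketched in Python.  The sketch is not code from Agashe; the function name exponential_windows and the representation of each window as a (start, end) pair of offsets before the current time are assumptions chosen for the example.

```python
def exponential_windows(count):
    """Return (start, end) offsets, in time units before 'now', for each window.

    Window k has size 2**k and begins where window k-1 ends, so the
    windows run in reverse chronological order and never overlap.
    """
    windows = []
    start = 0
    for k in range(count):
        size = 2 ** k
        windows.append((start, start + size))
        start += size
    return windows


# Four windows reproduce the 1-, 2-, 4-, and 8-minute intervals of the quoted example.
layout = exponential_windows(4)
print(layout)
```

With four windows the layout covers 15 time units in total, which corresponds to the span from 12:30 to 12:45 in the quoted passage.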
The combination of Zhong, Wang, and Agashe fails to teach choose at least one hyperparameter for each of the plurality of history windows to allow the system to trade off bias-variance; wherein the temporal convolution includes defining each of the plurality of history windows of events as a (2^k)-by-n matrix, where 2^k is a number of time steps in each of the plurality of history windows and n is a number of events at each of the time steps, and applying a convolution that is an l-by-n matrix, where l is less than 2^k, wherein a set of the convolutions produces a new set of events, wherein the set of the convolutions includes a predetermined number of convolution layers, wherein the temporal convolution determines a maximum of the resulting events and ensures the fixed set of discrete classes is independent from the size of each of the plurality of history windows; apply a context tree weighting algorithm to the abstract alphabet resulting from the fixed set of discrete classes for each of the plurality of history windows to predict a future discrete sequence, wherein the abstract alphabet is smaller in size than the original alphabet.
Tjalkens teaches and apply a context tree weighting algorithm to an [abstract] alphabet [resulting from the fixed set of discrete classes for each of the plurality of history windows] (Recall above that Wang teaches an abstract alphabet resulting from the fixed set of discrete classes for each of the plurality of history windows.  Tjalkens, Page 128 Introduction, discloses:  “This context tree weighting (CTW) algorithm achieves the asymptotically optimal redundancy behavior for the class of FSMX sources…We will discuss here the compression of multi-alphabet sources.  First we extend the CTW algorithm…to the non-binary case.”  Furthermore, on Page 129, Section 2 entitled “The multi-alphabet context tree weighting algorithm”, it begins:  “Let us consider the case where the source alphabet, A, is non-binary, i.e. |A| > 2.”  Finally, on Page 133 Section 5, Tjalkens states:  “We have seen that it is possible to use the context tree weighting algorithm for multi-alphabet sources and it still shows the optimal redundancy behavior.”  Thus, Tjalkens discloses applying CTW to an alphabet (A) that is a fixed set of discrete classes (|A| classes).)
Tjalkens and the combination of Zhong, Wang, and Agashe are analogous art because they are both in the field of machine learning.  
The combination of Zhong, Wang, and Agashe is a sequence prediction algorithm (recall in the motivation statement for the combination of Wang and Zhong that both of these works are concerned with sequence prediction), and Tjalkens’ multi-alphabet CTW algorithm is concerned with compression, which is known in the art as being an equivalent problem to sequence prediction.  Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the multi-alphabet CTW algorithm of Tjalkens with the combination of Zhong, Wang, and Agashe.  One would have been motivated to do so to improve performance of the sequence prediction (Tjalkens, Pg. 128 Abstract:  “We report several results on this problem and describe some algorithms that realize an improved and even asymptotically optimal redundancy behavior.”)
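For illustration only, the general idea of context tree weighting over a non-binary alphabet can be sketched in Python.  The sketch is a generic textbook-style construction and is not asserted to be the specific algorithm or estimator of Tjalkens: the class name ContextTreeWeighting, the add-one-half (Dirichlet) estimator at each node, the equal 1/2 weights, and the padding of the unseen past with symbol 0 are all assumptions made for the example.

```python
from itertools import product


class ContextTreeWeighting:
    """Depth-limited context tree weighting over symbols 0..alphabet_size-1."""

    def __init__(self, alphabet_size, depth):
        self.m = alphabet_size
        self.depth = depth
        self.counts = {}  # context -> per-symbol counts seen after that context
        self.pe = {}      # context -> estimator probability of those symbols
        self.pw = {}      # context -> weighted probability of the subtree
        self.history = [0] * depth  # assumption: unseen past is padded with 0

    def update(self, symbol):
        """Absorb one symbol and return the probability of the whole sequence so far."""
        recent = self.history[len(self.history) - self.depth:]
        context = tuple(reversed(recent))  # most recent symbol first
        for d in range(self.depth, -1, -1):  # deepest node first, root last
            node = context[:d]
            counts = self.counts.setdefault(node, [0] * self.m)
            step = (counts[symbol] + 0.5) / (sum(counts) + self.m / 2)
            self.pe[node] = self.pe.get(node, 1.0) * step
            counts[symbol] += 1
            if d == self.depth:
                self.pw[node] = self.pe[node]
            else:
                children = 1.0
                for a in range(self.m):
                    children *= self.pw.get(node + (a,), 1.0)
                self.pw[node] = 0.5 * self.pe[node] + 0.5 * children
        self.history.append(symbol)
        return self.pw[()]


def sequence_probability(sequence, alphabet_size, depth):
    model = ContextTreeWeighting(alphabet_size, depth)
    p = 1.0
    for symbol in sequence:
        p = model.update(symbol)
    return p


p = sequence_probability([0, 1, 2, 0, 1, 2], alphabet_size=3, depth=2)
# Sanity check: the probabilities of all length-4 sequences sum to 1 up to rounding.
total = sum(sequence_probability(list(s), 3, 2) for s in product(range(3), repeat=4))
print(p, total)
```

A prediction for the next symbol follows from the ratio of the sequence probability with and without that symbol appended, which is how a compression model of this kind doubles as a sequence predictor.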
The combination of Zhong, Wang, Agashe, and Tjalkens fails to teach choose at least one hyperparameter for each of the plurality of history windows to allow the system to trade off bias-variance; wherein the temporal convolution includes defining each of the plurality of history windows of events as a (2^k)-by-n matrix, where 2^k is a number of time steps in each of the plurality of history windows and n is a number of events at each of the time steps, and applying a convolution that is an l-by-n matrix, where l is less than 2^k, wherein a set of the convolutions produces a new set of events, wherein the set of the convolutions includes a predetermined number of convolution layers, wherein the temporal convolution determines a maximum of the resulting events and ensures the fixed set of discrete classes is independent from the size of each of the plurality of history windows; wherein the abstract alphabet is smaller in size than the original alphabet.
Zaheer teaches wherein the abstract alphabet is smaller in size than the original alphabet.  (Zaheer, Page 1 Abstract, discloses:  “In this paper, we introduce Latent LSTM Allocation (LLA) for user modeling combining hierarchical Bayesian models with LSTMs. In LLA, each user is modeled as a sequence of actions, and the model jointly groups actions into topics and learns the temporal dynamics over the topic sequence, instead of action space directly.”  Zaheer, Page 8 Section 5, discloses:  “We achieve this by shifting from modeling the temporal dynamics at the observed word level to modeling the dynamics at a higher level of abstraction: topics. As the number of topics K is much smaller than the number of words V , it can act as a knob that can trade-off accuracy vs model size.”  Here, Zaheer discloses an abstract alphabet (“topics K”) that is smaller in size than the original alphabet (“words V”)).
Zaheer and the combination of Zhong, Wang, Agashe, and Tjalkens are analogous art because they are all in the field of machine learning.  
The combination of Zhong, Wang, Agashe, and Tjalkens is a sequence prediction algorithm, and Zaheer is also concerned with sequence prediction (Zaheer Page 2 Section 3:  “Lastly and most importantly, the model should be accurate in terms of predicting future events. We show how LLA satisfies all of these requirements.”)  Therefore, it would have been obvious before the effective filing date of the claimed invention to combine Zaheer’s smaller abstract alphabet with the sequence prediction algorithm of Zhong, Wang, Agashe, and Tjalkens.  One of ordinary skill in the art would be motivated to do so in order to build a smaller more efficient model that can still capture important temporal properties (Zaheer, Page 1 Intro:  “The increase in complexity and parameters arises due to a large action space in which many of the actions have similar intent or topic… learns the temporal dynamics over the topic sequence, instead of action space directly. This leads to a model that is highly interpretable, concise, and can capture intricate dynamics” and Zaheer Page 8 Section 5:  “As the number of topics K is much smaller than the number of words V , it can act as a knob that can trade-off accuracy vs model size…Furthermore, the topics provide an informative embedding that can reveal interesting temporal relationship as shown in Figure 6 and 7 – which is a novel contribution to the best of our knowledge.”)
The combination of Zhong, Wang, Agashe, Tjalkens, and Zaheer fails to teach: choose at least one hyperparameter for each of the plurality of history windows to allow the system to trade off bias-variance; wherein the temporal convolution includes defining each of the plurality of history windows of events as a (2^k)-by-n matrix, where 2^k is a number of time steps in each of the plurality of history windows and n is a number of events at each of the time steps, and applying a convolution that is an l-by-n matrix, where l is less than 2^k, wherein a set of the convolutions produces a new set of events, wherein the set of the convolutions includes a predetermined number of convolution layers, wherein the temporal convolution determines a maximum of the resulting events and ensures the fixed set of discrete classes is independent from the size of each of the plurality of history windows.
Thorhallsson teaches choose at least one hyperparameter [for each of the plurality of history windows] to allow the system to trade off bias-variance.  (Thorhallsson, Intro Para 2, discloses that “This leads to a fundamental tradeoff known as the bias-variance tradeoff which is of paramount importance for optimal choice of the hyperparameters for the learning algorithms”).  *Wang, Section 2 Sentence 1, discloses “As shown in Figure 1, genCNN is overall recursive, consisting of CNN-based processing units of two types:  alphaCNN as the ‘front-end’, dealing with the history that is closest to the prediction; betaCNNs (which can repeat), in charge of more ‘ancient’ history” (i.e., a plurality of history windows comprising CNNs, and CNNs are a machine learning model and therefore have a bias-variance tradeoff).
Thorhallsson and the combination of Zhong, Wang, Agashe, Tjalkens, and Zaheer are analogous art because they are all in the field of machine learning.  
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Thorhallsson with the combination of Zhong, Wang, Agashe, Tjalkens, and Zaheer to include selecting the optimal choice of hyperparameters, for which the bias-variance tradeoff is of paramount importance.  One would have been motivated to do so to “capture sophisticated relationships in the data while keeping it simple to prevent noise from affecting the outcome” (Thorhallsson, Page 1 Intro, Paragraph 2).
The combination of Zhong, Wang, Agashe, Tjalkens, Zaheer, and Thorhallsson fails to teach wherein the temporal convolution includes defining each of the plurality of history windows of events as a (2^k)-by-n matrix, where 2^k is a number of time steps in each of the plurality of history windows and n is a number of events at each of the time steps, and applying a convolution that is an l-by-n matrix, where l is less than 2^k, wherein a set of the convolutions produces a new set of events, wherein the set of the convolutions includes a predetermined number of convolution layers, wherein the temporal convolution determines a maximum of the resulting events and ensures the fixed set of discrete classes is independent from the size of each of the plurality of history windows.
Pan teaches wherein the [temporal] convolution includes defining each [of the plurality of history windows] of events as a (2^k)-by-n matrix, where 2^k is a number of [time steps in each of the plurality of history windows] and n is a number of events [at each of the time steps], and applying a convolution that is an l-by-n matrix, where l is less than 2^k, wherein a set of the convolutions produces a new set of events, wherein the set of the convolutions includes a predetermined number of convolution layers, wherein the temporal convolution determines a maximum of the resulting events and ensures the fixed set of discrete classes is independent from the size of each of the plurality of history windows. (Pan Para [0054] discloses an “n*m feature matrix” in which one dimension “m is the quantity of sub time periods” (i.e., time steps) and the other dimension “n is the quantity of feature types” (i.e., events).  Pan Para [0060] discloses that “The input layer inputs each sample (n*m feature matrix) to a convolutional layer for convolution.”  Pan further discloses “convolution kernel quantity and size can be specified as needed” (Examiner’s note:  a convolution kernel is a matrix) wherein “at least one of a row quantity or a column quantity of the convolution kernel can be a predetermined quantity of feature types”, which is analogous to the instant application, where “n” is the column dimension for both the 2^k-by-n matrix and the l-by-n matrix (Pan describes an “n-by-m feature matrix”, and “a size of each convolution kernel can be n*j”).  Pan further discloses that the remaining dimension of the convolution kernel must be less than the corresponding dimension of the feature matrix (“size of each convolution kernel can be n*j, where j is a positive integer less than m”).  This is analogous to the instant application, where l must be less than 2^k.  Pan Para [0060] further states that “convolutional layer can output 100s feature graphs” (i.e., the output is a new set of events)).  *Wang teaches temporal convolutions in a plurality of history windows:  Wang, Section 2 Sentence 1, discloses “As shown in Figure 1, genCNN is overall recursive, consisting of CNN-based processing units of two types:  alphaCNN as the ‘front-end’, dealing with the history that is closest to the prediction; betaCNNs (which can repeat), in charge of more ‘ancient’ history.”  Wang, Section 2, defines each of the history windows, which comprise time steps, as Convolutional Neural Networks (alpha-CNN and beta-CNN), which “capture the temporal structure” of the sequence.
Pan discloses a predetermined number of convolution layers in [0060]:  “The input layer inputs each sample (n*m feature matrix) to a convolutional layer for convolution” and [0062]:  “The previous step 2 and step 3 can be repeated multiple times. After a combination of the convolutional layer”.  Here, Pan discloses a predetermined number (1) of convolution layers.  Pan also discloses “wherein the temporal convolution determines a maximum of the resulting events”, as Pan discloses “maxpooling” in [0061]:  “After the obtained 100s feature graphs are processed by using an activation function (such as the RELU function), processed feature graphs are transferred to a pooling layer for pooling (for example, the maxpooling method can be used for pooling).”  Finally, Examiner points out that “and ensures the fixed set of discrete classes is independent from the size of each of the plurality of history windows” is non-limiting language that carries no patentable weight.  See MPEP 2111.04 (I):  “whereby clause in a method claim is not given weight when it simply expresses the intended result of a process step positively recited.” Id. (quoting Minton v. Nat’l Ass’n of Securities Dealers, Inc., 336 F.3d 1373, 1381, 67 USPQ2d 1614, 1620 (Fed. Cir. 2003)).
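For illustration only, the convolution and max-pooling steps discussed above can be sketched as follows. This is a minimal sketch under stated assumptions, not an implementation taken from Pan, Wang, or the instant application: the function names, the example window, and the kernel values are all hypothetical. It assumes a window stored as a (2^k)-by-n list of rows and kernels of size l-by-n with l less than 2^k, so each kernel slides only along the time axis.

```python
# Illustrative sketch only; all names and values are hypothetical.

def temporal_convolution(window, kernel):
    """Slide an l-by-n kernel down the time axis of a (2^k)-by-n window.

    The kernel spans all n columns, so it moves only along the time axis
    and yields one value per position: (2^k - l + 1) values in total.
    """
    steps, n = len(window), len(window[0])
    l = len(kernel)
    assert l < steps, "kernel height l must be less than the number of time steps"
    assert all(len(row) == n for row in kernel), "kernel must span all n columns"
    outputs = []
    for t in range(steps - l + 1):
        total = 0.0
        for i in range(l):
            for j in range(n):
                total += window[t + i][j] * kernel[i][j]
        outputs.append(total)
    return outputs


def relu(values):
    """Activation applied before pooling."""
    return [max(0.0, v) for v in values]


def summarize_window(window, kernels):
    """Apply each kernel, then keep only the maximum response per kernel."""
    return [max(relu(temporal_convolution(window, kern))) for kern in kernels]


# A window with 2^k = 4 time steps (k = 2) and n = 3 event types.
window = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
]
# Two kernels of size l-by-n with l = 2.
kernels = [
    [[1, 0, 0], [0, 1, 0]],
    [[0, 0, 1], [0, 0, 1]],
]
features = summarize_window(window, kernels)  # [2.0, 1.0]
```

Because max pooling reduces each kernel's responses to a single value, the output length equals the number of kernels regardless of how many time steps the window contains.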
Pan and the combination of Zhong, Wang, Agashe, Tjalkens, Zaheer, and Thorhallsson are analogous art because they are all in the field of machine learning.  
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Pan with the combination of Zhong, Wang, Agashe, Tjalkens, Zaheer, and Thorhallsson to include a convolution kernel wherein at least one of a row quantity or a column quantity of the convolution kernel can be a predetermined quantity of feature types (i.e., events).  One would have been motivated to do so “because feature types (i.e. events) are not continuous, the convolution kernel does not need to be scanned in a distribution direction of each feature type” (Pan, Para [0060], Sentence 3).
Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Brandes et al. (“ASAP: a machine learning framework for local protein properties”) discloses, in the paragraph spanning Pages 3-4:  “We used a variant of the alphabet with a reduced alphabet of 15 letter groups, previously used in ProFET (17). This reduces the amount of features, making the predictor less sensitive to over-fitting while making it easier to identify insights from high-level features (e.g. clusters of large and charged AAs). For a window of size 20, this eliminates 100 (potentially interacting) features”, thus disclosing making a prediction based on a smaller alphabet, and also disclosing history windows.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to LEONARD A SIEGER whose telephone number is (571)272-9710. The examiner can normally be reached M-F 8:00 am - 5:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ann Lo, can be reached at (571) 272-9767. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/L.A.S./Examiner, Art Unit 2126     
/ANN J LO/Supervisory Patent Examiner, Art Unit 2126