DETAILED ACTION 
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
The amendment filed 2021-12-09 has been entered. Claims 1-15 remain pending in the application. Applicant’s amendments to the specification and claims overcome each and every objection and 112 rejection previously set forth in the Non-Final Office Action mailed 2021-09-13.
Response to Arguments
Applicant's argument with respect to rejections under 35 U.S.C. 103 has been fully considered but is not persuasive.  Applicant argues on Remarks Page 11 that Begleiter “does not teach or fairly suggest a "fixed set of discrete classes" into which the sensory information is mapped” because “the number of classes vary with k”.  Examiner respectfully disagrees, as the number of classes is fixed within a given window, thus meeting the limitation of the claim “in each history window”.  In light of this, Applicant has “amended to describe the fixed set of discrete classes is fixed regardless of the size of each of the plurality of history windows”.  However, Examiner points out that the amendment, as written, does not make this perfectly clear.  The amendment states:  “apply a function to the observed sensory sequence information in each history window, wherein the function maps the observed sensory sequence information into a fixed set of discrete classes regardless of the size of each of the plurality of history windows”.  It is not clear which element is “regardless of the size of the window”.  One possible reading is “wherein the fixed set of discrete classes is the same fixed set of discrete classes across each of the plurality of history windows”.
Nevertheless, in light of the cumulative effect of all of Applicant’s amendments, including those regarding an “abstract alphabet”, Examiner has changed the mapping of the claimed limitations below to better reflect the claimed limitations and to more efficiently advance prosecution.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 3, 4 and 7 are rejected under 35 U.S.C. 103 as being unpatentable over Zhong et al. (“Toward a self-organizing pre-symbolic neural model representing sensorimotor primitives”; hereinafter “Zhong”) in view of Wang et al. (“genCNN: A Convolutional Architecture for Word Sequence Prediction”; hereinafter “Wang”), Agashe et al. (US 2014/0095412 A1; hereinafter “Agashe”), and Begleiter et al. (“On Prediction Using Variable Order Markov Models”; hereinafter “Begleiter”).
As per claim 1, Zhong teaches an artificial intelligence system, comprising: a computing device including at least one processor, one or more data storage devices, and a non-transitory data storage medium interfaced with the at least one processor, the non-transitory data storage medium containing instructions that, when executed cause the at least one processor to (Zhong, Page 1 Abstract, discloses:  “This model employs neural network architecture incorporating a predictive sensory module based on an RNNPB (Recurrent Neural Network with Parametric Biases) and a horizontal product model.”  Here, Zhong discloses an artificial intelligence system.  Zhong, last sentence of Page 2 Section 1, discloses:  “Therefore it is sufficient to serve as a neural model that is running in a robot CPU to explain the language development in the cortical areas.” Here, Zhong discloses a computing device including at least one processor (a robot with a CPU), which also implies one or more data storage devices, and a non-transitory data storage medium interfaced with the at least one processor, the non-transitory data storage medium containing instructions that cause the at least one processor to execute.)
save observed sensory sequence information (Zhong, Page 3 Lines 4-10, discloses:  “A ball was attached at the end of the arm so that another robot could obtain the movement by tracking the ball. Our neural network was trained and run on the other NAO, which was called the “observer”. In this way, the observer robot perceived the object movement from its vision passively, so that its network took the object’s visual features and the movements into account”.  Here, Zhong discloses observed sensory sequence information, as the “observer robot” “perceived the object movement from its vision”, wherein “perceived” is another word for observed, and “vision” is sensory information, and “movement” implies this is over time, and thus a sequence of observed images.  Zhong, Top Left Page 5 Section 2.2 also discloses “error back-propagated from a certain sensory information sequence to the PB units”.  While it is likely implied that the information is saved, Zhong explicitly recites saving (“storing”) it at Page 9 Conclusion Para 2 Lines 4-5: “These PB units allow for storing multiple sensory sequences.”)
Zhong fails to teach that the observed sensory sequence information is saved, as an original alphabet in a plurality of history windows, the plurality of history windows being reverse chronological history windows. Zhong also fails to teach wherein a size of the plurality of history windows increase exponentially, wherein the plurality of history windows do not overlap; apply a function to the observed sensory sequence information in each history window, wherein the function maps the observed sensory sequence information into a fixed set of discrete classes regardless of the size of each of the plurality of history windows; apply a context tree weighting algorithm to an abstract alphabet resulting from the fixed set of discrete classes for each of the plurality of history windows; and make a prediction in the original alphabet based on patterns in the abstract alphabet.
Wang teaches that the [observed sensory] sequence information is saved as an original alphabet, in a plurality of history windows, the plurality of history windows being reverse chronological history windows. (Recall above Zhong discloses the sequence information is observed sensory sequence information.  Wang, Page 2 Section 2 Sentence 1, discloses “As shown in Figure 1, genCNN is overall recursive, consisting of CNN-based processing units of two types:  alphaCNN as the ‘front-end’, dealing with the history that is closest to the prediction; betaCNNs (which can repeat), in charge of more ‘ancient’ history.”  Wang, Page 2 Section 2 Last Sentence, discloses “Also distinct from RNN, genCNN gains most of its processing power from the heavy-duty processing units (i.e., αCNN and βCNNs), which follow a bottom-up information flow and yet can adequately capture the temporal structure in word sequence with its convolutional-gating architecture.”  Wang discloses saved as an original alphabet in a plurality of history windows in Page 2 Figure 1:

[Figure: Wang, Page 2, Figure 1 (media_image1.png)]

Here, one can see in the “history” row, the English words, which are part of the “original alphabet” (the finite set of words in the English language as written in Latin characters).
Examiner’s Note:  Here, alphaCNN and betaCNN are the plurality of history windows, and they are saving word sequence information. These windows go back in time progressively further, and are thus reverse chronological history windows.)
Wang also teaches apply a function to the observed [sensory] sequence information in each history window, wherein the function maps the observed [sensory] sequence information into a fixed set of discrete classes regardless of the size of each of the plurality of history windows and the fixed set of discrete classes resulting in an abstract alphabet (Recall above Zhong discloses the sequence information is observed sensory sequence information.  For the following, see Wang Page 2 Figure 1 and Page 5 Figure 4:

[Figure: Wang, Page 2, Figure 1 and Page 5, Figure 4 (media_image2.png)]

Wang, Page 5 Section 3.3, discloses:  “As suggested early on in Section 2 and Figure 1, we use extra CNNs with conventional weight-sharing, named βCNN, to summarize the history out of scope of αCNN. More specifically, the output of βCNN (with the same dimension of word-embedding) is put before the first word as the input to the αCNN, as illustrated in Figure 4. Different from αCNN, βCNN is designed just to summarize the history, with weight shared across its convolution units.”  Here, Wang discloses to apply a function (“αCNN” and “βCNN”) to the observed sequence information in each history window.  The function maps the sequence information into a fixed set of discrete classes (“word embedding”, as in “the output of βCNN (with the same dimension of word-embedding)”).  Note that the “fixed set of discrete classes” here is the words of the English language as represented by “word embedding”, which is when words are represented by numeric values in a vector.  Wang Page 6 Section 4 discloses:  “The parameters of a genCNN consists of the parameters for CNN word-embedding”, and on Page 7 Paragraph 1, discloses:  “We use word embedding with dimension 100” and in Para 3:  “For initialization, we use Word2Vec (Mikolov et al., 2013) for the starting state of the word-embeddings”.  There is a finite number of words in the English language, and thus English words represented by word embeddings are a fixed set of discrete classes that do not depend on the size of the βCNN windows.  These embeddings are an abstract representation of English words, as opposed to the more common original representation with Latin script.)
Wang also teaches make a prediction in the original alphabet based on patterns in the abstract alphabet (Wang, as shown above in the figure showing Fig. 1 and Fig. 4, makes a prediction in the original alphabet (“prediction:  ‘sandwich’”) based on patterns in the abstract alphabet (based on the outputs of the βCNNs being input into the αCNN)).
Zhong and Wang are analogous art because they are both in the field of machine learning.  
Zhong teaches a method to make a prediction based on historical sequence information, as stated in Zhong Page 6 Section 3.3:  “In this simulation, the obtained PB units from the previous recognition experiment were used to generate the predicted movements using the prior knowledge of a specific object.”  Wang also teaches a method of making a prediction 
The combination of Zhong and Wang fails to teach wherein a size of the plurality of history windows increase exponentially, wherein the plurality of history windows do not overlap. The combination of Zhong and Wang also fails to teach apply a context tree weighting algorithm to an alphabet resulting from the fixed set of discrete classes for each of the plurality of history windows.
Agashe teaches wherein a size of the plurality of history windows increase exponentially, wherein the plurality of history windows do not overlap. (Agashe Para [0076] discloses “FIGS. 5A and 5B illustrate an example linear buffer 500 representing a time window comprising exponentially-decaying time intervals and a series of associated counters in accordance with an embodiment of the invention. The time window and counters may track the occurrence of a particular type of event involving a particular asset. In FIG. 5A, element 501 represents the most recent 1-minute time interval beginning at 12:44 and ending at 12:45, element 502 represents a previous 2-minute time interval beginning at 12:42 and ending at 12:44, element 503 represents a previous 4-minute time interval beginning at 12:38 and ending at 12:42, and element 504 represents a previous 8-minute time interval beginning at 12:30 and ending at 12:38. Each element stores a counter associated with the time interval representing the element. At 12:30, the time window was created and the linear buffer was initialized. In FIG. 5A, approximately 15 minutes have elapsed between the creation of the time window and the current time, 12:44:23. A counter associated with the element 501 has been incremented 7 times, indicating that 7 events have occurred in the time interval represented by the element 501. A counter associated with the element 502 has been incremented 34 times, indicating that 34 events have occurred in the time interval represented by the element 502. A counter associated with the element 503 has been incremented 50 times, indicating that 50 events have occurred in the time interval representing the element 503. A counter associated with the element 504 has been incremented 72 times, indicating that 72 events have occurred in the time interval representing the element 504.”)

[Figure: Agashe, linear buffer 500 described in Para [0076] (media_image3.png)]
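Illustrative sketch (not taken from Agashe or from the instant application): the window layout described in Agashe Para [0076] can be reproduced in a few lines of Python. The function name `exponential_windows` and the use of list indices to stand for one-minute slots are assumptions made only for this illustration. The sketch walks backward from the most recent sample and cuts non-overlapping windows whose sizes double at each step (1, 2, 4, 8).

```python
def exponential_windows(history, num_windows):
    """Split `history` (oldest sample first) into reverse-chronological,
    non-overlapping windows of size 1, 2, 4, ... (hypothetical helper)."""
    windows = []
    end = len(history)              # exclusive end index of the next window
    for k in range(num_windows):
        size = 2 ** k               # window k covers 2^k samples
        start = max(0, end - size)  # clamp if the history runs out
        windows.append(history[start:end])
        end = start                 # the next (older) window ends where this one starts
        if end == 0:
            break
    return windows


# Fifteen one-minute slots: index 0 stands for 12:30-12:31, index 14 for 12:44-12:45.
minutes = list(range(15))
wins = exponential_windows(minutes, 4)
for w in wins:
    print(len(w), w)
```

With this input the four windows hold 1, 2, 4, and 8 slots respectively, which mirrors elements 501 through 504 of the linear buffer in the quoted paragraph, and no slot appears in more than one window.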


Zhong, Wang, and Agashe are analogous art because they are all in the field of machine learning. 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Agashe with the combination of Zhong and Wang to include saving sequence information in exponentially increasing history windows.  One would have been motivated to do so in order to be able to make more accurate predictions based on more data, while saving on storage resources (Agashe, Para [0075]:  “Exponentially decaying time intervals may allow precise tracking of 
The combination of Zhong, Wang, and Agashe fails to teach apply a context tree weighting algorithm to an abstract alphabet resulting from the fixed set of discrete classes for each of the plurality of history windows to predict a future discrete sequence.
Begleiter teaches apply a context tree weighting algorithm to an [abstract] alphabet [resulting from the fixed set of discrete classes for each of the plurality of history windows] (Recall above that Wang teaches an abstract alphabet resulting from the fixed set of discrete classes for each of the plurality of history windows.  Begleiter, Page 397 Para 2-3, discloses:  “The second method we consider is Volf's ‘decomposed CTW’, denoted here by DE-CTW (Volf, 2002). The DE-CTW uses a tree-based hierarchical decomposition of the multi-valued prediction problem into binary problems. Each of the binary problems is solved via a slight variation of the binary CTW algorithm. Let Σ be an alphabet with size k = |Σ|.”  Here, Begleiter discloses applying a context tree weighting algorithm (“DE-CTW”) to an alphabet (“Let Σ be an alphabet with size k = |Σ|”)).
Zhong, Wang, Agashe, and Begleiter are analogous art because they are all in the field of machine learning.  
The combination of Zhong, Wang, and Agashe is a sequence prediction algorithm (recall in the motivation statement for the combination of Wang and Zhong that both of these works are concerned with sequence prediction), and Begleiter’s DE-CTW algorithm is also concerned with sequence prediction (Page 397 Para 2: “The de-ctw uses a tree-based hierarchical 

As per Claim 3, the combination of Zhong, Wang, Agashe, and Begleiter teaches the artificial intelligence system of claim 1. Wang teaches wherein the function is a feature-wise maximum over time steps in one or more of the plurality of history windows. (Wang, Page 2 Section 2 Sentence 1, discloses “As shown in Figure 1, genCNN is overall recursive, consisting of CNN-based processing units of two types:  alphaCNN as the ‘front-end’, dealing with the history that is closest to the prediction; betaCNNs (which can repeat), in charge of more ‘ancient’ history.”  Wang, Page 4 Section 3.2 Line 1, discloses another embodiment in which “Previous CNNs, including those for NLP tasks (Hu et al., 2014; Kalchbrenner et al., 2014), take a straightforward convolution-pooling strategy, in which the ‘fusion’ decisions (e.g., selecting the largest one in max-pooling) are made based on the values of feature-maps”. Here, Wang is describing a plurality of history windows alphaCNN and betaCNN that comprise time steps.  Each of these history windows comprises a convolutional neural network (CNN).  These CNNs comprise a “convolution-pooling strategy” (i.e., function), such as “selecting the largest one” (i.e., maximum) “based on the values of feature-maps” (i.e. feature-wise). This function is done in each CNN, and is thus applied over time steps in the plurality of history windows). 
Zhong, Agashe, Begleiter, and Wang are analogous art because they are all in the field of machine learning.  
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine this teaching of Wang with the existing combination of Zhong, Agashe, Begleiter, and the primary teaching of Wang to include max-pooling (selecting the largest one) based on the values of feature maps.  One would have been motivated to do so to save time by relying on known work, as it is a “straightforward” strategy (Wang, Page 4 Section 3.2 Line 1:  “Previous CNNs, including those for NLP tasks (Hu et al., 2014; Kalchbrenner et al., 2014), take a straightforward convolution-pooling strategy, in which the ‘fusion’ decisions (e.g., selecting the largest one in max-pooling) are made based on the values of feature-maps”).
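Illustrative sketch (not taken from Wang): a feature-wise maximum over the time steps of a window can be written as below. The function name `feature_wise_max` and the sample values are hypothetical. Each window is assumed to be a list of time steps, and each time step a list of feature values of equal length.

```python
def feature_wise_max(window):
    """Return, for each feature, its maximum over all time steps in `window`.

    `window` is a list of time steps; each time step is a list of feature values.
    """
    if not window:
        raise ValueError("window must contain at least one time step")
    # zip(*window) regroups the values by feature (column) instead of by time step.
    return [max(column) for column in zip(*window)]


# Two windows of different sizes over the same three features (made-up values).
recent = [[0.2, 0.9, 0.1]]
older = [
    [0.5, 0.1, 0.3],
    [0.4, 0.8, 0.0],
    [0.7, 0.2, 0.6],
    [0.1, 0.3, 0.2],
]
print(feature_wise_max(recent))
print(feature_wise_max(older))
```

Under these assumptions the result always has one value per feature, so its length does not depend on how many time steps the window contains.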

As per Claim 4, the combination of Zhong, Wang, Agashe, and Begleiter teaches the artificial intelligence system of claim 3 and sensory sequence information (see Rejection to Claim 1).  Begleiter teaches wherein the observed sensory sequence information is a binary Begleiter, Page 393 Section 3.3, Paragraph 1, discloses “In this section we consider the original ctw algorithm for binary alphabets”.)
Zhong, Wang, Agashe, and Begleiter are analogous art because they are all in the field of machine learning.  
The combination of Zhong, Wang, and Agashe is a sequence prediction algorithm (recall in the motivation statement for the combination of Wang and Zhong that both of these works are concerned with sequence prediction), and Begleiter’s DE-CTW algorithm is also concerned with sequence prediction (Page 397 Para 2: “The de-ctw uses a tree-based hierarchical decomposition of the multi-valued prediction problem into binary problems.”) Begleiter’s DE-CTW algorithm is based on a simpler binary CTW algorithm.  Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the binary CTW algorithm of Begleiter with the combination of Zhong, Wang, and Agashe.  One would have been motivated to do so to efficiently handle predictions for large binary sequences (Begleiter, Pg. 393 Section 3.3 Para 1 and Pg. 396 Para 1:  “The Context Tree Weighting Method (ctw) algorithm (Willems et al., 1995) is a strong lossless compression algorithm that is based on a clever idea for combining exponentially many VMMs of bounded order” … “We presented the algorithm this way for simplicity. However, as already mentioned by Willems et al. (1995), it is possible to obtain linear time complexities for both training and prediction.”)
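Illustrative sketch (not Begleiter's implementation, and not the DE-CTW variant): the textbook binary context tree weighting recursion can be written compactly for short sequences. The function names, the fixed maximum depth, and the use of the first `depth` bits as initial context are assumptions made for this illustration. Each node uses the Krichevsky-Trofimov estimator, and each internal node mixes its own estimate with the product of its children's weighted probabilities using equal weights. This direct form recomputes the tree for every query and is therefore not the linear-time variant mentioned in the quotation.

```python
from fractions import Fraction


def kt_block_probability(zeros, ones):
    """Krichevsky-Trofimov probability of a specific bit string with these counts."""
    p = Fraction(1)
    a = b = 0
    for _ in range(zeros):
        p *= Fraction(2 * a + 1, 2 * (a + b + 1))  # P(next = 0 | a zeros, b ones)
        a += 1
    for _ in range(ones):
        p *= Fraction(2 * b + 1, 2 * (a + b + 1))  # P(next = 1 | a zeros, b ones)
        b += 1
    return p


def ctw_probability(bits, depth):
    """Weighted probability of bits[depth:], using bits[:depth] as initial context."""
    counts = {}  # context (most recent bit first) -> [number of 0s, number of 1s]
    for t in range(depth, len(bits)):
        context = tuple(bits[t - 1 - i] for i in range(depth))
        for d in range(depth + 1):
            counts.setdefault(context[:d], [0, 0])[bits[t]] += 1

    def weighted(context):
        zeros, ones = counts.get(context, (0, 0))
        if zeros + ones == 0:
            return Fraction(1)  # an unvisited subtree contributes a factor of 1
        estimate = kt_block_probability(zeros, ones)
        if len(context) == depth:
            return estimate     # leaf: KT estimate only
        children = weighted(context + (0,)) * weighted(context + (1,))
        return Fraction(1, 2) * estimate + Fraction(1, 2) * children

    return weighted(())


def ctw_predict_one(bits, depth):
    """Conditional probability that the next bit is 1."""
    return ctw_probability(bits + [1], depth) / ctw_probability(bits, depth)


alternating = [0, 1] * 8
p_one = ctw_predict_one(alternating, depth=1)
print(float(p_one))
```

For the alternating input, the sequence ends in 1 and every earlier 1 was followed by 0, so the sketch assigns the next bit a probability of being 1 that is below one half.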

As per Claim 7, the combination of Zhong, Wang, Agashe, and Begleiter teaches the artificial intelligence system of claim 1 as well as at least one processor and sensory information (see Rejection to Claim 1).  Wang teaches wherein the instructions cause [the at least one processor] to perform a temporal convolution in a deep neural network to map observed [sensory] sequence information from the plurality of history windows to symbols. (Wang, Page 2 Section 2 Sentence 1, discloses “As shown in Figure 1, genCNN is overall recursive, consisting of CNN-based processing units of two types:  alphaCNN as the ‘front-end’, dealing with the history that is closest to the prediction; betaCNNs (which can repeat), in charge of more ‘ancient’ history.”  Wang, Section 2, defines each of the history windows as Convolutional Neural Networks (alpha-CNN and beta-CNN), which “capture the temporal structure” of the sequence.  Wang Page 3 Figure 2 and Page 5 Figure 4 illustrate that each CNN makes a prediction of the next word (i.e., symbol)).  
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Wang with the combination of Zhong, Agashe, and Begleiter for at least the reasons recited in Claim 1.

Claim 2 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Zhong, Wang, Agashe, and Begleiter, further in view of Thorhallsson et al. (“Visualizing the Bias-Variance Tradeoff”; hereinafter, “Thorhallsson”).
As per Claim 2, the combination of Zhong, Wang, Agashe, and Begleiter teaches the artificial intelligence system of claim 1 as well as at least one processor, and plurality of history windows (see Rejection to Claim 1).  However, the combination of Zhong, Wang, Agashe, and Begleiter fails to teach wherein the instructions cause the at least one processor to choose at least one hyperparameter for each of the plurality of history windows to allow the system to trade off bias-variance.
Thorhallsson teaches wherein the instructions cause [the at least one processor] to choose at least one hyperparameter [for each of the plurality of history windows] to allow the system to trade off bias-variance.  (Thorhallsson, Page 1 Intro Para 2, discloses that “This leads to a fundamental tradeoff known as the bias-variance tradeoff which is of paramount importance for optimal choice of the hyperparameters for the learning algorithms”). 
Zhong, Wang, Agashe, Begleiter, and Thorhallsson are analogous art because they are all in the field of machine learning.  
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Thorhallsson with the combination of Zhong, Wang, Agashe, and Begleiter to include selecting the optimal choice of hyperparameters, for which the bias-variance tradeoff is of paramount importance.  One would have been motivated to do so to “capture sophisticated relationships in the data while keeping it simple to prevent noise from affecting the outcome” (Thorhallsson, Page 1 Intro, Paragraph 2).

Claims 5 and 6 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Zhong, Wang, Agashe, and Begleiter, further in view of Campbell et al. (US 2019/0012371 A1; hereinafter, “Campbell”).

As per Claim 5, the combination of Zhong, Wang, Agashe, and Begleiter teaches the artificial intelligence system of claim 1 as well as at least one processor, arbitrary length histories, and context tree weighting algorithm (see Rejection to Claim 1).  However, the combination of Zhong, Wang, Agashe, and Begleiter fails to teach wherein the instructions cause the at least one processor to use a deep neural network classifier to map arbitrary length histories to a second alphabet having smaller length than the alphabet of the arbitrary length histories as an input sequence for the context tree weighting algorithm.
Campbell teaches wherein [the instructions cause the at least one processor] to use a deep neural network classifier to map [arbitrary length] histories to a second alphabet having smaller length than the alphabet of the [arbitrary length] histories [as an input sequence for the context tree weighting algorithm]. (Campbell Para [0077] discloses “The belief tracker component 408 is configured to identify what table(s) of the database and what column(s) of the tables of the database are being implicated by the user's utterance. In particular, belief tracker component 408 implements a neural network (e.g., a recurrent neural network) that is configured to map dialog history to belief states. A belief state is a distribution over user goals and dialog states (e.g., context). The output of the belief tracker is an encoding of both the current user utterance and the history of utterances of user-system utterances. The user goal is related to: one or more tables of the database and their respective columns of metadata, such as names and data types; and vocabulary of columns (e.g., slots). The belief tracker component is configured to receive the feature vector as an input from the feature extractor component 404, concatenate the feature vector with the encoded dialog history that was generated by the context encoding component 406, and produce a probability distribution vector over the columns of the multiple tables of the database 410.”  Examiner’s Note:  A recurrent neural network is a type of deep neural network.  Campbell is using this to “map” (i.e., classify) “dialog history” to “belief states”.  The “belief states” are a smaller alphabet than the “user utterances” that comprise the “dialog history”, as they are “columns of the multiple tables of the database”).
Zhong, Wang, Agashe, Begleiter, and Campbell are analogous art because they are all in the field of machine learning.  
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Campbell with the combination of Zhong, Wang, Agashe, and Begleiter to include mapping sequence histories to a smaller set of states.  One would have been motivated to do so to extract underlying context or meaning from a sequence, as Campbell states:  “map dialog history to belief states” (Campbell Para [0077]).

As per Claim 6, the combination of Zhong, Wang, Agashe, Begleiter, and Campbell teaches the artificial intelligence system of claim 5.  Campbell teaches wherein a long short-term memory-based sequence to symbol method is used to map the arbitrary length histories to a second alphabet having smaller length than the alphabet of the arbitrary length histories. (Campbell Para [0077] discloses “The belief tracker component 408 is configured to identify what table(s) of the database and what column(s) of the tables of the database are being implicated by the user's utterance. In particular, belief tracker component 408 implements a neural network (e.g., a recurrent neural network) that is configured to map dialog history to belief states. A belief state is a distribution over user goals and dialog states (e.g., context). The output of the belief tracker is an encoding of both the current user utterance and the history of utterances of user-system utterances. The user goal is related to: one or more tables of the database and their respective columns of metadata, such as names and data types; and vocabulary of columns (e.g., slots). The belief tracker component is configured to receive the feature vector as an input from the feature extractor component 404, concatenate the feature vector with the encoded dialog history that was generated by the context encoding component 406, and produce a probability distribution vector over the columns of the multiple tables of the database 410.”  Examiner’s Note:  A recurrent neural network is a type of deep neural network.  Campbell is using this to “map” (i.e., classify) “dialog history” to “belief states”.
The “belief states” are a smaller alphabet than the “user utterances” that comprise the “dialog history”, as they are “columns of the multiple tables of the database.”  Furthermore, Campbell, Para [0071] first sentence, discloses that “dialog manager component 412 and belief tracker component 408 are trained via the supervised learning component 602.” Campbell Para [0074] then discloses that “In some embodiments of the present invention, the supervised learning component 602 can be represented using a variety of suitable techniques, such as for example, multiplayer perceptron (MLP) representation, gated recurrent unit (GRU) representation, long-short term memory (LSTM) representation, and/or a memory network representation”).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Campbell with the combination of Zhong, Wang, Agashe, and Begleiter for at least the reasons recited in Claim 5.
Claims 8 and 9 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Zhong, Wang, Agashe, and Begleiter, further in view of Pan et al. (US 2020/0211106 A1; hereinafter, “Pan”).
As per Claim 8, the combination of Zhong, Wang, Agashe, and Begleiter teaches the artificial intelligence system of claim 7 as well as plurality of history windows and temporal convolution (see Rejection to Claim 7).  Wang teaches time steps (Wang, Page 2 Section 2 Sentence 1, discloses “As shown in Figure 1, genCNN is overall recursive, consisting of CNN-based processing units of two types:  alphaCNN as the ‘front-end’, dealing with the history that is closest to the prediction; betaCNNs (which can repeat), in charge of more ‘ancient’ history.”  Wang, Page 4 Section 3.2 Line 1, discloses another embodiment in which “Previous CNNs, including those for NLP tasks (Hu et al., 2014; Kalchbrenner et al., 2014), take a straightforward convolution-pooling strategy, in which the ‘fusion’ decisions (e.g., selecting the largest one in max-pooling) are made based on the values of feature-maps”. Here, Wang is describing a plurality of history windows alphaCNN and betaCNN that comprise time steps.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Wang with the combination of Zhong, Agashe, and Begleiter for at least the reasons recited in Claim 1.
However, the combination of Zhong, Wang, Agashe, and Begleiter fails to teach wherein the temporal convolution includes defining each of the plurality of history windows of events as a (2^k)-by-n matrix, where 2^k is a number of time steps in each of the plurality of history windows and n is a number of events at each of the time steps, and applying a 
Pan teaches wherein the [temporal] convolution includes defining each [of the plurality of history windows] of events as a (2^k)-by-n matrix, where 2^k is a number of [time steps in each of the plurality of history windows] and n is a number of events [at each of the time steps], and applying a convolution that is an l-by-n matrix, where l is less than 2^k, wherein a set of the convolutions produces a new set of events. (Pan Para [0054] discloses an “n*m feature matrix” in which one dimension “m is the quantity of sub time periods” (i.e., time steps) and the other dimension “n is the quantity of feature types” (i.e., events).  Pan Para [0060] discloses that  “The input layer inputs each sample (n*m feature matrix) to a convolutional layer for convolution.”  Pan further discloses “convolution kernel quantity and size can be specified as needed” (Examiner’s note:  a convolution kernel is a matrix) wherein “at least one of a row quantity or a column quantity of the convolution kernel can be a predetermined quantity of feature types”, which is analogous to the instant application, where “n” is the column dimension for both the 2^k-by-n matrix and the l-by-n matrix (Pan describes an “n-by-m feature matrix”, and “a size of each convolution kernel can be n*j”).  Pan further discloses that the remaining dimension of the convolutional kernel must be less than the corresponding dimension of the feature matrix (“size of each convolution kernel can be n*j, where j is a positive integer less than m”).  This is analogous to the instant application, where l must be less than 2^k.  Pan Para [0060] further states that “convolutional layer can output 100s feature graphs.” (i.e. the output is a new set of events)).
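Illustrative sketch (not taken from Pan or from the instant application): the matrix dimensions discussed above can be made concrete with a small example. The function name `temporal_convolution` and the sample matrices are hypothetical. The window is assumed to be a (2^k)-by-n matrix stored as a list of rows, and each kernel an l-by-n matrix with l less than 2^k. Because the kernel spans all n columns, it slides only along the time axis. As is common in neural network usage, the operation shown is a sliding dot product without kernel flipping.

```python
def temporal_convolution(window, kernels):
    """Slide each l-by-n kernel along the time axis of a (2^k)-by-n window.

    Returns one response sequence per kernel, each of length 2^k - l + 1.
    """
    steps = len(window)   # 2^k time steps
    n = len(window[0])    # n events per time step
    new_events = []
    for kernel in kernels:
        l = len(kernel)
        if l >= steps or any(len(row) != n for row in kernel):
            raise ValueError("each kernel must be l-by-n with l < number of time steps")
        response = [
            sum(window[t + i][j] * kernel[i][j] for i in range(l) for j in range(n))
            for t in range(steps - l + 1)
        ]
        new_events.append(response)
    return new_events


# k = 2 gives 4 time steps; n = 2 events per step (made-up values).
window = [[1, 0], [0, 1], [1, 1], [0, 0]]
kernels = [
    [[1, 0], [0, 1]],  # l = 2
    [[1, 1], [1, 1]],  # l = 2
]
print(temporal_convolution(window, kernels))
```

Each kernel yields one new sequence, so a set of kernels yields a new set of event sequences, here two sequences of length 3.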

Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Pan with the combination of Zhong, Wang, Agashe, and Begleiter to include a convolution kernel wherein at least one of a row quantity or a column quantity of the convolution kernel can be a predetermined quantity of feature types (i.e., events).  One would have been motivated to do so “because feature types (i.e. events) are not continuous, the convolution kernel does not need to be scanned in a distribution direction of each feature type” (Pan, Para [0060], Sentence 3).

As per Claim 9, the combination of Zhong, Wang, Agashe, Begleiter, and Pan teaches the artificial intelligence system of claim 8 as shown above.  Wang teaches wherein the convolution is applied to each of the plurality of history windows. (Wang, Section 2 Sentence 1, discloses “As shown in Figure 1, genCNN is overall recursive, consisting of CNN-based processing units of two types:  alphaCNN as the ‘front-end’, dealing with the history that is closest to the prediction; betaCNNs (which can repeat), in charge of more ‘ancient’ history.”  Wang Section 2 describes each of the history windows as Convolutional Neural Networks (CNNs)).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Wang with the combination of Zhong, Agashe, and Begleiter for at least the reasons recited in Claim 1.
 
Claims 10, 11, 12, and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Zhong in view of Wang, Agashe, Begleiter, Thorhallsson, and Campbell.  
As per Claim 10, Zhong teaches an artificial intelligence system, comprising: a computing device including at least one processor, one or more data storage devices, and a non-transitory data storage medium interfaced with the at least one processor, the non-transitory data storage medium containing instructions that, when executed cause the at least one processor to (Zhong, Page 1 Abstract, discloses:  “This model employs neural network architecture incorporating a predictive sensory module based on an RNNPB (Recurrent Neural Network with Parametric Biases) and a horizontal product model.”  Here, Zhong discloses an artificial intelligence system.  Zhong, Page 3 last sentence of Section 1, discloses:  “Therefore it is sufficient to serve as a neural model that is running in a robot CPU to explain the language development in the cortical areas.” Here, Zhong discloses a computing device including at least one processor (a robot with a CPU), which also implies one or more data storage devices, and a non-transitory data storage medium interfaced with the at least one processor, the non-transitory data storage medium containing instructions that cause the at least one processor to execute.)
save observed sensory sequence information (Zhong, Page 3 Lines 4-10, discloses:  “A ball was attached at the end of the arm so that another robot could obtain the movement by tracking the ball. Our neural network was trained and run on the other NAO, which was called the “observer”. In this way, the observer robot perceived the object movement from its vision passively, so that its network took the object’s visual features and the movements into account”.  Here, Zhong discloses observed sensory sequence information, as the “observer robot” “perceived the object movement from its vision”, wherein “perceived” is another word for observed, and “vision” is sensory information, and “movement” implies this is over time, and thus a sequence of observed images.  Zhong, Section 2.2 also discloses “error back-propagated from a certain sensory information sequence to the PB units”.  While it is likely implied that the information is saved, Zhong explicitly recites saving it (“storing”) in Conclusion Para 2 Lines 4-5: “These PB units allow for storing multiple sensory sequences.”)
Zhong fails to teach that the observed sensory sequence information is saved, as an original alphabet in a plurality of history windows, the plurality of history windows being reverse chronological history windows. Zhong also fails to teach wherein a size of the plurality of history windows increase exponentially, wherein the plurality of history windows do not overlap; apply a function to the observed sensory sequence information in each history window, wherein the function maps the observed sensory sequence information into a fixed set of discrete classes, fixed for all of the plurality of history windows; choose at least one hyperparameter for each of the plurality of history window to allow the system to trade off bias-variance; use a deep neural network classifier to map arbitrary length histories to an abstract alphabet having smaller length than the original alphabet of the arbitrary length history windows as an input sequence for the context tree weighting algorithm; apply a context tree weighting algorithm to the abstract alphabet resulting from the fixed set of discrete classes for each of the plurality of history windows; and make a prediction in the original alphabet based on patterns in the abstract alphabet.
Wang teaches that the [observed sensory] sequence information is saved as an original alphabet, in a plurality of history windows, the plurality of history windows being reverse chronological history windows. (Recall above Zhong discloses the sequence information is observed sensory sequence information.  Wang, Page 2 Section 2 Sentence 1, discloses “As shown in Figure 1, genCNN is overall recursive, consisting of CNN-based processing units of two types:  alphaCNN as the ‘front-end’, dealing with the history that is closest to the prediction; betaCNNs (which can repeat), in charge of more ‘ancient’ history.”  Wang, Page 2 Section 2 Last Sentence, discloses “Also distinct from RNN, genCNN gains most of its processing power from the heavy-duty processing units (i.e., CNN and CNNs), which follow a bottom-up information flow and yet can adequately capture the temporal structure in word sequence with its convolutional-gating architecture.”  Wang discloses saved as an original alphabet in a plurality of history windows in Page 2 Figure 1:

    [Figure omitted: Wang, Page 2, Figure 1 (media_image1.png)]

Here, one can see in the “history” row, the English words, which are part of the “original alphabet” (the finite set of words in the English language as written in Latin characters).
Examiner’s Note:  Here, alphaCNN and betaCNN are the plurality of history windows, and they are saving word sequence information. These windows go back in time progressively further, and are thus reverse chronological history windows.)
Wang also teaches apply a function to the observed [sensory] sequence information in each history window, wherein the function maps the observed [sensory] sequence information into a fixed set of discrete classes regardless of the size of each of the plurality of history windows and the fixed set of discrete classes resulting in an abstract alphabet (Recall above Zhong discloses the sequence information is observed sensory sequence information.  For the following, see Wang Page 2 Figure 1 and Page 5 Figure 4:

    [Figure omitted: Wang, Page 2 Figure 1 and Page 5 Figure 4 (media_image2.png)]

Wang, Page 5 Section 3.3, discloses:  “As suggested early on in Section 2 and Figure 1, we use extra CNNs with conventional weight-sharing, named βCNN, to summarize the history out of scope of αCNN. More specifically, the output of βCNN (with the same dimension of word-embedding) is put before the first word as the input to the αCNN, as illustrated in Figure 4. Different from αCNN, βCNN is designed just to summarize the history, with weight shared across its convolution units.”  Here, Wang discloses applying a function (“αCNN” and “βCNN”) to the observed sequence information in each history window.  The function maps the sequence information into a fixed set of discrete classes (“word embedding”, as in “the output of βCNN (with the same dimension of word-embedding)”).  Note that the “fixed set of discrete classes” here is the words of the English language as represented by “word embedding”, in which words are represented by numeric values in a vector.  Wang Page 6 Section 4 discloses:  “The parameters of a genCNN consists of the parameters for CNN word-embedding”, and on Page 7 Paragraph 1, discloses:  “We use word embedding with dimension 100” and in Para 3:  “For initialization, we use Word2Vec (Mikolov et al., 2013) for the starting state of the word-embeddings”.  There is a finite number of words in the English language, and thus English words represented by word embeddings are a fixed set of discrete classes that do not depend on the size of the βCNN windows.  These embeddings are an abstract representation of English words, as opposed to the more common original representation with Latin script.)
Wang also teaches make a prediction in the original alphabet based on patterns in the abstract alphabet (Wang, as shown above in the figure showing Fig. 1 and Fig. 4, makes a prediction in the original alphabet (“prediction:  ‘sandwich’”) based on patterns in the abstract alphabet (based on the outputs of the βCNNs being input into the αCNN)).

Zhong teaches a method to make a prediction based on historical sequence information, as stated in Zhong Page 6 Section 3.3:  “In this simulation, the obtained PB units from the previous recognition experiment were used to generate the predicted movements using the prior knowledge of a specific object.”  Wang also teaches a method of making a prediction based on historical sequence information as stated in Wang Abstract:  “Instead, we use a convolutional neural network to predict the next word with the history of words of variable length”; however, Wang adds a plurality of history windows.  Zhong discloses using an RNN as stated in Zhong, Page 1 Abstract:  “This model employs neural network architecture incorporating a predictive sensory module based on an RNNPB (Recurrent Neural Network with Parametric Biases)”, and Wang discloses replacing RNN with their genCNN model as stated in Wang, Page 1 Abstract:  “Different from previous work on neural network-based language modeling and generation (e.g., RNN or LSTM), we choose not to greedily summarize the history of words as a fixed length vector. Instead, we use a convolutional neural network”.  To apply Wang’s replacement of RNN with genCNN to Zhong would result in being able to make a sensory sequence prediction based on a plurality of sensory sequence information windows.  Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Zhong and Wang.  One would have been motivated to do so in order to make more accurate predictions by utilizing more historical data (Wang, Abstract:  “We argue that our model can give adequate representation of the history, and therefore can naturally exploit both the short and long range dependencies”…).
The combination of Zhong and Wang fails to teach wherein a size of the plurality of history windows increase exponentially, wherein the plurality of history windows do not overlap; choose at least one hyperparameter for each of the plurality of history window to allow the system to trade off bias-variance; use a deep neural network classifier to map arbitrary length histories to an abstract alphabet having smaller length than the original alphabet of the arbitrary length history windows as an input sequence for the context tree weighting algorithm; apply a context tree weighting algorithm to the abstract alphabet resulting from the fixed set of discrete classes for each of the plurality of history windows.
Agashe teaches wherein a size of the plurality of history windows increase exponentially, wherein the plurality of history windows do not overlap. (Agashe Para [0076] discloses “FIGS. 5A and 5B illustrate an example linear buffer 500 representing a time window comprising exponentially-decaying time intervals and a series of associated counters in accordance with an embodiment of the invention. The time window and counters may track the occurrence of a particular type of event involving a particular asset. In FIG. 5A, element 501 represents the most recent 1-minute time interval beginning at 12:44 and ending at 12:45, element 502 represents a previous 2-minute time interval beginning at 12:42 and ending at 12:44, element 503 represents a previous 4-minute time interval beginning at 12:38 and ending at 12:42, and element 504 represents a previous 8-minute time interval beginning at 12:30 and ending at 12:38. Each element stores a counter associated with the time interval representing the element. At 12:30, the time window was created and the linear buffer was initialized. In FIG. 5A, approximately 15 minutes have elapsed between the creation of the time window and the current time, 12:44:23. A counter associated with the element 501 has been incremented 7 times, indicating that 7 events have occurred in the time interval represented by the element 501. A counter associated with the element 502 has been incremented 34 times, indicating that 34 events have occurred in the time interval represented by the element 502. A counter associated with the element 503 has been incremented 50 times, indicating that 50 events have occurred in the time interval representing the element 503. A counter associated with the element 504 has been incremented 72 times, indicating that 72 events have occurred in the time interval representing the element 504.”)

    [Figure omitted: Agashe, FIGS. 5A and 5B (media_image3.png)]


Zhong, Wang, and Agashe are analogous art because they are all in the field of machine learning. 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Agashe with the combination of Zhong and Wang to include saving sequence information in exponentially increasing history windows.  One would have been motivated to do so in order to be able to make more accurate predictions based on more data, while saving on storage resources (Agashe, Para [0075]:  “Exponentially decaying time intervals may allow precise tracking of recent events while conserving storage capacity by passing a record of the occurrence of events to progressively larger time intervals as time progresses and the events become less recent”). 
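Examiner’s Note:  The non-overlapping, exponentially growing intervals described in Agashe Para [0076] can be illustrated with a minimal sketch.  It is offered only as an illustration under stated assumptions and is not Agashe’s implementation: the function name exponential_windows and the representation of time as integer minutes past 12:00 are hypothetical.  Each window begins exactly where the next more recent window ends, and each is twice the size of the one before it.

```python
def exponential_windows(num_windows, now):
    """Return (start, end) pairs, most recent first; window i spans 2**i time units."""
    windows = []
    end = now
    for i in range(num_windows):
        size = 2 ** i                 # 1, 2, 4, 8, ...
        start = end - size
        windows.append((start, end))
        end = start                   # next (older) window ends where this one starts
    return windows


# Hypothetical example mirroring the intervals quoted from Agashe FIG. 5A,
# with time expressed as minutes past 12:00 (so 45 means 12:45).
windows = exponential_windows(4, now=45)
```

For four windows ending at 12:45, the sketch returns the intervals 12:44 to 12:45, 12:42 to 12:44, 12:38 to 12:42, and 12:30 to 12:38, which are the same boundaries recited for elements 501 through 504.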
The combination of Zhong, Wang, and Agashe fails to teach choose at least one hyperparameter for each of the plurality of history window to allow the system to trade off bias-variance; use a deep neural network classifier to map arbitrary length histories to an abstract alphabet having smaller length than the original alphabet of the arbitrary length history windows as an input sequence for the context tree weighting algorithm; apply a context tree weighting algorithm to the abstract alphabet resulting from the fixed set of discrete classes for each of the plurality of history windows. 
Begleiter teaches and apply a context tree weighting algorithm to an [abstract] alphabet [resulting from the fixed set of discrete classes for each of the plurality of history windows] (Recall above that Wang teaches an abstract alphabet resulting from the fixed set of discrete classes for each of the plurality of history windows.  Begleiter, Page 397 Para 2-3, discloses:  “The second method we consider is Volf's ‘decomposed CTW’, denoted here by ‘DE-CTW’ (Volf, 2002). The DE-CTW uses a tree-based hierarchical decomposition of the multi-valued prediction problem into binary problems. Each of the binary problems is solved via a slight variation of the binary CTW algorithm. Let Σ be an alphabet with size k = |Σ|.”  Here, Begleiter discloses applying a context tree weighting algorithm (“DE-CTW”) to an alphabet (“Let Σ be an alphabet with size k = |Σ|”)).
Zhong, Wang, Agashe, and Begleiter are analogous art because they are all in the field of machine learning.  
The combination of Zhong, Wang, and Agashe is a sequence prediction algorithm (recall in the motivation statement for the combination of Wang and Zhong that both of these works are concerned with sequence prediction), and Begleiter’s DE-CTW algorithm is also concerned with sequence prediction (Page 397 Para 2: “The de-ctw uses a tree-based hierarchical decomposition of the multi-valued prediction problem into binary problems.”) Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the DE-CTW algorithm of Begleiter with the combination of Zhong, Wang, and Agashe.  One would have been motivated to do so to maximize accuracy of the prediction results (Begleiter, Pg. 413 Para 1 and Para 5:  “While it should not be expected that one algorithm will consistently outperform others on all tasks, our rather extensive empirical evaluations over the three domains indicate that there are prominent algorithms, which consistently tend to generate more accurate predictions than the other algorithms we examine. These algorithms are the `prediction by partial match' (ppm-c) and `decomposed 
The combination of Zhong, Wang, Agashe, and Begleiter fails to teach choose at least one hyperparameter for each of the plurality of history window to allow the system to trade off bias-variance; use a deep neural network classifier to map arbitrary length histories to an abstract alphabet having smaller length than the original alphabet of the arbitrary length history windows as an input sequence for the context tree weighting algorithm.
Thorhallsson teaches choose at least one hyperparameter [for each of the plurality of history windows] to allow the system to trade off bias-variance.  (Thorhallsson, Intro Para 2, discloses that “This leads to a fundamental tradeoff known as the bias-variance tradeoff which is of paramount importance for optimal choice of the hyperparameters for the learning algorithms”).  *Wang, Section 2 Sentence 1, discloses “As shown in Figure 1, genCNN is overall recursive, consisting of CNN-based processing units of two types:  alphaCNN as the ‘front-end’, dealing with the history that is closest to the prediction; betaCNNs (which can repeat), in charge of more ‘ancient’ history” (i.e., a plurality of history windows comprising CNNs, and CNNs are a machine learning model and therefore have a bias-variance tradeoff).
Zhong, Wang, Agashe, Begleiter, and Thorhallsson are analogous art because they are all in the field of machine learning.  
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Thorhallsson with the combination of Zhong, Wang, Agashe, and Begleiter to include selecting the optimal choice of 
The combination of Zhong, Wang, Agashe, Begleiter, and Thorhallsson fails to teach use a deep neural network classifier to map arbitrary length histories to an abstract alphabet having smaller length than the original alphabet of the arbitrary length history windows as an input sequence for the context tree weighting algorithm.
Campbell teaches and use a deep neural network classifier to map [arbitrary length] histories to a second alphabet having smaller length than the alphabet of the [arbitrary length] histories [as an input sequence for the context tree weighting algorithm]. (Campbell Para [0077] discloses “The belief tracker component 408 is configured to identify what table(s) of the database and what column(s) of the tables of the database are being implicated by the user's utterance. In particular, belief tracker component 408 implements a neural network (e.g., a recurrent neural network) that is configured to map dialog history to belief states. A belief state is a distribution over user goals and dialog states (e.g., context). The output of the belief tracker is an encoding of both the current user utterance and the history of utterances of user-system utterances. The user goal is related to: one or more tables of the database and their respective columns of metadata, such as names and data types; and vocabulary of columns (e.g., slots). The belief tracker component is configured to receive the feature vector as an input from the feature extractor component 404, concatenate the feature vector with the encoded dialog history that was generated by the context encoding component 406, and produce a probability distribution vector over the columns of the multiple tables of the database 410.”  Examiner’s Note:  A recurrent neural network is a type of deep neural network.  Campbell is using this to “map” (i.e., classify) “dialog history” to “belief states”.  The “belief states” are a smaller alphabet than the “user utterances” that comprise the “dialog history”, as they are “columns of the multiple tables of the database”).  
*Wang discloses arbitrary length histories with the repeating betaCNN windows:  Wang, Section 2 Sentence 1, discloses “As shown in Figure 1, genCNN is overall recursive, consisting of CNN-based processing units of two types:  alphaCNN as the ‘front-end’, dealing with the history that is closest to the prediction; betaCNNs (which can repeat), in charge of more ‘ancient’ history.” *Begleiter discloses mapping an alphabet before using as input to a CTW algorithm:  Begleiter, Section 3.4 Paragraph 2, discloses “concatenating binary words of size k, one for each alphabet symbol” and uses this for “application of the standard binary ctw algorithm over a binary representation of the sequence”.  
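Examiner’s Note:  The idea of re-expressing a multi-symbol sequence as a binary sequence before applying a binary CTW algorithm can be illustrated with a minimal sketch.  It is offered only as an illustration under stated assumptions and is not Begleiter’s implementation: the function name to_binary_sequence, the fixed-width encoding of ceil(log2 |Σ|) bits per symbol, and the three-symbol example alphabet are all hypothetical choices made for this sketch.

```python
from math import ceil, log2


def to_binary_sequence(sequence, alphabet):
    """Concatenate one fixed-width binary word per symbol of the sequence."""
    # Assumed encoding for illustration: width = ceil(log2 |alphabet|), at least 1 bit.
    width = max(1, ceil(log2(len(alphabet))))
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    bits = []
    for symbol in sequence:
        word = format(index[symbol], f"0{width}b")   # e.g. index 2 -> "10"
        bits.extend(int(b) for b in word)
    return bits, width


# Hypothetical three-symbol alphabet; each symbol becomes a 2-bit word.
bits, width = to_binary_sequence("abca", ["a", "b", "c"])
```

In this example the symbols a, b, and c are encoded as 00, 01, and 10, so the four-symbol sequence becomes an eight-bit sequence that a binary predictor could consume.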
Zhong, Wang, Agashe, Begleiter, and Campbell are analogous art because they are all in the field of machine learning.  
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Campbell with the combination of Zhong, Wang, Agashe, and Begleiter to include mapping sequence histories to a smaller set of states.  One would have been motivated to do so to extract underlying context or meaning from a sequence, as Campbell states:  “map dialog history to belief states” (Campbell Para [0077]).

As per Claim 11, the combination of Zhong, Wang, Agashe, Begleiter, Thorhallsson, and Campbell teaches the artificial intelligence system of claim 10. Campbell teaches wherein a long short-term memory-based sequence to symbol method is used to map the arbitrary length histories to the minimal output symbol alphabet.  (Campbell Para [0077] discloses “The belief tracker component 408 is configured to identify what table(s) of the database and what column(s) of the tables of the database are being implicated by the user's utterance. In particular, belief tracker component 408 implements a neural network (e.g., a recurrent neural network) that is configured to map dialog history to belief states. A belief state is a distribution over user goals and dialog states (e.g., context). The output of the belief tracker is an encoding of both the current user utterance and the history of utterances of user-system utterances. The user goal is related to: one or more tables of the database and their respective columns of metadata, such as names and data types; and vocabulary of columns (e.g., slots). The belief tracker component is configured to receive the feature vector as an input from the feature extractor component 404, concatenate the feature vector with the encoded dialog history that was generated by the context encoding component 406, and produce a probability distribution vector over the columns of the multiple tables of the database 410.”  Examiner’s Note:  A recurrent neural network is a type of deep neural network.  Campbell is using this to “map” (i.e., classify) “dialog history” to “belief states”.  
The “belief states” are a smaller alphabet than the “user utterances” that comprise the “dialog history”, as they are “columns of the multiple tables of the database.”  Furthermore, Campbell, Para [0071] first sentence, discloses that “dialog manager component 412 and belief tracker component 408 are trained via the supervised learning component 602.” Campbell Para [0074] then discloses that “In some embodiments of the present invention, the supervised learning component 602 can be represented using a variety of suitable techniques, such as for example, multiplayer perceptron (MLP) representation, gated recurrent unit (GRU) representation, long-short term memory (LSTM) representation, and/or a memory network representation”). 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Campbell with the combination of Zhong, Wang, Agashe, Begleiter, and Thorhallsson for at least the reasons recited in Claim 10.

As per Claim 12, the combination of Zhong, Wang, Agashe, Begleiter, Thorhallsson, and Campbell teaches the artificial intelligence system of claim 10.  Begleiter teaches wherein the observed sensory sequence information is a binary event (Begleiter, Page 393 Section 3.3, Paragraph 1, discloses “In this section we consider the original ctw algorithm for binary alphabets”.)
Zhong, Wang, Agashe, and Begleiter are analogous art because they are all in the field of machine learning.  
The combination of Zhong, Wang, and Agashe is a sequence prediction algorithm (recall in the motivation statement for the combination of Wang and Zhong that both of these works are concerned with sequence prediction), and Begleiter’s DE-CTW algorithm is also concerned with sequence prediction (Page 397 Para 2: “The de-ctw uses a tree-based hierarchical 

As per Claim 13, the combination of Zhong, Wang, Agashe, Begleiter, Thorhallsson, and Campbell teaches the artificial intelligence system of claim 10.  Wang teaches wherein the instructions cause the at least one processor to perform a temporal convolution in a deep neural network to map observed sensory sequence information from the plurality of history windows to symbols. (Wang, Page 2 Section 2 Sentence 1, discloses “As shown in Figure 1, genCNN is overall recursive, consisting of CNN-based processing units of two types:  alphaCNN as the ‘front-end’, dealing with the history that is closest to the prediction; betaCNNs (which can repeat), in charge of more ‘ancient’ history.”  Wang, Section 2, defines each of the history windows as Convolutional Neural Networks (alpha-CNN and beta-CNN), which “capture the temporal structure” of the sequence.  Wang Page 3 Figure 2 and Page 5 Figure 4 illustrate that each CNN makes a prediction of the next word (i.e., symbol)).  
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Wang with the combination of Zhong, Agashe, Begleiter, Thorhallsson, and Campbell for at least the reasons recited in Claim 10.

Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Zhong, Wang, Agashe, Begleiter, Thorhallsson, and Campbell, further in view of Pan.  
As per Claim 14, the combination of Zhong, Wang, Agashe, Begleiter, Thorhallsson, and Campbell teaches the artificial intelligence system of claim 10 as well as temporal convolution and time steps in each of the plurality of history windows (see Rejection to Claim 10).  The combination of Zhong, Wang, Agashe, Begleiter, Thorhallsson, and Campbell fails to teach wherein the temporal convolution includes defining each of the plurality of history windows of events as a 2^k-by-n matrix, where 2^k is a number of time steps in each of the plurality of history windows and n is a number of events at each of the time steps, and applying a convolution that is an l-by-n matrix, where l is less than 2^k, wherein a set of the convolutions produces a new set of events.
Pan teaches wherein the [temporal] convolution includes defining each [of the plurality of history windows] of events as a (2^k)-by-n matrix, where 2^k is a number of [time steps in each of the plurality of history windows] and n is a number of events [at each of the time steps], and applying a convolution that is an l-by-n matrix, where l is less than 2^k, wherein a set of the convolutions produces a new set of events. (Pan Para [0054] discloses an “n*m feature matrix” in which one dimension “m is the quantity of sub time periods” (i.e., time steps) and the other dimension “n is the quantity of feature types” (i.e., events).  Pan Para [0060] discloses that  “The input layer inputs each sample (n*m feature matrix) to a convolutional layer for convolution.”  Pan further discloses “convolution kernel quantity and size can be specified as needed” (Examiner’s note:  a convolution kernel is a matrix) wherein “at least one of a row quantity or a column quantity of the convolution kernel can be a predetermined quantity of feature types”, which is analogous to the instant application, where “n” is the column dimension for both the 2^k-by-n matrix and the l-by-n matrix (Pan describes an “n-by-m feature matrix”, and “a size of each convolution kernel can be n*j”).  Pan further discloses that the remaining dimension of the convolution kernel must be less than the corresponding dimension of the feature matrix (“size of each convolution kernel can be n*j, where j is a positive integer less than m”).  This is analogous to the instant application, where l must be less than 2^k.  Pan Para [0060] further states that “convolutional layer can output 100s feature graphs.” (i.e. the output is a new set of events)).
Zhong, Wang, Agashe, Begleiter, and Pan are analogous art because they are all in the field of machine learning.  
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Pan with the combination of Zhong, Wang, Agashe, and Begleiter to include a convolution kernel wherein at least one of a row quantity or a column quantity of the convolution kernel can be a predetermined quantity of feature types (i.e., events).  One would have been motivated to do so “because feature types (i.e. events) are not continuous, the convolution kernel does not need to be scanned in a distribution direction of each feature type” (Pan, Para [0060], Sentence 3).

Claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over Zhong in view of Wang, Agashe, Begleiter, Thorhallsson, and Pan.  
As per Claim 15, Zhong teaches an artificial intelligence system, comprising: a computing device including at least one processor, one or more data storage devices, and a non-transitory data storage medium interfaced with the at least one processor, the non-transitory data storage medium containing instructions that, when executed cause the at least one processor to (Zhong, Abstract, discloses:  “This model employs neural network architecture incorporating a predictive sensory module based on an RNNPB (Recurrent Neural Network with Parametric Biases) and a horizontal product model.”  Here, Zhong discloses an artificial intelligence system.  Zhong, last sentence of Section 1, discloses:  “Therefore it is sufficient to serve as a neural model that is running in a robot CPU to explain the language development in the cortical areas.” Here, Zhong discloses a computing device including at least one processor (a robot with a CPU), which also implies one or more data storage devices, and a non-transitory data storage medium interfaced with the at least one processor, the non-transitory data storage medium containing instructions that cause the at least one processor to execute.)
save observed sensory sequence information (Zhong, Page 3 Lines 4-10, discloses:  “A ball was attached at the end of the arm so that another robot could obtain the movement by tracking the ball. Our neural network was trained and run on the other NAO, which was called the “observer”. In this way, the observer robot perceived the object movement from its vision passively, so that its network took the object’s visual features and the movements into account”.  Here, Zhong discloses observed sensory sequence information, as the “observer robot” “perceived the object movement from its vision”, wherein “perceived” is another word for observed, and “vision” is sensory information, and “movement” implies this is over time, and thus a sequence of observed images.  Zhong, Section 2.2 also discloses “error back-propagated from a certain sensory information sequence to the PB units”.  While it is likely implied that the information is saved, Zhong explicitly recites saving it (“storing”) in Conclusion Para 2 Lines 4-5: “These PB units allow for storing multiple sensory sequences.”)
Zhong fails to teach that the observed sensory sequence information is saved, as an original alphabet in a plurality of history windows, the plurality of history windows being reverse chronological history windows. Zhong also fails to teach wherein a size of the plurality of history windows increase exponentially, wherein the plurality of history windows do not overlap; [Page 6 of 17; Patent Application No. 15/888,619; Atty. Docket No.: SON1-PAU03; Amendment dated December 9, 2021; Response to Office Action of September 13, 2021] apply a function to the observed sensory sequence information in each history window, wherein the function maps the observed sensory sequence information into a fixed set of discrete classes, fixed for all of the plurality of history windows; choose at least one hyperparameter for each of the plurality of history window to allow the system to trade off bias-variance; perform a temporal convolution in a deep neural network to map observed sensory sequence information from the plurality of history windows to the abstract alphabet, wherein the temporal convolution includes defining each of the plurality of history windows of 
Wang teaches that the [observed sensory] sequence information is saved as an original alphabet, in a plurality of history windows, the plurality of history windows being reverse chronological history windows. (Recall above Zhong discloses the sequence information is observed sensory sequence information.  Wang, Page 2 Section 2 Sentence 1, discloses “As shown in Figure 1, genCNN is overall recursive, consisting of CNN-based processing units of two types:  alphaCNN as the ‘front-end’, dealing with the history that is closest to the prediction; betaCNNs (which can repeat), in charge of more ‘ancient’ history.”  Wang, Page 2 Section 2 Last Sentence, discloses “Also distinct from RNN, genCNN gains most of its processing power from the heavy-duty processing units (i.e., αCNN and βCNNs), which follow a bottom-up information flow and yet can adequately capture the temporal structure in word sequence with its convolutional-gating architecture.”  Wang discloses saving sequence information as an original alphabet in a plurality of history windows in Page 2 Figure 1:

[Figure: media_image1.png (Wang, Page 2 Figure 1)]

Here, one can see in the “history” row, the English words, which are part of the “original alphabet” (the finite set of words in the English language as written in Latin characters).
Examiner’s Note:  Here, alphaCNN and betaCNN are the plurality of history windows, and they are saving word sequence information. These windows go back in time progressively further, and are thus reverse chronological history windows.)
Wang also teaches apply a function to the observed [sensory] sequence information in each history window, wherein the function maps the observed [sensory] sequence information into a fixed set of discrete classes regardless of the size of each of the plurality of history windows and the fixed set of discrete classes resulting in an abstract alphabet (Recall above Zhong discloses the sequence information is observed sensory sequence information.  For the following, see Wang Page 2 Figure 1 and Page 5 Figure 4:

[Figure: media_image2.png (Wang, Page 2 Figure 1 and Page 5 Figure 4)]

Wang, Page 5 Section 3.3, discloses:  “As suggested early on in Section 2 and Figure 1, we use extra CNNs with conventional weight-sharing, named βCNN, to summarize the history out of scope of αCNN. More specifically, the output of βCNN (with the same dimension of word-embedding) is put before the first word as the input to the αCNN, as illustrated in Figure 4. Different from αCNN, βCNN is designed just to summarize the history, with weight shared across its convolution units.”  Here, Wang discloses to apply a function (“αCNN” and “βCNN”) to the observed sequence information in each history window.  The function maps the sequence information into a fixed set of discrete classes (“word embedding”, as in “the output of βCNN (with the same dimension of word-embedding)”).  Note that the “fixed set of discrete classes” here is the words of the English language as represented by “word embedding”, in which words are represented by numeric values in a vector.  Wang Page 6 Section 4 discloses:  “The parameters of a genCNN consists of the parameters for CNN word-embedding”, and on Page 7 Paragraph 1, discloses:  “We use word embedding with dimension 100” and in Para 3:  “For initialization, we use Word2Vec (Mikolov et al., 2013) for the starting state of the word-embeddings”.  There is a finite number of words in the English language, and thus English words represented by word embeddings are a fixed set of discrete classes that do not depend on the size of the βCNN windows.  These embeddings are an abstract representation of English words, as opposed to the more common original representation with Latin script.)
	Wang also teaches perform a temporal convolution in a deep neural network to map observed [sensory] sequence information from the plurality of history windows to symbols. (Wang, Page 2 Section 2 Sentence 1, discloses “As shown in Figure 1, genCNN is overall recursive, consisting of CNN-based processing units of two types:  alphaCNN as the ‘front-end’, dealing with the history that is closest to the prediction; betaCNNs (which can repeat), in charge of more ‘ancient’ history.”  Wang, Section 2, defines each of the history windows as Convolutional Neural Networks (alpha-CNN and beta-CNN), which “capture the temporal structure” of the sequence.  Wang Page 3 Figure 2 and Page 5 Figure 4 illustrate that each CNN makes a prediction of the next word (i.e., symbol)).  
	Wang also teaches make a prediction in the original alphabet based on patterns in the abstract alphabet (Wang, as shown above in the figure showing Fig. 1 and Fig. 4, makes a prediction in the original alphabet (“prediction:  ‘sandwich’”) based on patterns in the abstract alphabet (based on the outputs of the βCNNs being input into the αCNN)).
Zhong and Wang are analogous art because they are both in the field of machine learning.  
Zhong teaches a method to make a prediction based on historical sequence information, as stated in Zhong Page 6 Section 3.3:  “In this simulation, the obtained PB units from the previous recognition experiment were used to generate the predicted movements using the prior knowledge of a specific object.”  Wang also teaches a method of making a prediction based on historical sequence information, as stated in Wang Abstract:  “Instead, we use a convolutional neural network to predict the next word with the history of words of variable length”; however, Wang adds a plurality of history windows.  Zhong discloses using an RNN, as stated in Zhong, Page 1 Abstract:  “This model employs neural network architecture incorporating a predictive sensory module based on an RNNPB (Recurrent Neural Network with Parametric Biases)”, and Wang discloses replacing RNN with their genCNN model, as stated in Wang, Page 1 Abstract:  “Different from previous work on neural network-based language modeling and generation (e.g., RNN or LSTM), we choose not to greedily summarize the history of words as a fixed length vector. Instead, we use a convolutional neural network”.  Applying Wang’s replacement of RNN with genCNN to Zhong would make it possible to make a sensory sequence prediction based on a plurality of sensory sequence information windows.  
The combination of Zhong and Wang fails to teach wherein a size of the plurality of history windows increase exponentially, wherein the plurality of history windows do not overlap; choose at least one hyperparameter for each of the plurality of history window to allow the system to trade off bias-variance; wherein the temporal convolution includes defining each of the plurality of history windows of events as a ($2^k$)-by-n matrix, where $2^k$ is a number of time steps in each of the plurality of history windows and n is a number of events at each of the time steps, and applying a convolution that is an l-by-n matrix, where l is less than $2^k$, wherein a set of the convolutions produces a new set of events; apply a context tree weighting algorithm to the abstract alphabet resulting from the fixed set of discrete classes for each of the plurality of history windows to predict a future discrete sequence.
Agashe teaches wherein a size of the plurality of history windows increase exponentially, wherein the plurality of history windows do not overlap. (Agashe Para [0076] discloses “FIGS. 5A and 5B illustrate an example linear buffer 500 representing a time window comprising exponentially-decaying time intervals and a series of associated counters in accordance with an embodiment of the invention. The time window and counters may track the occurrence of a particular type of event involving a particular asset. In FIG. 5A, element 501 represents the most recent 1-minute time interval beginning at 12:44 and ending at 12:45, element 502 represents a previous 2-minute time interval beginning at 12:42 and ending at 12:44, element 503 represents a previous 4-minute time interval beginning at 12:38 and ending at 12:42, and element 504 represents a previous 8-minute time interval beginning at 12:30 and ending at 12:38. Each element stores a counter associated with the time interval representing the element. At 12:30, the time window was created and the linear buffer was initialized. In FIG. 5A, approximately 15 minutes have elapsed between the creation of the time window and the current time, 12:44:23. A counter associated with the element 501 has been incremented 7 times, indicating that 7 events have occurred in the time interval represented by the element 501. A counter associated with the element 502 has been incremented 34 times, indicating that 34 events have occurred in the time interval represented by the element 502. A counter associated with the element 503 has been incremented 50 times, indicating that 50 events have occurred in the time interval representing the element 503. A counter associated with the element 504 has been incremented 72 times, indicating that 72 events have occurred in the time interval representing the element 504.”)

[Figure: media_image3.png (greyscale image accompanying the Agashe Para [0076] citation above)]


Zhong, Wang, and Agashe are analogous art because they are all in the field of machine learning. 
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Agashe with the combination of Zhong and Wang to include saving sequence information in exponentially increasing history windows.  One would have been motivated to do so in order to be able to make more accurate predictions based on more data, while saving on storage resources (Agashe, Para [0075]:  “Exponentially decaying time intervals may allow precise tracking of recent events while conserving storage capacity by passing a record of the occurrence of events to progressively larger time intervals as time progresses and the events become less recent”). 
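For illustration only, the window layout quoted from Agashe Para [0076] (non-overlapping intervals of 1, 2, 4, and 8 minutes, going back in time from 12:45) can be sketched as follows. This is a minimal sketch and not Agashe’s implementation; the function name exponential_windows, the parameter base_minutes, and the calendar date are hypothetical and chosen only for the example.

```python
from datetime import datetime, timedelta


def exponential_windows(now, num_windows, base_minutes=1):
    """Return (start, end) pairs, most recent window first.

    Window i spans base_minutes * 2**i minutes, and each window ends
    exactly where the next (older) one begins, so windows never overlap.
    """
    windows = []
    end = now
    for i in range(num_windows):
        start = end - timedelta(minutes=base_minutes * 2 ** i)
        windows.append((start, end))
        end = start  # the older window ends where this one starts
    return windows


# Hypothetical date; only the clock times come from the quoted example.
for start, end in exponential_windows(datetime(2021, 1, 1, 12, 45), 4):
    print(start.strftime("%H:%M"), "to", end.strftime("%H:%M"))
```

With four windows ending at 12:45, the sketch yields the intervals 12:44 to 12:45, 12:42 to 12:44, 12:38 to 12:42, and 12:30 to 12:38, matching elements 501 through 504 in the quoted passage.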
The combination of Zhong, Wang, and Agashe fails to teach choose at least one hyperparameter for each of the plurality of history window to allow the system to trade off bias-variance; wherein the temporal convolution includes defining each of the plurality of history windows of events as a ($2^k$)-by-n matrix, where $2^k$ is a number of time steps in each of the plurality of history windows and n is a number of events at each of the time steps, and applying a convolution that is an l-by-n matrix, where l is less than $2^k$, wherein a set of 
Begleiter teaches and apply a context tree weighting algorithm to an [abstract] alphabet [resulting from the fixed set of discrete classes for each of the plurality of history windows] (Recall above that Wang teaches an abstract alphabet resulting from the fixed set of discrete classes for each of the plurality of history windows.  Begleiter, Page 397 Para 2-3, discloses:  “The second method we consider is Volf's ‘decomposed CTW’, denoted here by ‘DE-CTW’ (Volf, 2002). The DE-CTW uses a tree-based hierarchical decomposition of the multi-valued prediction problem into binary problems. Each of the binary problems is solved via a slight variation of the binary CTW algorithm. Let Σ be an alphabet with size k = |Σ|.”  Here, Begleiter discloses applying a context tree weighting algorithm (“DE-CTW”) to an alphabet (“Let Σ be an alphabet with size k = |Σ|”).)
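For illustration only, the idea of decomposing a multi-valued prediction problem into binary problems can be sketched as follows. This is not Begleiter’s or Volf’s DE-CTW: it omits context weighting entirely, and it assumes a fixed balanced binary code over the alphabet and a Krichevsky-Trofimov (KT) estimator at each binary node. The class names KTNode and BinaryDecomposition are hypothetical, and the alphabet size is assumed to be a power of two so that the symbol probabilities sum to one.

```python
class KTNode:
    """Binary Krichevsky-Trofimov estimator over the bits seen at one node."""

    def __init__(self):
        self.counts = [0, 0]  # number of 0 bits and 1 bits observed

    def prob(self, bit):
        # KT estimate: (count of this bit + 1/2) / (total count + 1)
        return (self.counts[bit] + 0.5) / (sum(self.counts) + 1.0)

    def update(self, bit):
        self.counts[bit] += 1


class BinaryDecomposition:
    """Predict symbols of a multi-valued alphabet via a tree of binary problems."""

    def __init__(self, alphabet):
        self.alphabet = list(alphabet)
        # Bits per symbol under a fixed balanced code (an assumption of this sketch).
        self.bits = max(1, (len(self.alphabet) - 1).bit_length())
        self.nodes = {}  # one KT estimator per bit prefix

    def _code(self, symbol):
        idx = self.alphabet.index(symbol)
        return [(idx >> (self.bits - 1 - i)) & 1 for i in range(self.bits)]

    def prob(self, symbol):
        # Symbol probability is the product of the binary decisions on its path.
        p, prefix = 1.0, ()
        for bit in self._code(symbol):
            p *= self.nodes.setdefault(prefix, KTNode()).prob(bit)
            prefix += (bit,)
        return p

    def update(self, symbol):
        prefix = ()
        for bit in self._code(symbol):
            self.nodes.setdefault(prefix, KTNode()).update(bit)
            prefix += (bit,)


model = BinaryDecomposition("abcd")
model.update("a")
print({s: model.prob(s) for s in "abcd"})
```

After observing a single "a", the sketch assigns "a" a higher probability than the other three symbols while the four probabilities still sum to one.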
Zhong, Wang, Agashe, and Begleiter are analogous art because they are all in the field of machine learning.  
The combination of Zhong, Wang, and Agashe is a sequence prediction algorithm (recall in the motivation statement for the combination of Wang and Zhong that both of these works are concerned with sequence prediction), and Begleiter’s DE-CTW algorithm is also concerned with sequence prediction (Page 397 Para 2: “The DE-CTW uses a tree-based hierarchical decomposition of the multi-valued prediction problem into binary problems.”). Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the 
The combination of Zhong, Wang, Agashe, and Begleiter fails to teach choose at least one hyperparameter for each of the plurality of history window to allow the system to trade off bias-variance; wherein the temporal convolution includes defining each of the plurality of history windows of events as a ($2^k$)-by-n matrix, where $2^k$ is a number of time steps in each of the plurality of history windows and n is a number of events at each of the time steps, and applying a convolution that is an l-by-n matrix, where l is less than $2^k$, wherein a set of the convolutions produces a new set of events.
Thorhallsson teaches choose at least one hyperparameter [for each of the plurality of history windows] to allow the system to trade off bias-variance.  (Thorhallsson, Intro Para 2, discloses that “This leads to a fundamental tradeoff known as the bias-variance tradeoff which is of paramount importance for optimal choice of the hyperparameters for the learning algorithms”).  *Wang, Section 2 Sentence 1, discloses “As shown in Figure 1, genCNN is overall recursive, consisting of CNN-based processing units of two types:  alphaCNN as the ‘front-end’, dealing with the history that is closest to the prediction; betaCNNs (which can repeat), in charge of more ‘ancient’ history” (i.e., a plurality of history windows comprising CNNs, and CNNs are a machine learning model and therefore have a bias-variance tradeoff).
Zhong, Wang, Agashe, Begleiter, and Thorhallsson are analogous art because they are all in the field of machine learning.  
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Thorhallsson with the combination of Zhong, Wang, Agashe, and Begleiter to include selecting the optimal choice of hyperparameters, for which the bias-variance tradeoff is of paramount importance.  One would have been motivated to do so to “capture sophisticated relationships in the data while keeping it simple to prevent noise from affecting the outcome” (Thorhallsson, Page 1 Intro, Paragraph 2).
The combination of Zhong, Wang, Agashe, Begleiter, and Thorhallsson fails to teach wherein the temporal convolution includes defining each of the plurality of history windows of events as a ($2^k$)-by-n matrix, where $2^k$ is a number of time steps in each of the plurality of history windows and n is a number of events at each of the time steps, and applying a convolution that is an l-by-n matrix, where l is less than $2^k$, wherein a set of the convolutions produces a new set of events.
Pan teaches wherein the [temporal] convolution includes defining each [of the plurality of history windows] of events as a (2^k)-by-n matrix, where 2^k is a number of [time steps in each of the plurality of history windows] and n is a number of events [at each of the time steps], and applying a convolution that is an l-by-n matrix, where l is less than 2^k, wherein a set of the convolutions produces a new set of events. (Pan Para [0054] discloses an “n*m feature matrix” in which one dimension “m is the quantity of sub time periods” (i.e., time steps) and the other dimension “n is the quantity of feature types” (i.e., events).  Pan Para [0060] discloses that “The input layer inputs each sample (n*m feature matrix) to a convolutional layer for convolution.”  Pan further discloses “convolution kernel quantity and size can be specified as needed” (Examiner’s note:  a convolution kernel is a matrix) wherein “at least one of a row quantity or a column quantity of the convolution kernel can be a predetermined quantity of feature types”, which is analogous to the instant application, where “n” is the column dimension for both the 2^k-by-n matrix and the l-by-n matrix (Pan describes an “n-by-m feature matrix”, and “a size of each convolution kernel can be n*j”).  Pan further discloses that the remaining dimension of the convolution kernel must be less than the corresponding dimension of the feature matrix (“size of each convolution kernel can be n*j, where j is a positive integer less than m”).  This is analogous to the instant application, where l must be less than 2^k.  Pan Para [0060] further states that “convolutional layer can output 100s feature graphs.” (i.e., the output is a new set of events).)  
*Wang teaches temporal convolutions in a plurality of history windows:  Wang, Section 2 Sentence 1, discloses “As shown in Figure 1, genCNN is overall recursive, consisting of CNN-based processing units of two types:  alphaCNN as the ‘front-end’, dealing with the history that is closest to the prediction; betaCNNs (which can repeat), in charge of more ‘ancient’ history.”  Wang, Section 2, defines each of the history windows, which comprise time steps, as Convolutional Neural Networks (alpha-CNN and beta-CNN), which “capture the temporal structure” of the sequence.  
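For illustration only, the dimensional relationship discussed above (a window of 2^k time steps by n event types, convolved with kernels of l by n where l is less than 2^k) can be sketched as follows. This is a minimal sketch and not the implementation of Pan, Wang, or the instant application; the function name temporal_convolution and the example values are hypothetical. Because each kernel spans all n event types, it slides only along the time axis.

```python
def temporal_convolution(window, kernels):
    """Slide each l-by-n kernel along the time axis of a T-by-n window.

    Returns a (T - l + 1)-by-len(kernels) list of lists: one new "event"
    value per kernel at each valid time position.
    """
    T, n = len(window), len(window[0])
    l = len(kernels[0])
    if l >= T:
        raise ValueError("kernel length l must be less than the window length")
    for kernel in kernels:
        if len(kernel) != l or any(len(row) != n for row in kernel):
            raise ValueError("every kernel must be l-by-n")
    output = []
    for t in range(T - l + 1):
        output.append([
            sum(kernel[i][j] * window[t + i][j]
                for i in range(l) for j in range(n))
            for kernel in kernels
        ])
    return output


# Hypothetical window with 2^k = 4 time steps (k = 2) and n = 3 event types.
window = [[1, 0, 0],
          [0, 1, 0],
          [0, 0, 1],
          [1, 1, 0]]
kernels = [
    [[1, 1, 1], [1, 1, 1]],    # total events over two consecutive steps
    [[1, 0, 0], [-1, 0, 0]],   # change in the first event type
]
print(temporal_convolution(window, kernels))  # [[2, 1], [2, 0], [3, -1]]
```

Here l = 2 is less than 2^k = 4, and the two kernels together produce a new set of two derived events at each of the three valid time positions.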
Zhong, Wang, Agashe, Begleiter, Thorhallsson, and Pan are analogous art because they are all in the field of machine learning.  
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Pan with the combination of Zhong, Wang, Agashe, Begleiter, and Thorhallsson to include a convolution kernel wherein at least one of a row quantity or a column quantity of the convolution kernel can be a predetermined quantity of feature types (i.e., events).  One would have been motivated to do so “because feature types (i.e. events) are not continuous, the convolution kernel does not need to be scanned in a distribution direction of each feature type” (Pan, Para [0060], Sentence 3).

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Bellemare et al. (“Skip Context Tree Switching”) discloses using context tree weighting for sequence prediction.
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  

Any inquiry concerning this communication or earlier communications from the examiner should be directed to LEONARD A SIEGER whose telephone number is (571)272-9710.  The examiner can normally be reached on M-F 8:00 am - 5:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ann Lo, can be reached on (571) 272-9767.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access 
/L.A.S./Examiner, Art Unit 2126    
/ANN J LO/Supervisory Patent Examiner, Art Unit 2126