Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 2021-08-13 is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.
Response to Amendment
The amendment filed 2021-08-13 has been entered. Claims 1-15 remain pending in the application. Applicant’s amendments to the claims overcome each and every objection previously set forth in the Final Office Action mailed 2021-07-12.
Response to Arguments
Applicant's argument that the amendment to the claim overcomes Holliday has been fully considered and is persuasive.  However, an inspection of the Specification has determined that it raises the issue of new matter.  Fig. 3 appears to show Windows labelled Y, X, Z, X as non-overlapping, but the specification fails to provide language to explicitly state that these windows are limited in such a way.  The Specification in several places only mentions “from a last observed time step”, which could be interpreted to simply mean “now”, and the windows all start from the current time, which is the last observed time.  There is no indication that the “last observed time step” moves with the windows.  Furthermore, the amendment is rendered moot, as Examiner’s acquiescence to some of Applicant’s arguments below necessitated a new 
Applicant’s argument that there is no reason to combine the exponential overlapping windows of Holliday with the non-overlapping windows of Wang has been considered and is persuasive.  Accordingly, the rejection has been withdrawn and new art has been applied.
Applicant’s argument that Heaton does not save observed sensory sequence information has been fully considered but is not persuasive.  Heaton clarifies in Col 4 Lines 40-43, “The sequence of sensory inputs to complete this action is saved by each sensor into memory”. However, this is moot as the rejection has been withdrawn for other reasons and new art has been applied.
Applicant’s argument that there is no motivation to combine Heaton with Wang has been considered and is persuasive.  While one may be able to broadly interpret Heaton’s “learning mode” as “machine learning”, it does not contain actual training, nor make predictions based on historical information.  Therefore, it is not reasonable that one would combine the teachings of Wang with Heaton.  Accordingly, the rejection has been withdrawn and new art has been applied.
Applicant’s argument that Thorhallsson does not teach hyperparameters for each of the plurality of history windows has been fully considered but is not persuasive.  Applicant argues that they are not attacking references individually, but Examiner respectfully disagrees.  Examiner relied on Thorhallsson to teach the concept of hyperparameters to allow the system to trade off bias-variance.  Examiner’s mapping clearly indicates that Wang is relied upon for the plurality of history windows, and each history window comprises a machine learning model, 

Specification
The title of the invention is not descriptive.  A new title is required that is clearly indicative of the invention to which the claims are directed. 

Claim Rejections - 35 USC § 112
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.

The following is a quotation of the first paragraph of pre-AIA  35 U.S.C. 112:
The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor of carrying out his invention.

Claims 1, 10, and 15 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA ), first paragraph, as failing to comply with the written description requirement. The claim(s) contains subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA  35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention.   The amendment recites time windows that do 
Dependent claims 2-9 and 11-14 are rejected because they inherit the deficiencies of independent claims 1 and 10.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 3, 4, and 7 are rejected under 35 U.S.C. 103 as being unpatentable over Zhong et al. (“Toward a self-organizing pre-symbolic neural model representing sensorimotor primitives”; hereinafter “Zhong”) in view of Wang et al. (“genCNN: A Convolutional Architecture for Word Sequence Prediction”; hereinafter “Wang”), Agashe et al. (US 2014/0095412 A1; hereinafter “Agashe”), and Begleiter et al. (“On Prediction Using Variable Order Markov Models”; hereinafter “Begleiter”).

As per claim 1, Zhong teaches an artificial intelligence system, comprising: a computing device including at least one processor, one or more data storage devices, and a non-transitory data storage medium interfaced with the at least one processor, the non-transitory data storage medium containing instructions that, when executed cause the at least one processor to (Zhong, Abstract, discloses:  “This model employs neural network architecture incorporating a predictive sensory module based on an RNNPB (Recurrent Neural Network with Parametric Biases) and a horizontal product model.”  Here, Zhong discloses an artificial intelligence system.  Zhong, last sentence of Section 1, discloses:  “Therefore it is sufficient to serve as a neural model that is running in a robot CPU to explain the language development in the cortical areas.” Here, Zhong discloses a computing device including at least one processor (a robot with a CPU), which also implies one or more data storage devices, and a non-transitory data storage medium interfaced with the at least one processor, the non-transitory data storage medium containing instructions that cause the at least one processor to execute.)
save observed sensory sequence information (Zhong, Page 3 Lines 4-10, discloses:  “A ball was attached at the end of the arm so that another robot could obtain the movement by tracking the ball. Our neural network was trained and run on the other NAO, which was called the “observer”. In this way, the observer robot perceived the object movement from its vision passively, so that its network took the object’s visual features and the movements into account”.  Here, Zhong discloses observed sensory sequence information, as the “observer robot” “perceived the object movement from its vision”, wherein “perceived” is another word for observed, and “vision” is sensory information, and “movement” implies this is over time, and thus a sequence of observed images.  Zhong, Section 2.2 also discloses “error back-propagated from a certain sensory information sequence to the PB units”.  While saving of the information is likely implied, Zhong explicitly recites saving (“storing”) it in Conclusion Para 2 Lines 4-5: “These PB units allow for storing multiple sensory sequences.”)
Zhong fails to teach that the sequence information is saved in a plurality of history windows, the plurality of history windows being reverse chronological history windows. Zhong also fails to teach wherein a size of the plurality of history windows increase exponentially, starting from a last observed time step of the previous history window so that the windows do not overlap; apply a function to the observed sensory sequence information in each history window, wherein the function maps the observed sensory sequence information into a fixed set of discrete classes; and apply a context tree weighting algorithm to an alphabet resulting from the fixed set of discrete classes for each of the plurality of history windows to predict a future discrete sequence.
Wang teaches that the sequence information is saved in a plurality of history windows, the plurality of history windows being reverse chronological history windows. (Wang, Section 2 Sentence 1, discloses “As shown in Figure 1, genCNN is overall recursive, consisting of CNN-based processing units of two types:  alphaCNN as the ‘front-end’, dealing with the history that is closest to the prediction; betaCNNs (which can repeat), in charge of more ‘ancient’ history.”  Wang, Section 2 Last Sentence, discloses “Also distinct from RNN, genCNN gains most of its processing power from the heavy-duty processing units (i.e., alphaCNN and betaCNNs), which follow a bottom-up information flow and yet can adequately capture the temporal structure in word sequence with its convolutional-gating architecture.”  Examiner’s Note:  Here, alphaCNN and betaCNN are the plurality of history windows, and they are saving word sequence information. These windows go back in time progressively further, and are thus reverse chronological history windows.)
Zhong and Wang are analogous art because they are both in the field of machine learning.  
Zhong teaches a method to make a prediction based on historical sequence information, as stated in Zhong Section 3.3:  “In this simulation, the obtained PB units from the previous recognition experiment were used to generate the predicted movements using the prior knowledge of a specific object.”  Wang also teaches a method of making a prediction based on historical sequence information as stated in Wang Abstract:  “Instead, we use a convolutional neural network to predict the next word with the history of words of variable length”, however Wang adds a plurality of history windows.  Zhong discloses using an RNN as stated in Zhong, Abstract:  “This model employs neural network architecture incorporating a predictive sensory module based on an RNNPB (Recurrent Neural Network with Parametric Biases)”, and Wang discloses replacing RNN with their genCNN model as stated in Wang, Abstract:  “Different from previous work on neural network-based language modeling and generation (e.g., RNN or LSTM), we choose not to greedily summarize the history of words as a fixed length vector. Instead, we use a convolutional neural network”.  To apply Wang’s replacement of RNN with genCNN to Zhong would result in being able to make a sensory sequence prediction based on a plurality of 
The combination of Zhong and Wang fails to teach wherein a size of the plurality of history windows increase exponentially, starting from a last observed time step of the previous history window so that the windows do not overlap. The combination of Zhong and Wang also fails to teach apply a function to the observed sensory sequence information in each history window, wherein the function maps the observed sensory sequence information into a fixed set of discrete classes; and apply a context tree weighting algorithm to an alphabet resulting from the fixed set of discrete classes for each of the plurality of history windows to predict a future discrete sequence.
Agashe teaches wherein a size of the plurality of history windows increase exponentially, starting from a last observed time step of the previous history window so that the windows do not overlap (Agashe Para [0076] discloses “FIGS. 5A and 5B illustrate an example linear buffer 500 representing a time window comprising exponentially-decaying time intervals and a series of associated counters in accordance with an embodiment of the invention. The time window and counters may track the occurrence of a particular type of event involving a particular asset. In FIG. 5A, element 501 represents the most recent 1-minute time interval beginning at 12:44 and ending at 12:45, element 502 represents a previous 2-minute time interval beginning at 12:42 and ending at 12:44, element 503 represents a previous 4-minute time interval beginning at 12:38 and ending at 12:42, and element 504 represents a previous 8-minute time interval beginning at 12:30 and ending at 12:38. Each element stores a counter associated with the time interval representing the element. At 12:30, the time window was created and the linear buffer was initialized. In FIG. 5A, approximately 15 minutes have elapsed between the creation of the time window and the current time, 12:44:23. A counter associated with the element 501 has been incremented 7 times, indicating that 7 events have occurred in the time interval represented by the element 501. A counter associated with the element 502 has been incremented 34 times, indicating that 34 events have occurred in the time interval represented by the element 502. A counter associated with the element 503 has been incremented 50 times, indicating that 50 events have occurred in the time interval representing the element 503. A counter associated with the element 504 has been incremented 72 times, indicating that 72 events have occurred in the time interval representing the element 504.”)

Zhong, Wang, and Agashe are analogous art because they are all in the field of machine learning.  Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Agashe with the combination of Zhong and Wang to include saving sequence information in exponentially increasing history windows.  One would have been motivated to do so in order to be able to make more accurate predictions based on more data, while saving on storage resources (Agashe, Para [0075]:  “Exponentially decaying time intervals may allow precise tracking of recent events while conserving storage capacity by passing a record of the occurrence of events to progressively larger time intervals as time progresses and the events become less recent”). 
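For illustration only, the interval layout quoted from Agashe Para [0076] (1, 2, 4, and 8 minutes, each interval ending where the more recent one begins) can be sketched as follows. The function name, its parameters, and the per-minute labels are hypothetical and appear in none of the cited references; this is a minimal sketch of non-overlapping, exponentially growing windows, not Agashe's implementation.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def reverse_chronological_windows(history: Sequence[T], num_windows: int) -> List[List[T]]:
    """Split `history` (oldest item first) into windows of size 1, 2, 4, ...

    Window 0 holds the most recent item. Each later window ends exactly
    where the previous window begins, so no item falls in two windows.
    (Hypothetical helper, written only to illustrate the layout.)
    """
    windows: List[List[T]] = []
    end = len(history)  # exclusive upper bound of the next window
    for k in range(num_windows):
        size = 2 ** k
        start = max(0, end - size)
        if start == end:  # history exhausted
            break
        windows.append(list(history[start:end]))
        end = start
    return windows


# One label per minute from 12:30 to 12:44, oldest first (assumed sample data).
minutes = [f"12:{m}" for m in range(30, 45)]
windows = reverse_chronological_windows(minutes, 4)
```

With this sample data, the four windows begin at 12:44, 12:42, 12:38, and 12:30 respectively, matching the start times of elements 501 through 504 in the quoted passage.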
The combination of Zhong, Wang, and Agashe fails to teach apply a function to the observed sensory sequence information in each history window, wherein the function maps the observed sensory sequence information into a fixed set of discrete classes; and apply a context tree weighting algorithm to an alphabet resulting from the fixed set of discrete classes for each of the plurality of history windows to predict a future discrete sequence.
Begleiter teaches apply a function to the observed [sensory] sequence information [in each history window], wherein the function maps the observed [sensory] sequence information into a fixed set of discrete classes (Begleiter, Section 3.4 Paragraph 2, discloses, for sequence information (“each alphabet symbol”), a function to map the information into a fixed set of discrete classes (“concatenating binary words of size k, one for each alphabet symbol”), wherein the function is converting the symbols to binary words, and these “binary words” comprise a fixed set of 2^k discrete classes.  *Zhong discloses that the sequence information is sensory:   Zhong, Conclusion Para 2 Lines 4-5: “These PB units allow for storing multiple sensory sequences.”)  *Wang discloses saving sequence information in a plurality of history windows:  Wang, Section 2 Sentence 1, discloses “As shown in Figure 1, genCNN is overall recursive, consisting of CNN-based processing units of two types:  alphaCNN as the ‘front-end’, dealing with the history that is closest to the prediction; betaCNNs (which can repeat), in charge of more ‘ancient’ history.”
and apply a context tree weighting algorithm to an alphabet resulting from the fixed set of discrete classes [for each of the plurality of history windows] to predict a future discrete sequence (Begleiter, Section 3.4 Paragraph 2, discloses a “standard binary ctw algorithm over a binary representation of the sequence”.  Begleiter, in the Abstract, identifies CTW as a sequence prediction algorithm: “prediction algorithms, including Context Tree Weighting (CTW)”.  Therefore, Begleiter discloses apply a context tree weighting algorithm to predict a future discrete sequence.  This is applied to an alphabet resulting from the fixed set of discrete classes, where this alphabet is the “binary representation” of Begleiter’s “each alphabet symbol” (which was previously mapped to “sequence information”), as the “binary representation” is resulting from the fixed set of discrete classes, which as shown above is the “binary words of size k”, comprising 2^k discrete classes.  To clarify, the “alphabet” is resulting from the “fixed set of discrete classes” because it is the “fixed set of discrete classes” in binary form.   *Wang discloses saving sequence information in a plurality of history windows:  Wang, Section 2 Sentence 1, discloses “As shown in Figure 1, genCNN is overall recursive, consisting of CNN-based processing units of two types:  alphaCNN as the ‘front-end’, dealing with the history that is closest to the prediction; betaCNNs (which can repeat), in charge of more ‘ancient’ history.”)
Zhong, Wang, Agashe, and Begleiter are analogous art because they are all in the field of machine learning.  Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Begleiter with the combination of Zhong, Wang, and Agashe to include applying a CTW algorithm to a fixed set of discrete classes mapped from an alphabet.  One would have been motivated to do so for the purpose of “extending the ctw algorithm for large alphabets” (Begleiter, Section 3.4, Paragraph 1).
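For illustration only, the binary representation step cited from Begleiter, Section 3.4 (“concatenating binary words of size k, one for each alphabet symbol”) can be sketched as follows. The helper names, the four-symbol alphabet, and the choice of the smallest k that covers the alphabet are assumptions made for this example; the sketch covers only the symbol-to-bits mapping and does not implement the CTW algorithm itself.

```python
from typing import Dict, List, Sequence


def build_codebook(alphabet: Sequence[str]) -> Dict[str, str]:
    """Assign each symbol a fixed-width binary word of k bits.

    k is chosen here as the smallest width that can index every symbol
    (an assumption for this sketch), so at most 2**k classes are used.
    """
    k = max(1, (len(alphabet) - 1).bit_length())
    return {sym: format(i, f"0{k}b") for i, sym in enumerate(alphabet)}


def to_binary_sequence(sequence: Sequence[str], codebook: Dict[str, str]) -> List[int]:
    """Concatenate the k-bit words of the symbols into one bit sequence."""
    return [int(bit) for sym in sequence for bit in codebook[sym]]


codebook = build_codebook(["a", "b", "c", "d"])  # k = 2, so 2**2 = 4 classes
bits = to_binary_sequence("abca", codebook)      # [0, 0, 0, 1, 1, 0, 0, 0]
```

The resulting bit sequence is the kind of input over which a binary predictor could then be run.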

As per Claim 3, the combination of Zhong, Wang, Agashe, and Begleiter teaches the artificial intelligence system of claim 1 as shown above, as well as wherein the function is a feature-wise maximum over time steps in one or more of the plurality of history windows. (Wang, Section 2 Sentence 1, discloses “As shown in Figure 1, genCNN is overall recursive, consisting of CNN-based processing units of two types:  alphaCNN as the ‘front-end’, dealing with the history that is closest to the prediction; betaCNNs (which can repeat), in charge of more ‘ancient’ history.”  Wang, Section 3.2 Line 1, discloses another embodiment in which “Previous CNNs, including those for NLP tasks (Hu et al., 2014; Kalchbrenner et al., 2014), take a straightforward convolution-pooling strategy, in which the ‘fusion’ decisions (e.g., selecting the largest one in max-pooling) are made based on the values of feature-maps”.  Examiner’s Note:  Here, Wang is describing a plurality of history windows alphaCNN and betaCNN that comprise time steps.  Each of these history windows comprises a convolutional neural network (CNN).  These CNNs comprise a “convolution-pooling strategy” (i.e., function), such as “selecting the largest one” (i.e., maximum) “based on the values of feature-maps” (i.e. feature-wise). This function is done in each CNN, and is thus applied over time steps in the plurality of history windows). 
Zhong, Agashe, Begleiter, and Wang are analogous art because they are all in the field of machine learning.  Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine this teaching of Wang with the existing combination of Zhong, Agashe, Begleiter, and the primary teaching of Wang to include max-pooling (selecting the largest one) based on the values of feature maps.  One would have been motivated to do so because it is a “straightforward” strategy (Wang, Section 3.2 Line 1).
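For illustration only, a feature-wise maximum over the time steps of a single history window can be sketched as follows. The function name and the sample values are hypothetical; the sketch shows the general max-pooling idea referenced above (“selecting the largest one”) and is not taken from Wang.

```python
from typing import List, Sequence


def featurewise_max(window: Sequence[Sequence[float]]) -> List[float]:
    """Return, for each feature, its maximum over all time steps.

    `window` is a list of time steps, each a list of feature values of
    equal length (hypothetical layout assumed for this sketch).
    """
    if not window:
        raise ValueError("window must contain at least one time step")
    # zip(*window) regroups the values by feature instead of by time step.
    return [max(column) for column in zip(*window)]


# Three time steps, three features per step (assumed sample data).
window = [
    [0.1, 0.9, 0.0],
    [0.4, 0.2, 0.3],
    [0.3, 0.5, 0.7],
]
pooled = featurewise_max(window)  # [0.4, 0.9, 0.7]
```

The output has one value per feature regardless of how many time steps the window holds, which is why such a function yields a fixed-size summary for windows of different lengths.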

As per Claim 4, the combination of Zhong, Wang, Agashe, and Begleiter teaches the artificial intelligence system of claim 3 as shown above, as well as wherein the observed sensory sequence information is a binary event (Begleiter, Section 3.1, Paragraph 1, discloses “In this section we consider the original ctw algorithm for binary alphabets”.  *Zhong, Conclusion Para 2 Lines 4-5, discloses that the information is sensory: “These PB units allow for storing multiple sensory sequences.”)
Zhong, Wang, Agashe, and Begleiter are analogous art because they are all in the field of machine learning.  Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Begleiter with the combination of Zhong, Wang, and Agashe to include applying a CTW algorithm to a fixed set of discrete classes mapped from an alphabet.  One would have been motivated to do so for the purpose of “extending the ctw algorithm for large alphabets” (Begleiter, Section 3.4, Paragraph 1).


As per Claim 7, the combination of Zhong, Wang, Agashe, and Begleiter teaches the artificial intelligence system of claim 1 as shown above, as well as wherein the instructions cause the at least one processor to perform a temporal convolution in a deep neural network to map observed sensory sequence information from the plurality of history windows to symbols. (Wang, Section 2 Sentence 1, discloses “As shown in Figure 1, genCNN is overall recursive, consisting of CNN-based processing units of two types:  alphaCNN as the ‘front-end’, dealing with the history that is closest to the prediction; betaCNNs (which can repeat), in charge of more ‘ancient’ history.”  Wang, Section 2, defines each of the history windows as Convolutional Neural Networks (alpha-CNN and beta-CNN), which “capture the temporal structure” of the sequence.  Wang Figures 2 and 4 illustrate that each CNN makes a prediction of the next word (i.e., symbol).  Zhong, last sentence of Section 1, discloses a processor:  “Therefore it is sufficient to serve as a neural model that is running in a robot CPU to explain the language development in the cortical areas.” Zhong, Conclusion Para 2 Lines 4-5, discloses that the information is sensory: “These PB units allow for storing multiple sensory sequences.”)  
Zhong and Wang are analogous art because they are both in the field of machine learning.  
Zhong teaches a method to make a prediction based on historical sequence information, as stated in Zhong Section 3.3:  “In this simulation, the obtained PB units from the previous recognition experiment were used to generate the predicted movements using the prior knowledge of a specific object.”  Wang also teaches a method of making a prediction based on historical sequence information as stated in Wang Abstract:  “Instead, we use a convolutional neural network to predict the next word with the history of words of variable length”, however Wang adds a plurality of history windows.  Zhong discloses using an RNN as stated in Zhong, Abstract:  “This model employs neural network architecture incorporating a predictive sensory module based on an RNNPB (Recurrent Neural Network with Parametric Biases)”, and Wang discloses replacing RNN with their genCNN model as stated in Wang, Abstract:  “Different from previous work on neural network-based language modeling and generation (e.g., RNN or LSTM), we choose not to greedily summarize the history of words as a fixed length vector. Instead, we use a convolutional neural network”.  To apply Wang’s replacement of RNN with genCNN to Zhong would result in being able to make a sensory sequence prediction based on a plurality of sensory sequence information windows.  Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the .

Claim 2 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Zhong, Wang, Agashe, and Begleiter, further in view of Thorhallsson et al. (“Visualizing the Bias-Variance Tradeoff”; hereinafter, “Thorhallsson”).
As per Claim 2, the combination of Zhong, Wang, Agashe, and Begleiter teaches the artificial intelligence system of claim 1 as shown above.  However, the combination of Zhong, Wang, Agashe, and Begleiter fails to teach wherein the instructions cause the at least one processor to choose at least one hyperparameter for each of the plurality of history windows to allow the system to trade off bias-variance.
Thorhallsson teaches wherein [the instructions cause the at least one processor] to choose at least one hyperparameter [for each of the plurality of history windows] to allow the system to trade off bias-variance.  (Thorhallsson, Intro Para 2, discloses that “This leads to a fundamental tradeoff known as the bias-variance tradeoff which is of paramount importance for optimal choice of the hyperparameters for the learning algorithms”).  * Zhong, last sentence of Section 1, discloses a processor:  “Therefore it is sufficient to serve as a neural model that is running in a robot CPU to explain the language development in the cortical areas.” *Wang, Section 2 Sentence 1, discloses “As shown in Figure 1, genCNN is overall recursive, consisting of CNN-based processing units of two types:  alphaCNN as the ‘front-end’, dealing with the history that is closest to the prediction; betaCNNs (which can repeat), in charge of more ‘ancient’ history” (i.e., a plurality of history windows comprising CNNs, and CNNs are a machine learning model and therefore have a bias-variance tradeoff).
 Zhong, Wang, Agashe, Begleiter, and Thorhallsson are analogous art because they are all in the field of machine learning.  Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Thorhallsson with the combination of Zhong, Wang, Agashe, and Begleiter to include selecting the optimal choice of hyperparameters, for which the bias-variance tradeoff is of paramount importance.  One would have been motivated to do so to “capture sophisticated relationships in the data while keeping it simple to prevent noise from affecting the outcome” (Thorhallsson, Intro, Paragraph 2).
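For context only, the bias-variance tradeoff referred to above is commonly written as the following decomposition of expected squared prediction error. The symbols are assumptions introduced for this note and do not come from Thorhallsson: the target is modeled as y = f(x) + ε with zero-mean noise of variance σ², and \hat{f} denotes the learned predictor, with expectations taken over training sets and noise.

```latex
\mathbb{E}\!\left[\big(y - \hat{f}(x)\big)^{2}\right]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^{2}}_{\text{bias}^{2}}
  + \underbrace{\mathbb{E}\!\left[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^{2}\right]}_{\text{variance}}
  + \underbrace{\sigma^{2}}_{\text{irreducible noise}}
```

Under this reading, a hyperparameter that makes a model more flexible tends to lower the bias term while raising the variance term, which is the tradeoff a hyperparameter choice would balance.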

Claims 5 and 6 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Zhong, Wang, Agashe, and Begleiter, further in view of Campbell et al. (US PGPub US 2019/0012371 A1; hereinafter, “Campbell”).

As per Claim 5, the combination of Zhong, Wang, Agashe, and Begleiter teaches the artificial intelligence system of claim 1 as shown above.  However, the combination of Zhong, Wang, Agashe, and Begleiter fails to teach wherein the instructions cause the at least one processor to use a deep neural network classifier to map arbitrary length histories to a second alphabet having smaller length than the alphabet of the arbitrary length histories as an input sequence for the context tree weighting algorithm.
Campbell teaches wherein [the instructions cause the at least one processor] to use a deep neural network classifier to map [arbitrary length] histories to a second alphabet having smaller length than the alphabet of the [arbitrary length] histories [as an input sequence for the context tree weighting algorithm]. (Campbell Para [0077] discloses “The belief tracker component 408 is configured to identify what table(s) of the database and what column(s) of the tables of the database are being implicated by the user's utterance. In particular, belief tracker component 408 implements a neural network (e.g., a recurrent neural network) that is configured to map dialog history to belief states. A belief state is a distribution over user goals and dialog states (e.g., context). The output of the belief tracker is an encoding of both the current user utterance and the history of utterances of user-system utterances. The user goal is related to: one or more tables of the database and their respective columns of metadata, such as names and data types; and vocabulary of columns (e.g., slots). The belief tracker component is configured to receive the feature vector as an input from the feature extractor component 404, concatenate the feature vector with the encoded dialog history that was generated by the context encoding component 406, and produce a probability distribution vector over the columns of the multiple tables of the database 410.”  Examiner’s Note:  A recurrent neural network is a type of deep neural network.  Campbell is using this to “map” (i.e., classify) “dialog history” to “belief states”.  The “belief states” are a smaller alphabet than the “user utterances” that comprise the “dialog history”, as they are “columns of the multiple tables of the database”).  
* Zhong, last sentence of Section 1, discloses a processor:  “Therefore it is sufficient to serve as a neural model that is running in a robot CPU to explain the language development in the cortical areas.” *Wang discloses arbitrary length histories with the repeating betaCNN windows:  Wang, Section 2 Sentence 1, discloses “As shown in Figure 1, genCNN is overall recursive, consisting of CNN-based processing units of two types:  alphaCNN as the ‘front-end’, dealing with the history that is closest to the prediction; betaCNNs (which can repeat), in charge of more ‘ancient’ history.”  *Begleiter discloses mapping an alphabet before using as input to a CTW algorithm:  Begleiter, Section 3.4 Paragraph 2, discloses “concatenating binary words of size k, one for each alphabet symbol” and uses this for “application of the standard binary ctw algorithm over a binary representation of the sequence”.  
Zhong, Wang, Agashe, Begleiter, and Campbell are analogous art because they are all in the field of machine learning.  Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Campbell with the combination of Zhong, Wang, Agashe, and Begleiter to include mapping sequence histories to a smaller set of states.  One would have been motivated to do so to extract underlying context or meaning from a sequence, as Campbell states:  “map dialog history to belief states” (Campbell Para [0077]).

As per Claim 6, the combination of Zhong, Wang, Agashe, Begleiter, and Campbell teaches the artificial intelligence system of claim 5 as shown above, as well as wherein a long short-term memory-based sequence to symbol method is used to map the arbitrary length histories to a second alphabet having smaller length than the alphabet of the arbitrary length histories. (Campbell Para [0077] discloses “The belief tracker component 408 is configured to identify what table(s) of the database and what column(s) of the tables of the database are being implicated by the user's utterance. In particular, belief tracker component 408 implements a neural network (e.g., a recurrent neural network) that is configured to map dialog history to belief states. A belief state is a distribution over user goals and dialog states (e.g., context). The output of the belief tracker is an encoding of both the current user utterance and the history of utterances of user-system utterances. The user goal is related to: one or more tables of the database and their respective columns of metadata, such as names and data types; and vocabulary of columns (e.g., slots). The belief tracker component is configured to receive the feature vector as an input from the feature extractor component 404, concatenate the feature vector with the encoded dialog history that was generated by the context encoding component 406, and produce a probability distribution vector over the columns of the multiple tables of the database 410.”  Examiner’s Note:  A recurrent neural network is a type of deep neural network.  Campbell is using this to “map” (i.e., classify) “dialog history” to “belief states”.  
The “belief states” are a smaller alphabet than the “user utterances” that comprise the “dialog history”, as they are “columns of the multiple tables of the database.”  Furthermore, Campbell, Para [0071] first sentence, discloses that “dialog manager component 412 and belief tracker component 408 are trained via the supervised learning component 602.” Campbell Para [0074] then discloses that “In some embodiments of the present invention, the supervised learning component 602 can be represented using a variety of suitable techniques, such as for example, multiplayer perceptron (MLP) representation, gated recurrent unit (GRU) representation, long-short term memory (LSTM) representation, and/or a memory network representation”).
Zhong, Wang, Agashe, Begleiter, and Campbell are analogous art because they are all in the field of machine learning.  Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Campbell with the combination of Zhong, Wang, Agashe, and Begleiter to include mapping sequence histories to a smaller set of states.  One would have been motivated to do so to extract underlying context or meaning from a sequence, as Campbell states:  “map dialog history to belief states” (Campbell Para [0077]).

Claims 8 and 9 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Zhong, Wang, Agashe, and Begleiter, further in view of Pan et al. (US PGPub US 2020/0211106 A1; hereinafter, “Pan”).
As per Claim 8, the combination of Zhong, Wang, Agashe, and Begleiter teaches the artificial intelligence system of claim 7 as shown above.  However, the combination of Zhong, Wang, Agashe, and Begleiter fails to teach wherein the temporal convolution includes defining each of the plurality of history windows of events as a (2^k)-by-n matrix, where 2^k is a number of time steps in each of the plurality of history windows and n is a number of events at each of the time steps, and applying a convolution that is an l-by-n matrix, where l is less than 2^k, wherein a set of the convolutions produces a new set of events.
Pan teaches wherein the [temporal] convolution includes defining each [of the plurality of history windows] of events as a (2^k)-by-n matrix, where 2^k is a number of [time steps in each of the plurality of history windows] and n is a number of events [at each of the time steps], and applying a convolution that is an l-by-n matrix, where l is less than 2^k, wherein a set of the convolutions produces a new set of events. (Pan Para [0054] discloses an “n*m feature matrix” in which one dimension “m is the quantity of sub time periods” (i.e., time steps) and the other dimension “n is the quantity of feature types” (i.e., events).  Pan Para [0060] discloses that  “The input layer inputs each sample (n*m feature matrix) to a convolutional layer for convolution.”  Pan further discloses “convolution kernel quantity and size can be specified as needed” (Examiner’s note:  a convolution kernel is a matrix) wherein “at least one of a row quantity or a column quantity of the convolution kernel can be a predetermined quantity of feature types”, which is analogous to the instant application, where “n” is the column dimension for both the 2^k-by-n matrix and the l-by-n matrix (Pan describes an “n-by-m feature matrix”, and “a size of each convolution kernel can be n*j”).  Pan further discloses that the remaining dimension of the convolutional kernel must be less than the corresponding dimension of the feature matrix (“size of each convolution kernel can be n*j, where j is a positive integer less than m”).  This is analogous to the instant application, where l must be less than 2^k.  Pan Para [0060] further states that “convolutional layer can output 100s feature graphs.” (i.e. the output is a new set of events).  
*Wang teaches temporal convolutions in a plurality of history windows:  Wang, Section 2 Sentence 1, discloses “As shown in Figure 1, genCNN is overall recursive, consisting of CNN-based processing units of two types:  alphaCNN as the ‘front-end’, dealing with the history that is closest to the prediction; betaCNNs (which can repeat), in charge of more ‘ancient’ history.”  Wang, Section 2, defines each of the history windows, which comprise time steps, as Convolutional Neural Networks (alpha-CNN and beta-CNN), which “capture the temporal structure” of the sequence. 
Zhong, Wang, Agashe, Begleiter, and Pan are analogous art because they are all in the field of machine learning.  Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Pan with the combination of Zhong, Wang, Agashe, and Begleiter to include a convolution kernel wherein at least one of a row quantity or a column quantity of the convolution kernel can be a predetermined quantity of feature types (i.e., events).  One would have been motivated to do so “because feature types (i.e. events) are not continuous, the convolution kernel does not need to be scanned in a distribution direction of each feature type”. (Pan, Para [0060], Sentence 3)
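For illustration only, the claimed matrix arrangement can be sketched as follows. This sketch is not taken from Pan, Wang, or the instant application; the function name `temporal_convolution` and the sample values are hypothetical. It assumes the claim's orientation (rows are the 2^k time steps, columns are the n events) and an l-by-n kernel that spans every event column, so the kernel slides along the time axis only. As is conventional for neural networks, the operation is computed as a cross-correlation (no kernel flip).

```python
def temporal_convolution(window, kernel):
    """Slide an l-by-n kernel down a T-by-n window along the time axis only."""
    num_steps = len(window)        # T = 2^k time steps (rows)
    num_events = len(window[0])    # n events per time step (columns)
    kernel_len = len(kernel)       # l rows; must be less than T
    if kernel_len >= num_steps:
        raise ValueError("kernel length l must be less than the number of time steps")
    if any(len(row) != num_events for row in kernel):
        raise ValueError("kernel must span all n event columns")
    outputs = []
    # One output value per valid kernel position: T - l + 1 positions in total.
    for start in range(num_steps - kernel_len + 1):
        total = 0.0
        for dt in range(kernel_len):
            for e in range(num_events):
                total += window[start + dt][e] * kernel[dt][e]
        outputs.append(total)
    return outputs


# Hypothetical example: k = 2 (so 2^k = 4 time steps), n = 3 events, l = 2.
window = [
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
]
kernel = [
    [1, 0, 0],
    [0, 1, 0],
]
features = temporal_convolution(window, kernel)
print(features)  # [2.0, 1.0, 1.0]
```

Because the kernel covers all n columns, it is never scanned across event types, which corresponds to the point quoted from Pan Para [0060]. Applying several such kernels to the same window would yield several output sequences, one per kernel.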

As per Claim 9, the combination of Zhong, Wang, Agashe, Begleiter, and Pan teaches the artificial intelligence system of claim 8 as shown above.  The combination of Zhong, Wang, Agashe, Begleiter, and Pan further teaches wherein the convolution is applied to each of the plurality of history windows. (Wang, Section 2 Sentence 1, discloses “As shown in Figure 1, genCNN is overall recursive, consisting of CNN-based processing units of two types:  alphaCNN as the ‘front-end’, dealing with the history that is closest to the prediction; betaCNNs (which can repeat), in charge of more ‘ancient’ history.”  Wang Section 2 describes each of the history windows as Convolutional Neural Networks (CNNs)).
Zhong and Wang are analogous art because they are both in the field of machine learning.  
Zhong teaches a method to make a prediction based on historical sequence information, as stated in Zhong Section 3.3:  “In this simulation, the obtained PB units from the previous recognition experiment were used to generate the predicted movements using the prior knowledge of a specific object.”  Wang also teaches a method of making a prediction based on historical sequence information as stated in Wang Abstract:  “Instead, we use a convolutional neural network to predict the next word with the history of words of variable length”, however Wang adds a plurality of history windows.  Zhong discloses using an RNN as stated in Zhong, Abstract:  “This model employs neural network architecture incorporating a predictive sensory module based on an RNNPB (Recurrent Neural Network with Parametric Biases)”, and Wang discloses replacing RNN with their genCNN model as stated in Wang, Abstract:  “Different from previous work on neural network-based language modeling and generation (e.g., RNN or LSTM), we choose not to greedily summarize the history of words as a fixed length vector. Instead, we use a convolutional neural network”.  To apply Wang’s replacement of RNN with genCNN to Zhong would result in being able to make a sensory sequence prediction based on a plurality of sensory sequence information windows.  Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Zhong and Wang.  One would have been motivated to do so in order to make more accurate predictions by utilizing more historical data (Wang, Abstract:  “We argue that our .
 
Claims 10, 11, 12, and 13 are rejected under 35 U.S.C. 103 as being unpatentable over Zhong in view of Wang, Agashe, Begleiter, Thorhallsson, and Campbell.  
As per Claim 10, Zhong teaches an artificial intelligence system, comprising: a computing device including at least one processor, one or more data storage devices, and a non-transitory data storage medium interfaced with the at least one processor, the non-transitory data storage medium containing instructions that, when executed cause the at least one processor to (Zhong, Abstract, discloses:  “This model employs neural network architecture incorporating a predictive sensory module based on an RNNPB (Recurrent Neural Network with Parametric Biases) and a horizontal product model.”  Here, Zhong discloses an artificial intelligence system.  Zhong, last sentence of Section 1, discloses:  “Therefore it is sufficient to serve as a neural model that is running in a robot CPU to explain the language development in the cortical areas.” Here, Zhong discloses a computing device including at least one processor (a robot with a CPU), which also implies one or more data storage devices, that cause the at least one processor to execute.)
save observed sensory sequence information (Zhong, Page 3 Lines 4-10, discloses:  “A ball was attached at the end of the arm so that another robot could obtain the movement by tracking the ball. Our neural network was trained and run on the other NAO, which was called the “observer”. In this way, the observer robot perceived the object movement from its vision passively, so that its network took the object’s visual features and the movements into account”.  Here, Zhong discloses observed sensory sequence information, as the “observer robot” “perceived the object movement from its vision”, wherein “perceived” is another word for observed, and “vision” is sensory information, and “movement” implies this is over time, and thus a sequence of observed images.  Zhong, Section 2.2 also discloses “error back-propagated from a certain sensory information sequence to the PB units”.  While it is likely implied that the information is saved, Zhong explicitly recites saving (“storing”) it in Conclusion Para 2 Lines 4-5: “These PB units allow for storing multiple sensory sequences.”)
Zhong fails to teach that the sequence information is saved in a plurality of history windows, the plurality of history windows being reverse chronological history windows. Zhong also fails to teach wherein a size of the plurality of history windows increase exponentially, starting from a last observed time step of the previous history window so that the windows do not overlap; apply a function to the observed sensory sequence information in each history 
Wang teaches that the sequence information is saved in a plurality of history windows, the plurality of history windows being reverse chronological history windows. (Wang, Section 2 Sentence 1, discloses “As shown in Figure 1, genCNN is overall recursive, consisting of CNN-based processing units of two types:  alphaCNN as the ‘front-end’, dealing with the history that is closest to the prediction; betaCNNs (which can repeat), in charge of more ‘ancient’ history. ”  Wang, Section 2 Last Sentence, discloses “Also distinct from RNN, genCNN gains most of its processing power from the heavy-duty processing units (i.e., CNN and CNNs), which follow a bottom-up information flow and yet can adequately capture the temporal structure in word sequence with its convolutional-gating architecture.”  Examiner’s Note:  Here, alphaCNN and betaCNN are the plurality of history windows, and they are saving word sequence information. These windows go back in time progressively further, and are thus reverse chronological history windows).
Zhong and Wang are analogous art because they are both in the field of machine learning.  

The combination of Zhong and Wang fails to teach wherein a size of the plurality of history windows increase exponentially, starting from a last observed time step of the previous history window so that the windows do not overlap.  The combination of Zhong and Wang also fails to teach apply a function to the observed sensory sequence information in each history window, wherein the function maps the observed sensory sequence information into a fixed set of discrete classes, fixed for all of the plurality of history windows; choose at least one hyperparameter for each of the plurality of history windows to allow the system to trade off bias-variance; and apply a context tree weighting algorithm to an alphabet resulting from the fixed set of discrete classes for each of the plurality of history windows to predict a future discrete sequence; and use a deep neural network classifier map arbitrary length histories to a second alphabet having smaller length than the alphabet of the arbitrary length histories as an input sequence for the context tree weighting algorithm.
Agashe teaches wherein a size of the plurality of history windows increase exponentially, starting from a last observed time step of the previous history window so that the windows do not overlap. (Agashe Para [0076] discloses “FIGS. 5A and 5B illustrate an example linear buffer 500 representing a time window comprising exponentially-decaying time intervals and a series of associated counters in accordance with an embodiment of the invention. The time window and counters may track the occurrence of a particular type of event involving a particular asset. In FIG. 5A, element 501 represents the most recent 1-minute time interval beginning at 12:44 and ending at 12:45, element 502 represents a previous 2-minute time interval beginning at 12:42 and ending at 12:44, element 503 represents a previous 4-minute time interval beginning at 12:38 and ending at 12:42, and element 504 represents a previous 8-minute time interval beginning at 12:30 and ending at 12:38. Each element stores a counter associated with the time interval representing the element. At 12:30, the time window was created and the linear buffer was initialized. In FIG. 5A, approximately 15 minutes have elapsed between the creation of the time window and the current time, 12:44:23. A counter associated with the element 501 has been incremented 7 times, indicating that 7 events have occurred in the time interval represented by the element 501. A counter associated with the element 502 has been incremented 34 times, indicating that 34 events have occurred in the time interval represented by the element 502. A counter associated with the element 503 has been incremented 50 times, indicating that 50 events have occurred in the time interval representing the element 503. A counter associated with the element 504 has been incremented 72 times, indicating that 72 events have occurred in the time interval representing the element 504.”)
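For illustration only, the interval layout quoted from Agashe Para [0076] can be reproduced with a short sketch. The sketch is not taken from Agashe; the function name `exponential_windows`, its parameters, and the calendar date are hypothetical. It assumes each older window is twice as long as the one before it and ends exactly where the newer window begins, so that no two windows overlap.

```python
from datetime import datetime, timedelta


def exponential_windows(end, count, base_minutes=1):
    """Return `count` (start, end) intervals, newest first, each twice as long as the last."""
    windows = []
    window_end = end
    for i in range(count):
        length = timedelta(minutes=base_minutes * 2 ** i)  # 1, 2, 4, 8, ... minutes
        window_start = window_end - length
        windows.append((window_start, window_end))
        # The next (older) window ends where this one starts, so windows never overlap.
        window_end = window_start
    return windows


# The date is arbitrary; only the time of day (12:45) comes from the quoted example.
end = datetime(2000, 1, 1, 12, 45)
for start, stop in exponential_windows(end, 4):
    print(start.strftime("%H:%M"), "-", stop.strftime("%H:%M"))
# 12:44 - 12:45
# 12:42 - 12:44
# 12:38 - 12:42
# 12:30 - 12:38
```

The four printed intervals match elements 501 through 504 as described in the quoted paragraph: 1, 2, 4, and 8 minutes, covering 12:30 to 12:45 without gaps or overlap.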

Zhong, Wang, and Agashe are analogous art because they are all in the field of machine learning.  Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Agashe with the combination of Zhong and Wang to include saving sequence information in exponentially increasing history windows.  One would have been motivated to do so in order to be able to make more accurate predictions based on more data, while saving on storage resources 
The combination of Zhong, Wang, and Agashe fails to teach apply a function to the observed sensory sequence information in each history window, wherein the function maps the observed sensory sequence information into a fixed set of discrete classes, fixed for all of the plurality of history windows; and apply a context tree weighting algorithm to an alphabet resulting from the fixed set of discrete classes for each of the plurality of history windows to predict a future discrete sequence. The combination of Zhong, Wang, and Agashe also fails to teach choose at least one hyperparameter for each of the plurality of history windows to allow the system to trade off bias-variance; and use a deep neural network classifier map arbitrary length histories to a second alphabet having smaller length than the alphabet of the arbitrary length histories as an input sequence for the context tree weighting algorithm.
Begleiter teaches apply a function to the observed [sensory] sequence information [in each history window], wherein the function maps the observed [sensory] sequence information into a fixed set of discrete classes, fixed [for all of the plurality of history windows] (Begleiter, Section 3.4 Paragraph 2, discloses, for sequence information (“each alphabet symbol”), a function to map the information into a fixed set of discrete classes (“concatenating binary words of size k, one for each alphabet symbol”), wherein the function is converting the symbols to binary words, and these “binary words” comprise a fixed set of 2^k discrete classes.  * Zhong, Conclusion Para 2 Lines 4-5, discloses that the information is sensory: “These PB units allow for storing multiple sensory sequences.”)  *Wang discloses saving sequence information in a plurality of history windows:  Wang, Section 2 Sentence 1, discloses “As shown in Figure 1, genCNN is overall recursive, consisting of CNN-based processing units of two types:  alphaCNN as the ‘front-end’, dealing with the history that is closest to the prediction; betaCNNs (which can repeat), in charge of more ‘ancient’ history.”
and apply a context tree weighting algorithm to an alphabet resulting from the fixed set of discrete classes [for each of the plurality of history windows] to predict a future discrete sequence (Begleiter, Section 3.4 Paragraph 2, discloses a “standard binary ctw algorithm over a binary representation of the sequence”.  Begleiter, in the Abstract, identifies CTW as a sequence prediction algorithm: “prediction algorithms, including Context Tree Weighting (CTW)”.  Therefore, Begleiter discloses apply a context tree weighting algorithm to predict a future discrete sequence.  This is applied to an alphabet resulting from the fixed set of discrete classes, where this alphabet is the “binary representation” of Begleiter’s “each alphabet symbol” (which was previously mapped to “sequence information”), as the “binary representation” is resulting from the fixed set of discrete classes, which as shown above is the “binary words of size k”, comprising 2^k discrete classes.  To clarify, the “alphabet” is resulting from the “fixed set of discrete classes” because it is the “fixed set of discrete classes” in binary form.   *Wang discloses saving sequence information in a plurality of history windows:  Wang, Section 2 Sentence 1, discloses “As shown in Figure 1, genCNN is overall recursive, consisting of CNN-based processing units of two types:  alphaCNN as the ‘front-end’, dealing with the history that is closest to the prediction; betaCNNs (which can repeat), in charge of more ‘ancient’ history.”)
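For illustration only, the mapping of alphabet symbols to concatenated binary words of size k can be sketched as follows. The sketch is not taken from Begleiter; the function names `build_codebook` and `to_binary_sequence` are hypothetical, and assigning words in index order is an assumption made for the example. It shows only the re-encoding step that would precede a binary context tree weighting predictor, not the predictor itself.

```python
def build_codebook(alphabet):
    """Assign each symbol a fixed-width binary word; k bits give 2^k possible words."""
    # Smallest k such that 2^k >= len(alphabet), with a minimum of one bit.
    k = max(1, (len(alphabet) - 1).bit_length())
    codebook = {symbol: format(index, f"0{k}b") for index, symbol in enumerate(alphabet)}
    return codebook, k


def to_binary_sequence(sequence, codebook):
    """Concatenate the binary word of each symbol, one word per symbol."""
    return "".join(codebook[symbol] for symbol in sequence)


# Hypothetical four-symbol alphabet, so k = 2 and there are 2^2 = 4 binary words.
alphabet = ["a", "b", "c", "d"]
codebook, k = build_codebook(alphabet)
bits = to_binary_sequence("abcd", codebook)
print(k, codebook, bits)
# 2 {'a': '00', 'b': '01', 'c': '10', 'd': '11'} 00011011
```

The resulting bit string is a sequence over a binary alphabet, which is the form of input a standard binary CTW algorithm operates on.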

The combination of Zhong, Wang, Agashe, and Begleiter fails to teach choose at least one hyperparameter for each of the plurality of history windows to allow the system to trade off bias-variance.  The combination of Zhong, Wang, Agashe, and Begleiter also fails to teach and use a deep neural network classifier map arbitrary length histories to a second alphabet having smaller length than the alphabet of the arbitrary length histories as an input sequence for the context tree weighting algorithm.
Thorhallsson teaches choose at least one hyperparameter [for each of the plurality of history windows] to allow the system to trade off bias-variance.  (Thorhallsson, Intro Para 2, discloses that “This leads to a fundamental tradeoff known as the bias-variance tradeoff which is of paramount importance for optimal choice of the hyperparameters for the learning algorithms”).  *Wang, Section 2 Sentence 1, discloses “As shown in Figure 1, genCNN is overall recursive, consisting of CNN-based processing units of two types:  alphaCNN as the ‘front-end’, dealing with the history that is closest to the prediction; betaCNNs (which can repeat), in charge of more ‘ancient’ history” (i.e., a plurality of history windows comprising CNNs, and CNNs are a machine learning model and therefore have a bias-variance tradeoff).

The combination of Zhong, Wang, Agashe, Begleiter, and Thorhallsson fails to teach and use a deep neural network classifier map arbitrary length histories to a second alphabet having smaller length than the alphabet of the arbitrary length histories as an input sequence for the context tree weighting algorithm.  
Campbell teaches and use a deep neural network classifier to map [arbitrary length] histories to a second alphabet having smaller length than the alphabet of the [arbitrary length] histories [as an input sequence for the context tree weighting algorithm]. (Campbell Para [0077] discloses “The belief tracker component 408 is configured to identify what table(s) of the database and what column(s) of the tables of the database are being implicated by the user's utterance. In particular, belief tracker component 408 implements a neural network (e.g., a recurrent neural network) that is configured to map dialog history to belief states. A belief state is a distribution over user goals and dialog states (e.g., context). The output of the belief tracker is an encoding of both the current user utterance and the history of utterances of user-system utterances. The user goal is related to: one or more tables of the database and their respective columns of metadata, such as names and data types; and vocabulary of columns (e.g., slots). The belief tracker component is configured to receive the feature vector as an input from the feature extractor component 404, concatenate the feature vector with the encoded dialog history that was generated by the context encoding component 406, and produce a probability distribution vector over the columns of the multiple tables of the database 410.”  Examiner’s Note:  A recurrent neural network is a type of deep neural network.  Campbell is using this to “map” (i.e., classify) “dialog history” to “belief states”.  The “belief states” are a smaller alphabet than the “user utterances” that comprise the “dialog history”, as they are “columns of the multiple tables of the database”).  
*Wang discloses arbitrary length histories with the repeating betaCNN windows:  Wang, Section 2 Sentence 1, discloses “As shown in Figure 1, genCNN is overall recursive, consisting of CNN-based processing units of two types:  alphaCNN as the ‘front-end’, dealing with the history that is closest to the prediction; betaCNNs (which can repeat), in charge of more ‘ancient’ history.” *Begleiter discloses mapping an alphabet before using as input to a CTW algorithm:  Begleiter, Section 3.4 Paragraph 2, discloses “concatenating binary words of size k, one for each alphabet symbol” and uses this for “application of the standard binary ctw algorithm over a binary representation of the sequence”.  
Zhong, Wang, Agashe, Begleiter, Thorhallsson, and Campbell are analogous art because they are all in the field of machine learning.  Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Campbell with the combination of Zhong, Wang, Agashe, Begleiter, and Thorhallsson to include mapping sequence histories to a smaller set of states.  One would have been motivated to do so to extract underlying context or meaning from a sequence, as Campbell states:  “map dialog history to belief states” (Campbell Para [0077]).

As per Claim 11, the combination of Zhong, Wang, Agashe, Begleiter, Thorhallsson, and Campbell teaches the artificial intelligence system of claim 10 as shown above, as well as wherein a long short-term memory-based sequence to symbol method is used to map the arbitrary length histories to the minimal output symbol alphabet.  (Campbell Para [0077] discloses “The belief tracker component 408 is configured to identify what table(s) of the database and what column(s) of the tables of the database are being implicated by the user's utterance. In particular, belief tracker component 408 implements a neural network (e.g., a recurrent neural network) that is configured to map dialog history to belief states. A belief state is a distribution over user goals and dialog states (e.g., context). The output of the belief tracker is an encoding of both the current user utterance and the history of utterances of user-system utterances. The user goal is related to: one or more tables of the database and their respective columns of metadata, such as names and data types; and vocabulary of columns (e.g., slots). The belief tracker component is configured to receive the feature vector as an input from the feature extractor component 404, concatenate the feature vector with the encoded dialog history that was generated by the context encoding component 406, and produce a probability distribution vector over the columns of the multiple tables of the database 410.”  Examiner’s Note:  A recurrent neural network is a type of deep neural network.  Campbell is using this to “map” (i.e., classify) “dialog history” to “belief states”.  
The “belief states” are a smaller alphabet than the “user utterances” that comprise the “dialog history”, as they are “columns of the multiple tables of the database.”  Furthermore, Campbell, Para [0071] first sentence, discloses that “dialog manager component 412 and belief tracker component 408 are trained via the supervised learning component 602.” Campbell Para [0074] then discloses that “In some embodiments of the present invention, the supervised learning component 602 can be represented using a variety of suitable techniques, such as for example, multiplayer perceptron (MLP) representation, gated recurrent unit (GRU) representation, long-short term memory (LSTM) representation, and/or a memory network representation”). 
Zhong, Wang, Agashe, Begleiter, and Campbell are analogous art because they are all in the field of machine learning.  Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Campbell with the combination of Zhong, Wang, Agashe, and Begleiter to include mapping sequence histories to a smaller set of states.  One would have been motivated to do so to extract underlying context or meaning from a sequence, as Campbell states:  “map dialog history to belief states” (Campbell Para [0077]).

As per Claim 12, the combination of Zhong, Wang, Agashe, Begleiter, Thorhallsson, and Campbell teaches the artificial intelligence system of claim 10 as shown above, as well as wherein the observed sensory sequence information is a binary event. (Begleiter, Section 3.1, Paragraph 1, discloses “In this section we consider the original ctw algorithm for binary alphabets”.  Zhong, Conclusion Para 2 Lines 4-5, discloses that the information is sensory: “These PB units allow for storing multiple sensory sequences.”) 


As per Claim 13, the combination of Zhong, Wang, Agashe, Begleiter, Thorhallsson, and Campbell teaches the artificial intelligence system of claim 10 as shown above, as well as wherein the instructions cause the at least one processor to perform a temporal convolution in a deep neural network to map observed sensory sequence information from the plurality of history windows to symbols. (Wang, Section 2 Sentence 1, discloses “As shown in Figure 1, genCNN is overall recursive, consisting of CNN-based processing units of two types:  alphaCNN as the ‘front-end’, dealing with the history that is closest to the prediction; betaCNNs (which can repeat), in charge of more ‘ancient’ history.”  Wang, Section 2, defines each of the history windows as Convolutional Neural Networks (alpha-CNN and beta-CNN), which “capture the temporal structure” of the sequence.  Wang Figures 2 and 4 illustrate that each CNN makes a prediction of the next word (i.e., symbol).  Zhong, last sentence of Section 1, discloses a processor:  “Therefore it is sufficient to serve as a neural model that is running in a robot CPU to explain the language development in the cortical areas.” Zhong, Conclusion Para 2 Lines 4-5, discloses that the information is sensory: “These PB units allow for storing multiple sensory sequences.”)  

Claim 14 is rejected under 35 U.S.C. 103 as being unpatentable over the combination of Zhong, Wang, Agashe, Begleiter, Thorhallsson, and Campbell, further in view of Pan.  
As per Claim 14, the combination of Zhong, Wang, Agashe, Begleiter, Thorhallsson, and Campbell teaches the artificial intelligence system of claim 10 as shown above.  The combination of Zhong, Wang, Agashe, Begleiter, Thorhallsson, and Campbell fails to teach wherein the temporal convolution includes defining each of the plurality of history windows of events as a 2^k-by-n matrix, where 2^k is a number of time steps in each of the plurality of history windows and n is a number of events at each of the time steps, and applying a convolution that is an l-by-n matrix, where l is less than 2^k, wherein a set of the convolutions produces a new set of events.
Pan teaches wherein the [temporal] convolution includes defining each [of the plurality of history windows] of events as a (2^k)-by-n matrix, where 2^k is a number of [time steps in each of the plurality of history windows] and n is a number of events [at each of the time steps], and applying a convolution that is an l-by-n matrix, where l is less than 2^k, wherein a set of the convolutions produces a new set of events. (Pan Para [0054] discloses an “n*m feature matrix” in which one dimension “m is the quantity of sub time periods” (i.e., time steps) and the other dimension “n is the quantity of feature types” (i.e., events).  Pan Para [0060] discloses that  “The input layer inputs each sample (n*m feature matrix) to a convolutional layer for convolution.”  Pan further discloses “convolution kernel quantity and size can be specified as needed” (Examiner’s note:  a convolution kernel is a matrix) wherein “at least one of a row quantity or a column quantity of the convolution kernel can be a predetermined quantity of feature types”, which is analogous to the instant application, where “n” is the column dimension for both the 2^k-by-n matrix and the l-by-n matrix (Pan describes an “n-by-m feature matrix”, and “a size of each convolution kernel can be n*j”).  Pan further discloses that the remaining dimension of the convolution kernel must be less than the corresponding dimension of the feature matrix (“size of each convolution kernel can be n*j, where j is a positive integer less than m”).  This is analogous to the instant application, where l must be less than 2^k.  Pan Para [0060] further states that “convolutional layer can output 100s feature graphs.” (i.e. the output is a new set of events).  
*Wang teaches temporal convolutions in a plurality of history windows:  Wang, Section 2 Sentence 1, discloses “As shown in Figure 1, genCNN is overall recursive, consisting of CNN-based processing units of two types:  alphaCNN as the ‘front-end’, dealing with the history that is closest to the prediction; betaCNNs (which can repeat), in charge of more ‘ancient’ history.”  Wang, Section 2, defines each of the history windows, which comprise time steps, as Convolutional Neural Networks (alpha-CNN and beta-CNN), which “capture the temporal structure” of the sequence.  
Zhong, Wang, Agashe, Begleiter, Thorhallsson, Campbell, and Pan are analogous art because they are all in the field of machine learning.  Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Pan with the combination of Zhong, Wang, Agashe, Begleiter, 

Claim 15 is rejected under 35 U.S.C. 103 as being unpatentable over Zhong in view of Wang, Agashe, Begleiter, Thorhallsson, and Pan.  
As per Claim 15, Zhong teaches an artificial intelligence system, comprising: a computing device including at least one processor, one or more data storage devices, and a non-transitory data storage medium interfaced with the at least one processor, the non-transitory data storage medium containing instructions that, when executed cause the at least one processor to (Zhong, Abstract, discloses:  “This model employs neural network architecture incorporating a predictive sensory module based on an RNNPB (Recurrent Neural Network with Parametric Biases) and a horizontal product model.”  Here, Zhong discloses an artificial intelligence system.  Zhong, last sentence of Section 1, discloses:  “Therefore it is sufficient to serve as a neural model that is running in a robot CPU to explain the language development in the cortical areas.” Here, Zhong discloses a computing device including at least one processor (a robot with a CPU), which also implies one or more data storage devices, and a non-transitory data storage medium interfaced with the at least one processor, the non-transitory data storage medium containing instructions that cause the at least one processor to execute.)
Zhong, Page 3 Lines 4-10, discloses:  “A ball was attached at the end of the arm so that another robot could obtain the movement by tracking the ball. Our neural network was trained and run on the other NAO, which was called the “observer”. In this way, the observer robot perceived the object movement from its vision passively, so that its network took the object’s visual features and the movements into account”.  Here, Zhong discloses observed sensory sequence information, as the “observer robot” “perceived the object movement from its vision”, wherein “perceived” is another word for observed, “vision” is sensory information, and “movement” implies that this occurs over time and is thus a sequence of observed images.  Zhong, Section 2.2, also discloses “error back-propagated from a certain sensory information sequence to the PB units”.  While it is likely implied that the information is saved, Zhong explicitly recites saving it (“storing”) in Conclusion Para 2 Lines 4-5: “These PB units allow for storing multiple sensory sequences.”
Zhong fails to teach that the sequence information is saved in a plurality of history windows, the plurality of history windows being reverse chronological history windows, and perform a temporal convolution in a deep neural network to map observed sensory sequence information from the plurality of history windows to symbols.  Zhong also fails to teach wherein a size of the plurality of history windows increase exponentially from a last observed time step; apply a function to the observed sensory sequence information in each history window, wherein the function maps the observed sensory sequence information into a fixed set of discrete classes, fixed for all of the plurality of history windows; choose at least one parameter of the exponentially increasing history window size as a hyperparameter to allow 
Wang teaches that the sequence information is saved in a plurality of history windows, the plurality of history windows being reverse chronological history windows. (Wang, Section 2 Sentence 1, discloses “As shown in Figure 1, genCNN is overall recursive, consisting of CNN-based processing units of two types:  alphaCNN as the ‘front-end’, dealing with the history that is closest to the prediction; betaCNNs (which can repeat), in charge of more ‘ancient’ history.”  Wang, Section 2 Last Sentence, discloses “Also distinct from RNN, genCNN gains most of its processing power from the heavy-duty processing units (i.e., alphaCNN and betaCNNs), which follow a bottom-up information flow and yet can adequately capture the temporal structure in word sequence with its convolutional-gating architecture.”  Examiner’s Note:  Here, alphaCNN and betaCNN are the plurality of history windows, and they are saving word sequence information. These windows go back in time progressively further, and are thus reverse chronological history windows).
and perform a temporal convolution in a deep neural network to map observed sensory sequence information from the plurality of history windows to symbols (Wang, Section 2, defines each of the history windows as Convolutional Neural Networks (alpha-CNN and beta-CNN), which “capture the temporal structure” of the sequence.  Figures 2 and 4 illustrate that each CNN makes a prediction of the next word (i.e., symbol).  Zhong, Conclusion Para 2 Lines 4-5, discloses that the information is sensory: “These PB units allow for storing multiple sensory sequences.”)  
Zhong and Wang are analogous art because they are both in the field of machine learning.  
Zhong teaches a method to make a prediction based on historical sequence information, as stated in Zhong Section 3.3:  “In this simulation, the obtained PB units from the previous recognition experiment were used to generate the predicted movements using the prior knowledge of a specific object.”  Wang also teaches a method of making a prediction based on historical sequence information, as stated in Wang Abstract:  “Instead, we use a convolutional neural network to predict the next word with the history of words of variable length”; however, Wang adds a plurality of history windows.  Zhong discloses using an RNN, as stated in Zhong, Abstract:  “This model employs neural network architecture incorporating a predictive sensory module based on an RNNPB (Recurrent Neural Network with Parametric Biases)”, and Wang discloses replacing RNN with their genCNN model, as stated in Wang, Abstract:  “Different from previous work on neural network-based language modeling and generation (e.g., RNN or LSTM), we choose not to greedily summarize the history of words as a fixed length vector. Instead, we use a convolutional neural network”.  Applying Wang’s replacement of RNN with genCNN to Zhong would make it possible to make a sensory sequence prediction based on a plurality of sensory sequence information windows.  Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the 
The combination of Zhong and Wang fails to teach wherein a size of the plurality of history windows increase exponentially, starting from a last observed time step of the previous history window so that the windows do not overlap. The combination of Zhong and Wang also fails to teach apply a function to the observed sensory sequence information in each history window, wherein the function maps the observed sensory sequence information into a fixed set of discrete classes, fixed for all of the plurality of history windows; choose at least one parameter of the exponentially increasing history window size as a hyperparameter to allow the system to trade off bias-variance; apply a context tree weighting algorithm to an alphabet resulting from the fixed set of discrete classes for each of the plurality of history windows to predict a future discrete sequence; wherein the temporal convolution includes defining each of the plurality of history windows of events as a 2^k-by-n matrix, where 2^k is a number of time steps in each of the plurality of history windows and n is a number of events at each of the time steps, and applying a convolution that is an l-by-n matrix, where l is less than 2^k, wherein a set of the convolutions produces a new set of events.
Agashe teaches wherein a size of the plurality of history windows increase exponentially, starting from a last observed time step of the previous history window so that the windows do not overlap. (Agashe Para [0076] discloses “FIGS. 5A and 5B illustrate an example linear buffer 500 representing a time window comprising exponentially-decaying time intervals and a series of associated counters in accordance with an embodiment of the invention. The time window and counters may track the occurrence of a particular type of event involving a particular asset. In FIG. 5A, element 501 represents the most recent 1-minute time interval beginning at 12:44 and ending at 12:45, element 502 represents a previous 2-minute time interval beginning at 12:42 and ending at 12:44, element 503 represents a previous 4-minute time interval beginning at 12:38 and ending at 12:42, and element 504 represents a previous 8-minute time interval beginning at 12:30 and ending at 12:38. Each element stores a counter associated with the time interval representing the element. At 12:30, the time window was created and the linear buffer was initialized. In FIG. 5A, approximately 15 minutes have elapsed between the creation of the time window and the current time, 12:44:23. A counter associated with the element 501 has been incremented 7 times, indicating that 7 events have occurred in the time interval represented by the element 501. A counter associated with the element 502 has been incremented 34 times, indicating that 34 events have occurred in the time interval represented by the element 502. A counter associated with the element 503 has been incremented 50 times, indicating that 50 events have occurred in the time interval representing the element 503. A counter associated with the element 504 has been incremented 72 times, indicating that 72 events have occurred in the time interval representing the element 504.”)
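For illustration only, the interval layout quoted from Agashe above can be sketched in a few lines of Python. The function name, the half-open (start, end) convention, and the encoding of clock times as minutes since midnight are assumptions made for this sketch and do not appear in Agashe or in the instant application. The sketch only shows that each interval doubles in length and begins where the next more recent interval ends, so that the intervals are contiguous and do not overlap.

```python
def exponential_windows(last_step, num_windows):
    """Return half-open (start, end) ranges, most recent window first.

    Window i spans 2**i time steps and ends exactly where window i-1
    begins, so consecutive windows are contiguous and never overlap.
    """
    windows = []
    end = last_step
    for i in range(num_windows):
        start = end - 2 ** i
        windows.append((start, end))
        end = start  # the next (older) window ends where this one starts
    return windows


def as_clock(minute):
    """Format minutes since midnight as HH:MM (hypothetical helper)."""
    return f"{minute // 60:02d}:{minute % 60:02d}"


# 12:45 expressed as minutes since midnight (12 * 60 + 45 = 765).
windows = exponential_windows(last_step=765, num_windows=4)
labels = [f"{as_clock(s)}-{as_clock(e)}" for s, e in windows]
print(labels)
```

Under these assumptions the four ranges correspond to the 1-, 2-, 4-, and 8-minute intervals of elements 501 through 504 as described in the quoted paragraph.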

Zhong, Wang, and Agashe are analogous art because they are all in the field of machine learning.  Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Agashe with the combination of Zhong and Wang to include saving sequence information in exponentially increasing history windows.  One would have been motivated to do so in order to be able to make more accurate predictions based on more data, while saving on storage resources (Agashe, Para [0075]:  “Exponentially decaying time intervals may allow precise tracking of recent events while conserving storage capacity by passing a record of the occurrence of events to progressively larger time intervals as time progresses and the events become less recent”). 
The combination of Zhong, Wang, and Agashe fails to teach apply a function to the observed sensory sequence information in each history window, wherein the function maps the observed sensory sequence information into a fixed set of discrete classes, fixed for all of the plurality of history windows; apply a context tree weighting algorithm to an alphabet resulting from the fixed set of discrete classes for each of the plurality of history windows to predict a future discrete sequence. The combination of Zhong, Wang, and Agashe also fails to teach choose at least one parameter of the exponentially increasing history window size as a 
Begleiter teaches apply a function to the observed [sensory] sequence information [in each history window], wherein the function maps the observed [sensory] sequence information into a fixed set of discrete classes, fixed [for all of the plurality of history windows] (Begleiter, Section 3.4 Paragraph 2, discloses, for sequence information (“each alphabet symbol”), a function to map the information into a fixed set of discrete classes (“concatenating binary words of size k, one for each alphabet symbol”), wherein the function is converting the symbols to binary words, and these “binary words” comprise a fixed set of 2^k discrete classes.  *Zhong, Conclusion Para 2 Lines 4-5, discloses that the information is sensory: “These PB units allow for storing multiple sensory sequences.”)  *Wang discloses saving sequence information in a plurality of history windows:  Wang, Section 2 Sentence 1, discloses “As shown in Figure 1, genCNN is overall recursive, consisting of CNN-based processing units of two types:  alphaCNN as the ‘front-end’, dealing with the history that is closest to the prediction; betaCNNs (which can repeat), in charge of more ‘ancient’ history.”
and apply a context tree weighting algorithm to an alphabet resulting from the fixed set of discrete classes [for each of the plurality of history windows] to predict a future discrete sequence (Begleiter, Section 3.4 Paragraph 2, discloses a “standard binary ctw algorithm over a binary representation of the sequence”.  Begleiter, in the Abstract, identifies CTW as a sequence prediction algorithm: “prediction algorithms, including Context Tree Weighting (CTW)”.  Therefore, Begleiter discloses apply a context tree weighting algorithm to predict a future discrete sequence.  This is applied to an alphabet resulting from the fixed set of discrete classes, where this alphabet is the “binary representation” of Begleiter’s “each alphabet symbol” (which was previously mapped to “sequence information”), as the “binary representation” is resulting from the fixed set of discrete classes, which as shown above is the “binary words of size k”, comprising 2^k discrete classes.  To clarify, the “alphabet” is resulting from the “fixed set of discrete classes” because it is the “fixed set of discrete classes” in binary form.  *Wang discloses saving sequence information in a plurality of history windows:  Wang, Section 2 Sentence 1, discloses “As shown in Figure 1, genCNN is overall recursive, consisting of CNN-based processing units of two types:  alphaCNN as the ‘front-end’, dealing with the history that is closest to the prediction; betaCNNs (which can repeat), in charge of more ‘ancient’ history.”)
Zhong, Wang, Agashe, and Begleiter are analogous art because they are all in the field of machine learning.  Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Begleiter with the combination of Zhong, Wang, and Agashe to include applying a CTW algorithm to a fixed set of discrete classes mapped from an alphabet.  One would have been motivated to do so for the purpose of “extending the ctw algorithm for large alphabets” (Begleiter, Section 3.4, Paragraph 1).
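For illustration only, the binary-word mapping relied upon from Begleiter above can be sketched as follows. This is a minimal sketch of the encoding step alone and is not an implementation of the context tree weighting algorithm itself. The function names, the example alphabet, and the choice of k as the smallest word size that can distinguish every symbol are assumptions made for the sketch.

```python
def binary_words(alphabet):
    """Assign each symbol a distinct binary word of fixed size k.

    k is taken as the smallest size able to distinguish every symbol
    (an assumption of this sketch), giving at most 2**k discrete classes.
    """
    k = max(1, (len(alphabet) - 1).bit_length())
    codebook = {sym: format(i, f"0{k}b") for i, sym in enumerate(alphabet)}
    return codebook, k


def to_binary_sequence(sequence, codebook):
    """Concatenate one fixed-size binary word per symbol of the sequence."""
    return "".join(codebook[sym] for sym in sequence)


# Hypothetical five-symbol alphabet, so k = 3 and there are 2**3 = 8 classes.
codebook, k = binary_words(["a", "b", "c", "d", "e"])
bits = to_binary_sequence("cab", codebook)
print(k, codebook, bits)
```

A binary predictor would then operate on the resulting bit string, with every group of k bits standing for one symbol of the original sequence.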
The combination of Zhong, Wang, Agashe, and Begleiter fails to teach choose at least one hyperparameter for each of the plurality of history windows to allow the system to trade off bias-variance.  The combination of Zhong, Wang, Agashe, and Begleiter also fails to teach wherein the temporal convolution includes defining each of the plurality of history windows of events as a 2^k-by-n matrix, where 2^k is a number of time steps in each of the plurality of history windows and n is a number of events at each of the time steps, and applying a convolution that is an l-by-n matrix, where l is less than 2^k, wherein a set of the convolutions produces a new set of events.
Thorhallsson teaches choose at least one hyperparameter [for each of the plurality of history windows] to allow the system to trade off bias-variance.  (Thorhallsson, Intro Para 2, discloses that “This leads to a fundamental tradeoff known as the bias-variance tradeoff which is of paramount importance for optimal choice of the hyperparameters for the learning algorithms”).  *Wang, Section 2 Sentence 1, discloses “As shown in Figure 1, genCNN is overall recursive, consisting of CNN-based processing units of two types:  alphaCNN as the ‘front-end’, dealing with the history that is closest to the prediction; betaCNNs (which can repeat), in charge of more ‘ancient’ history” (i.e., a plurality of history windows comprising CNNs, and CNNs are a machine learning model and therefore have a bias-variance tradeoff).
 Zhong, Wang, Agashe, Begleiter, and Thorhallsson are analogous art because they are all in the field of machine learning.  Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Thorhallsson with the combination of Zhong, Wang, Agashe, and Begleiter to include selecting the optimal choice of hyperparameters, for which the bias-variance tradeoff is of 
The combination of Zhong, Wang, Agashe, Begleiter, and Thorhallsson fails to teach wherein the temporal convolution includes defining each of the plurality of history windows of events as a 2^k-by-n matrix, where 2^k is a number of time steps in each of the plurality of history windows and n is a number of events at each of the time steps, and applying a convolution that is an l-by-n matrix, where l is less than 2^k, wherein a set of the convolutions produces a new set of events.
Pan teaches wherein the [temporal] convolution includes defining each [of the plurality of history windows] of events as a (2^k)-by-n matrix, where 2^k is a number of [time steps in each of the plurality of history windows] and n is a number of events [at each of the time steps], and applying a convolution that is an l-by-n matrix, where l is less than 2^k, wherein a set of the convolutions produces a new set of events. (Pan Para [0054] discloses an “n*m feature matrix” in which one dimension “m is the quantity of sub time periods” (i.e., time steps) and the other dimension “n is the quantity of feature types” (i.e., events).  Pan Para [0060] discloses that “The input layer inputs each sample (n*m feature matrix) to a convolutional layer for convolution.”  Pan further discloses “convolution kernel quantity and size can be specified as needed” (Examiner’s note:  a convolution kernel is a matrix) wherein “at least one of a row quantity or a column quantity of the convolution kernel can be a predetermined quantity of feature types”, which is analogous to the instant application, where “n” is the column dimension for both the 2^k-by-n matrix and the l-by-n matrix (Pan describes an “n*m feature matrix”, and “a size of each convolution kernel can be n*j”).  Pan further discloses that the remaining dimension of the convolution kernel must be less than the corresponding dimension of the feature matrix (“size of each convolution kernel can be n*j, where j is a positive integer less than m”).  This is analogous to the instant application, where l must be less than 2^k.  Pan Para [0060] further states that “convolutional layer can output 100s feature graphs.” (i.e., the output is a new set of events).)
*Wang teaches temporal convolutions in a plurality of history windows:  Wang, Section 2 Sentence 1, discloses “As shown in Figure 1, genCNN is overall recursive, consisting of CNN-based processing units of two types:  alphaCNN as the ‘front-end’, dealing with the history that is closest to the prediction; betaCNNs (which can repeat), in charge of more ‘ancient’ history.”  Wang, Section 2, defines each of the history windows, which comprise time steps, as Convolutional Neural Networks (alpha-CNN and beta-CNN), which “capture the temporal structure” of the sequence.  
Zhong, Wang, Agashe, Begleiter, Thorhallsson, and Pan are analogous art because they are all in the field of machine learning.  Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Pan with the combination of Zhong, Wang, Agashe, Begleiter, and Thorhallsson to include a convolution kernel wherein at least one of a row quantity or a column quantity of the convolution kernel can be a predetermined quantity of feature types.  One would have been motivated to do so “because feature types (i.e. events) are not continuous, the convolution kernel does not need to be scanned in a distribution direction of each feature type”. (Pan, Para [0060], Sentence 3)
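For illustration only, the dimensional relationship discussed above (a 2^k-by-n window matrix convolved with l-by-n kernels, where l is less than 2^k) can be sketched in plain Python. The function name and the example values are assumptions made for this sketch and are taken from neither Pan nor the instant application. As is common in convolutional networks, the kernel is applied without flipping. Because each kernel spans all n event columns, it slides only along the time dimension.

```python
def temporal_convolution(window, kernels):
    """Slide each l-by-n kernel along the time axis of a 2**k-by-n window.

    window:  list of time steps, each a list of n event values.
    kernels: list of l-by-n matrices, with l smaller than the number of
             time steps in the window.
    Returns one row per valid kernel position and one column per kernel.
    """
    steps, n = len(window), len(window[0])
    l = len(kernels[0])
    if l >= steps:
        raise ValueError("kernel height l must be less than the window length")
    if any(len(kern) != l or any(len(row) != n for row in kern) for kern in kernels):
        raise ValueError("every kernel must be l-by-n")
    output = []
    for t in range(steps - l + 1):
        output.append([
            sum(kern[i][j] * window[t + i][j] for i in range(l) for j in range(n))
            for kern in kernels
        ])
    return output


# Hypothetical window with 2**2 = 4 time steps and n = 2 events per step.
window = [[1, 0], [0, 1], [1, 1], [0, 0]]
# Two hypothetical 2-by-2 kernels (l = 2, which is less than 4).
kernels = [[[1, 0], [0, 1]], [[1, 1], [1, 1]]]
new_events = temporal_convolution(window, kernels)
print(new_events)
```

Each column of the result corresponds to one kernel, so a set of kernels yields a new set of event values at each of the 2^k - l + 1 valid positions.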

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Wu et al. (“A Novel Sensory Mapping Design for Bipedal Walking on a Sloped Surface”) discloses making a prediction based on a historical sequence of sensory sequence data 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to LEONARD A SIEGER whose telephone number is (571)272-9710.  The examiner can normally be reached on M-F 8:00 am - 5:00 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ann Lo, can be reached on (571) 272-9767.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access 
/L.A.S./Examiner, Art Unit 2126
/LUIS A SITIRICHE/Primary Examiner, Art Unit 2126