DETAILED ACTION
This Action is in response to amendments and arguments filed 6 April 2022 for application 15/938411 filed 28 March 2018. Currently claims 1-5, 7-14, and 16-19 are pending. Claims 6, 15, and 20 have been canceled. Claim rejections under 35 USC 112(b) have been withdrawn in light of the amendments. 

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Response to Arguments
Applicant's arguments filed 6 April 2022 have been fully considered but they are not persuasive. 

Specifically, the Applicants Argue:
In response to the Office Action dated December 6, 2021, Applicants respectfully request reconsideration over claims as amended. Prior to entry of this Amendment, claims 1-20 were pending for examination, with claims 1, 12, and 17 being independent claims. In this paper, the independent claims have been amended with limitations found in the dependent claims 6, 15, and 20 considered allowable during the European stage of the prosecution of this application. …Claims, as amended, are directed to an apparatus for controlling a system using a neural network with a connectivity structure simplified based on the probability of subsequent occurrence of the events in “different sources of signals” in the system. The Office Action equates the sources in the system to the description of “events such as changes in ohmic resistance, internal states of the battery, or state of health of the battery.” Such interpretation is unreasonably broad because contradicts the plain meaning of the positively claimed limitations. Indeed, the ohmic resistance and health of the battery are conditions of the battery, which are always present, and not the events that may or may not occur. Given this interpretation, to read on the claims as amended, the combination of the prior art references should read on a system that uses a neural network with connectivity determined as a frequency of occurrences between changes in the ohmic resistance and the health of the battery with the connection established only when the probability of subsequent changes of the resistance and health is above the threshold. With all due respect, this does not make technical sense, and the cited prior art is completely silent on even the possibility of this correlation. In a legal sense, this means that the proposed modification has little chance of success and thus a skilled artisan would not be motivated to modify the references as suggested in the Office Action. 
Therefore, for at least these reasons, claims 1, 12, and 17 patentably distinguish over the prior art of record and are allowable. Reconsideration and withdrawal of the rejection of claims 1, 12, and 17 and all claims dependent therefrom are therefore respectfully requested. As a further note, Andre uses a mathematical model that is difficult to modify with statistical analysis taught by Lendaris and Conant. Lendaris deals with NN pre-structuring using extended dependency analysis for dimension reduction. The analysis of Lendaris is from dynamical and statistical perspective via graph without saying how to build a NN. The approach is not deterministic. Our method can be considered deterministic because the structure of the neural network stored in the memory is built on data and preexists its execution by the processor, as claimed. In fact, Lendaris's method cannot be used to build our signal connection matrix (fig 6a), which determines our NN structure, because Lendaris's method does not teach how to build our event table (fig 5a), which is used to build connection matrix. Similarly, Conant's work is on extended dependency analysis, it doesn't deal with NN. The system of Andre uses structured NN to estimate battery life. However, there is nothing about how to structure NN and is very different from the essence of the claimed invention. 


Examiner’s Response:
The Examiner respectfully disagrees and notes that, during examination, a claim must be given its broadest reasonable interpretation consistent with the specification (see M.P.E.P. 2173.01(I), M.P.E.P. 2111.01(II)). The Examiner also notes that the allowability of the claims determined during the European stage of the prosecution of this application cannot by itself be a basis for allowability at the USPTO. The Examiner maintains that, as set forth in the NOFA, Andre, Yanqing, Lendaris, and Conant teach each element of the amended independent claims and that the combination of these references is obvious, appropriate, and properly motivated. Specifically, Andre establishes the structured neural network and teaches the supervised training of a pre-structured neural network (SNN) with that structure posited according to an a priori relationship between variables. Even if the structure is pre-posited, the actual relationship is learned through the training process, with that learning being statistical in the sense of determining the best match between the training data and the network parameters ([p. 954, Section 4.2, p. 958, Section 6.1.2, Figure 5]). It is also noted that Andre teaches the application of the Extended Kalman Filter to this same problem, which forms a statistical/probabilistic model for the corresponding stochastic process, thereby further indicating that Andre’s particular application is not incompatible with probabilistic/stochastic modelling methods. Although Andre teaches that his framework is intended to be integrated into a controller unit framework in the vehicle electrical system (particularly because of the throughput advantages of the neural network [p. 960, Section 7]), he does this in the context of future work.
Yanqing merely augments Andre to incorporate the SNN into a control system which would have improved accuracy and time cost with the implementation of the structured neural network in a system using an adaptive controller to improve the system state condition estimate and to use it in a timely manner to identify and respond to system problems before they result in greater damage to that system ([p. 4, Section 3.2, p. 6, Section 5]). Although Yanqing teaches that the inputs to the neural network are time-delay components formed in the closed loop (adaptive) control system, he does not disclose connectivity based on a probability of subsequent occurrences. Moreover, although Andre points to a well known ANN construction method (see section 4.2) in which the structure of the network is pre-posited, Lendaris provides further details for the construction of the structured neural network, specifically by using extended dependency analysis (EDA); in other words, Lendaris teaches that the SNN of Andre can be constructed using EDA analysis which evaluates statistical correlations between variables of potential interest. 
Specifically Lendaris teaches that a neural network is structured on the basis of a probability computed from pairs of independent features (signals) in a time series through a statistical dynamic analysis process which determines a joint probability distribution involving each pair of independent features and the dependent (target/output) variable that is used to determine a probabilistic degree to which the dependent variable (i.e., that the events in the time series associated with the two independent features predict the dependent variable) is explained by the two independent features (chi-squared test, information theoretic analysis) and teaches that this probability is used to partially connect the neural network in a pre-structuring function on the basis of a certain number of pairs of features pre-specified for inclusion in that topology such that the probabilistic threshold that determines the connectivity corresponds to the reduction in uncertainty (table 1) computed from the joint and conditional probabilities (an information theoretic computation of a probabilistic confidence level or a chi-squared test) such that the particular probabilistic characterization of the least important pair over the specified number of pairs determines the threshold (viz., ([p. 406, Section 5, p. 407, Section 5.1, p. 409, Section 8.2, p. 412, Section 8.3.3,Table 1] An attribute of the EDA that turns out being especially useful in our NN design context is that in the process of determining the structural information (which is based on "binning" the continuous data to develop categorical data upon which the EDA operates), it also provides substantial information about the probability densities represented by the data. 
The latter may be incorporated in the NN design; a key benefit of such transition to the NN with this information is that the NN, in contrast to EDA, operates with the full metric (as opposed to categorical) problem domain., The dynamic algorithm consists of a pair of heuristics labeled H2 and H3, together with some minimal form of reconstructability analysis. For each dependent variable in the analysis, H2 calculates the three-way transmission between the dependent variable and every possible pair of independent variables in the analysis. These values provide a measure of how much the knowledge of independent variables reduces uncertainty about the dependent variable. The pairs of independent variables are sorted based on these transmission values, and those which pass a chi-squared significance test are put into the set of candidate variables. The maximum size of the candidate set is a user-determined parameter in this algorithm. When more than this number of significant variables are found, the most significant variables are saved and the others are discarded, The H2 analysis amounts to sequentially constructing approximate joint probability distributions for two independent variables (features or measurements) and the dependent variable (classifier output) based on the training data. These distributions can be decomposed into conditional distributions for each class (a step we will take in the next section), and allow us to calculate the degree to which "class" is explained by each possible pair of features or measurements., Now we would like to directly translate our joint probability distribution for the nominal variables into a neural classifier design for the quantitative features and measurements…. As an example, we can take the vectors Cj that partition our quantitative space and use them directly as instars to prototype elements in a Counterprop NN. 
Thus, the prototypes are completely specified by the binning scheme and the actual range of variable values observed. The outstar weights from each prototype element may be directly assigned, using the conditional probabilities of the classes in that region developed in the EDA.). Although Lendaris teaches the determination of neural network connectivity according to a statistical analysis of (pairwise) features obtained from a time series, he does not clearly disclose that this statistical analysis is focused on temporal causality (i.e., that one feature at a given time is analyzed statistically vis a vis another feature at a later or different time). Also, Lendaris discloses neural connectivity according to statistical analysis but uses an information theoretic method in lieu of a statistically robust sample set and does not characterize the pertinent statistics according to a cross correlation between temporally offset features. However, Conant extends the EDA method of Lendaris to include temporal causality determinations among different variables in a time series by virtue of teaching the dynamic analysis of a multivariate time series for the purpose of determining causal dependencies between features based on the determination of the conditional probabilities that relate a feature (or set of features) at a particular time to a subsequent or a previous time in which these probabilities characterize the frequency of co-occurrence of events at temporal offsets over a period of time (e.g., over a year) (viz., [p. 100, “Introduction”, p. 102, “Combinatorial considerations in PRA and DA”, p. 111, “DEDUCE”] In the use of DA for analysis of dynamic structure, one starts with N conditional distributions p_j: {p_j = p(j | SV), j = 1, 2, ..., N} each one involving the N variables of the system plus one variable j delayed in time, so that p_j embodies the relationship between variable j at one time and all N variables of the system at one time unit earlier. 
It is assumed that the set of these distributions implicitly characterizes the dynamic behavior of the system; DA is the tool which extracts a description of that dynamic behavior from the conditional distributions, or in practice from a time record of actual behavior which implicitly represents them. The end result of DA for dynamic analysis is a set of N "dependencies" denoted by a variable, an arrow, and a set of variables, {j ← D(j), D(j) ⊂ SV, j = 1, ..., N}, The first or "data" constraint arises from the fact that since the analysis of a system is based upon observed probabilities (or frequencies) of the variables and sets of variables, then these probabilities must be supported by enough data to make them credibly representative of the inherent behavior of the system., Its result is a dependency set for each of the N variables, revealing the supposedly causal structure of the system by showing for each variable the other variables which serve as its predictors. In addition to the sets D(j), EDA also reveals for each j the strength of the predictive relationship, measured by T(j: D(j)). If this transmission equals H(j), the entropy of j, then j is completely determined by the variables in D(j), and at the opposite extreme if T(j: D(j)) is only marginally sufficient to meet the chi-squared criterion for significance then the predictive relationship is weak.). Hence, while Lendaris teaches the application of extended dependency analysis for pre-structuring neural networks according to observed statistically significant correlations between pairs of variables in a time series in order to identify the significant explanatory variables, Conant extends the extended dependency analysis of multivariate time series data to specifically include the dependence between variables at different temporal offsets.
Lendaris and Conant are relevant to Andre and Yanqing because they enable Andre and Yanqing to optimize the structuring of the neural network according to hypothesized (causal) variable dependencies (e.g., the pertinent terms for f(SOC) and f(T) in equations 8 and 9). In other words, the teachings of Andre and Yanqing are directly augmented by the teachings of Lendaris and Conant because Lendaris and Conant would have enabled the improvement in the training and generalization of a neural network by pre-structuring it according to the statistical significance of pair-wise causal associations of input features into that neural network, thereby mitigating the exponential scaling of computation over the number of features.
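For illustration only, the H2 step quoted from Lendaris above can be sketched as follows. This is a minimal sketch under stated assumptions, not the reference's implementation: the data are assumed to be already binned into categorical values, the three-way transmission is taken to be the mutual information T(y : x1, x2) = H(y) + H(x1, x2) - H(x1, x2, y), the chi-squared significance test is omitted, and the names entropy, transmission, rank_pairs, and max_candidates are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def entropy(samples):
    """Shannon entropy (in bits) of a sequence of hashable categorical samples."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def transmission(x1, x2, y):
    """Assumed form of the three-way transmission for binned variables:
    T(y : x1, x2) = H(y) + H(x1, x2) - H(x1, x2, y)."""
    pair = list(zip(x1, x2))
    triple = list(zip(x1, x2, y))
    return entropy(y) + entropy(pair) - entropy(triple)


def rank_pairs(features, y, max_candidates):
    """Score every pair of independent variables against the dependent
    variable y and keep the max_candidates highest-scoring pairs.
    The chi-squared significance filter described in the quote is omitted."""
    scored = [
        ((a, b), transmission(features[a], features[b], y))
        for a, b in combinations(sorted(features), 2)
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:max_candidates]


# Hypothetical binned data: y is fully determined by the pair (a, b) only.
features = {"a": [0, 0, 1, 1], "b": [0, 1, 0, 1], "c": [0, 0, 0, 0]}
y = [0, 1, 1, 0]
top = rank_pairs(features, y, max_candidates=1)
```

In this toy data set only the pair (a, b) reduces the uncertainty about y, so it is the single candidate retained; pairs involving the constant variable c carry no information about y.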

Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

As explained in MPEP § 2181, subsection I, claim limitations that meet the following three-prong test will be interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph:
(A)	the claim limitation uses the term “means” or “step” or a term used as a substitute for “means” that is a generic placeholder (also called a nonce term or a non-structural term having no specific structural meaning) for performing the claimed function; 
(B)	the term “means” or “step” or the generic placeholder is modified by functional language, typically, but not always linked by the transition word “for” (e.g., “means for”) or another linking word or phrase, such as “configured to” or “so that”; and 
(C)	the term “means” or “step” or the generic placeholder is not modified by sufficient structure, material, or acts for performing the claimed function. 
Use of the word “means” (or “step”) in a claim with functional language creates a rebuttable presumption that the claim limitation is to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites sufficient structure, material, or acts to entirely perform the recited function. 
Absence of the word “means” (or “step”) in a claim creates a rebuttable presumption that the claim limitation is not to be treated in accordance with 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. The presumption that the claim limitation is not interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, is rebutted when the claim limitation recites function without reciting sufficient structure, material or acts to entirely perform the recited function. 
Claim limitations in this application that use the word “means” (or “step”) are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action. Conversely, claim limitations in this application that do not use the word “means” (or “step”) are not being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, except as otherwise indicated in an Office action.
This application includes one or more claim limitations that do not use the word “means,” but are nonetheless being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, because the claim limitation(s) uses a generic placeholder that is coupled with functional language without reciting sufficient structure to perform the recited function and the generic placeholder is not preceded by a structural modifier.  Such claim limitation(s) is/are:
“input interface” in claim 1
“controller” in claim 1.
“neural network trainer” in claims 7 and 8.

Because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may:  (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35 U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 1, 2, 7-8, 11-13, and 16-18 are rejected under 35 U.S.C. 103 as being unpatentable over Andre et al. (“Comparative study of a structured neural network and an extended Kalman filter for state of health determination of lithium-ion batteries in hybrid electric vehicles”, Engineering Applications of Artificial Intelligence, 26, 2013, pp. 951-961), hereinafter referred to as Andre, in view of Shen Yanqing (“Adaptive online state-of-charge determination based on neuro-controller and neural network”, Energy Conversion and Management, 51, 2010, pp. 1093-1098), hereinafter referred to as Yanqing, in view of Lendaris et al. (“Prestructuring neural networks via extended dependency analysis with application to pattern classification”, Proc. SPIE 3722, Applications and Science of Computational Intelligence II, 22 March 1999, pp. 402-413), hereinafter referred to as Lendaris, and in further view of Roger Conant (“Extended Dependency Analysis of Large Systems”, International Journal of General Systems, 14, 1988, pp. 97-123), hereinafter referred to as Conant.

In regards to claim 1, Andre teaches An apparatus for controlling a system including a plurality of sources of signals causing a plurality of events, comprising: an input interface to receive signals from the sources of signals; ([p. 951, Section 1, p. 954, Section 4.1, p. 954, Section 4.2, Figure 5] Moreover, an online tracking of the resistance change is indispensable to prevent failures of the electric system and accordingly the vehicle too. Besides, it can provide further advantages such as the recognition of an on-going or abrupt deterioration; or a residual value determination of the battery in case of a vehicle repair or resale., In this work, the system of Fig. 2 is the battery with measured values of SOC, temperature and current as input values u of the network and voltage as the desired output y., Consequently, a SNN is set up for the estimation of the ohmic resistance Rohm by considering temperature and SOC as further input signals., wherein a set of signals are measured in a system for monitoring the state of a battery for the purpose of identifying problems in a timely fashion (i.e., for controlling the robustness of an electrical system by mitigating failures to that system) in which the signals include temperature, SOC (state of charge), and current associated with the battery which quantify events such as changes in ohmic resistance, internal states of the battery, or state of health of the battery such that these signals are used as inputs (i.e., received at an input interface) to a neural network configuration (Figure 5).) 
a memory to store a neural network trained to diagnose a control state of the system, wherein the neural network includes a sequence of layers, each layer includes a set of nodes, each node of an input layer and a first hidden layer following the input layer corresponds to a source of signal in the system, wherein a pair of nodes from neighboring layers corresponding to a pair of different sources of signals are connected in the neural network …, such that the neural network is a partially connected neural network; ([p. 954, Section 4.2, p. 958, Section 6.1.2, Figure 5] Based on these equations, it can be seen that a one-layer network structure is sufficient to represent the model of R_ohm. More layers would be possible, but are not necessary to describe the dependencies in consequence of the model structure itself. Now, the SNN is adjusting the parameters in such a way that the functions yield to the target values….The physical relationship between input and output data is established through the weight adaptation of the parameters k, n as well as p0–p4 during the training of Eq. (5)., Now, the trained SNN is validated by the output value u for MOL data sets at 13,000 km and 80,000 km, chosen as values near BOL and between BOL and EOL to represent a normal lifecycle, in Figs. 10 and 11, respectively., wherein the signals are input into a structured neural network as shown in Figure 5 (with T, SOC, i_meas input at multiple distinct input nodes) in which various nodes of a hidden layer selectively combine weighted inputs associated with distinct signals (e.g., i_meas, T, and SOC are combined in the product hidden node of SNN1) according to the structuring (i.e., the network is partially connected) and wherein the neural network is trained/designed to diagnose the state of health of the battery such that the learned network parameters are retained (i.e., stored in a memory) for application to a test/validation set (to assess its efficacy).)  
a processor to submit the signals into the neural network to produce the control state of the system; …([p. 958, Section 6.1.3, Figure 5] Finally, the trained SNN is applied on a second data set with BOL and EOL data in order to investigate the influence of mileage on the increase of the ohmic resistance Rohm. Again, reference data gained by EIS are used for validation at BOL. In order to obtain suitable and comparable curves, the data were filtered by a first order Butterworth algorithm. In Fig. 12 the results of the smoothed SNN obtained Rohm and the reference data are displayed., wherein the input signals are processed by the neural network to estimate the internal states of the battery and, from those states, to estimate the state of health of the battery (either of which are control states of the system).) As noted above, “input interface” in the claims is being interpreted as a generic placeholder without the recitation of sufficient accompanying structure to perform the function; a review of the specification shows that the following appears to be the corresponding structure described in the specification: [0085, 0086] “The apparatus 900 includes an input interface to receive signals from the sources of signals of the controlled system. For example, in some implementations, the input interface includes a human machine interface 910 within the apparatus 900 that connects the processor 920 to a keyboard 911 and pointing device 912, wherein the pointing device 912 can include a mouse, trackball, touchpad, joy stick, pointing stick, stylus, or touchscreen, among others., Additionally, or alternatively, the input interface can include a network interface controller 950 adapted to connect the apparatus 900 through the bus 906 to a network 990. Through the network 990, the signals 995 from the controlled system can be downloaded and stored within the storage system 930 as training and/or operating data 934 for storage and/or further processing. 
The network 990 can be wired or wireless network connecting the apparatus 900 to the sources of the controlled system or to an interface of the controlled system for providing the signals and metadata of the signal useful for the diagnostic.”
However, Andre does not explicitly teach … only when a probability of subsequent occurrence of the events in the pair of the different sources of signals is above a threshold, … wherein the probability of subsequent occurrence of the events in the pair of the different sources of signals is a function of a frequency of the subsequent occurrence of the events in the signals collected over period; … and a controller to execute a control action selected according to the control state of the system. In other words, Andre does not disclose that the partial (in a time stream) connectivity is determined by a probabilistic causal relationship between different signals but rather according to known a priori causal relationships (equations 6-9) and, although Andre teaches that his framework is intended to be integrated into a controller unit framework in the vehicle electrical system (particularly because of the throughput advantages of the neural network [p. 960, Section 7]), he does this in the context of future work.
However, Yanqing, in the analogous environment of using a neural network to monitor and assess system health, teaches a controller to execute a control action selected according to the control state of the system ([p. 2, Section 2.3, p. 3, Section 3.1, p. 4, Section 3.2, Figure 1] As shown in Fig. 1, artificial neural network (ANN) based battery system model is established. The input of RBF NN includes those parameters of the above discrete state space based model, which is … , As depicted in Figs. 1 and 3, an artificial neural network (ANN) based inverse battery system is erected to evaluate battery terminal voltage, where y is the measured cell terminal voltage, ŷ is the estimation value of ANN based model, and the prediction error is e = y − ŷ. To predict SOC more quickly and stably, this paper employs a neural network controller [14]., Syncretizing (14) and (18), we establish a modified PID controller, together with BPNN model, to evaluate cell SOC, which converges to the real value as time goes on and ceases calculation when reaching the desired prediction error e or the given maximum simulation step N. 
Once simulation time exceeds NT, it means two statuses: one is that there exists some problems in the battery’s state of health (SOH), and customers should take measures to improve its status; the other is that the established model is not fit for the current tested cell, users should employ corresponding suitable models to determinate its SOC., wherein a neural network processes inputs system (battery) to generate an output indicative of the state of that system (Kp, KI, KD that characterize the dynamics of the system as required for SOC estimation) such that a control action is executed by a controller in the form of the feedback signal in the adaptive neural network system such that this adaptive feedback process provides an indication to the customers or users if it fails to achieve timely convergence (i.e., this forms/executes another control signal that invokes the execution of another control action involving human intervention.).) As noted above, “controller” in the claims is being interpreted as a generic placeholder without the recitation of sufficient accompanying structure to perform the function; a review of the specification shows that the following appears to be the corresponding structure described in the specification: [0087, 0088] “The control action can be configured and/or selected based on a type of the controlled system. For example, the controller can render the results of the diagnosis. For example, the apparatus 900 can be linked through the bus 906 to a display interface 960 adapted to connect the apparatus 900 to a display device 965, wherein the display device 965 can include a computer monitor, camera, television, projector, or mobile device, among others., Additionally, or alternatively, the controller can be configured to directly or indirectly control the system based on results of the diagnosis. 
For example, the apparatus 900 can be connected to a system interface 970 adapted to connect the apparatus to the controlled system 975 according to one embodiment. In one embodiment, the controller executes a command to stop or alter the manufacturing procedure of the controlled manufacturing system in response to detecting an anomaly.”
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Andre to incorporate the teachings of Yanqing for the controller to execute a control action according to the control state of the system diagnosed by the neural network.  The modification would have been obvious because one of ordinary skill would have been motivated to improve accuracy and time cost in the implementation of a neural network in a system using an adaptive controller to improve the system state condition estimate and to use it in a timely manner to identify and respond to system problems before they result in greater damage to that system ([p. 4, Section 3.2, p. 6, Section 5]).
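For illustration only, the stopping behavior quoted from Yanqing above (iterate until the prediction error falls within a desired bound or a maximum number of steps is reached, and flag the case in which convergence is not achieved) can be sketched as follows. This is a minimal sketch under stated assumptions and is not the reference's implementation: the function name, the callable arguments, the proportional update, and the numeric values are all hypothetical.

```python
def estimate_until_converged(measured, predict, update, state, tol, max_steps):
    """Iterate a state estimate until |measured - predict(state)| <= tol.

    Returns (state, steps_used, converged). All names here are hypothetical;
    the loop only mirrors the two stopping conditions quoted above
    (desired prediction error reached, or maximum step count reached).
    """
    for step in range(1, max_steps + 1):
        error = measured - predict(state)
        if abs(error) <= tol:
            return state, step, True
        state = update(state, error)
    # No convergence within max_steps: the caller may treat this as a flag
    # that the system or the model needs attention.
    return state, max_steps, False


# Toy example: the "model" is the identity and the update is proportional.
final_state, steps, converged = estimate_until_converged(
    measured=3.7,
    predict=lambda s: s,
    update=lambda s, e: s + 0.5 * e,
    state=0.0,
    tol=0.01,
    max_steps=50,
)
```

In this toy setting the error halves on each step, so the loop stops well before the step limit; with a small `max_steps` the third return value is `False`, which corresponds to the non-convergence indication discussed above.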
However, Andre and Yanqing do not explicitly teach … only when a probability of subsequent occurrence of the events in the pair of the different sources of signals is above a threshold, … wherein the probability of subsequent occurrence of the events in the pair of the different sources of signals is a function of a frequency of the subsequent occurrence of the events in the signals collected over period. Although Yanqing teaches that the inputs to the neural network are time-delay components formed in the closed loop (adaptive) control system, he does not disclose connectivity based on a probability of subsequent occurrences. 
However, Lendaris, in the analogous environment of designing structured neural networks, teaches diagnose a control state of the system, wherein the neural network includes a sequence of layers, each layer includes a set of nodes, each node of an input layer and a first hidden layer following the input layer corresponds to a source of signal in the system, wherein a pair of nodes from neighboring layers corresponding to a pair of different sources of signals are connected in the neural network only when a probability of … occurrence of the events in the pair of the different sources of signals is above a threshold, such that the neural network is a partially connected neural network; ([p. 406, Section 5, p. 407, Section 5.1, p. 409, Section 8.2, p. 412, Section 8.3.3,
Table 1] An attribute of the EDA that turns out being especially useful in our NN design context is that in the process of determining the structural information (which is based on "binning" the continuous data to develop categorical data upon which the EDA operates), it also provides substantial information about the probability densities represented by the data. The latter may be incorporated in the NN design; a key benefit of such transition to the NN with this information is that the NN, in contrast to EDA, operates with the full metric (as opposed to categorical) problem domain., The dynamic algorithm consists of a pair of heuristics labeled H2 and H3, together with some minimal form of reconstructability analysis. For each dependent variable in the analysis, H2 calculates the three-way transmission between the dependent variable and every possible pair of independent variables in the analysis. These values provide a measure of how much the knowledge of independent variables reduces uncertainty about the dependent variable. The pairs of independent variables are sorted based on these transmission values, and those which pass a chi-squared significance test are put into the set of candidate variables. The maximum size of the candidate set is a user-determined parameter in this algorithm. When more than this number of significant variables are found, the most significant variables are saved and the others are discarded, The H2 analysis amounts to sequentially constructing approximate joint probability distributions for two independent variables (features or measurements) and the dependent variable (classifier output) based on the training data. 
These distributions can be decomposed into conditional distributions for each class (a step we will take in the next section), and allow us to calculate the degree to which "class" is explained by each possible pair of features or measurements., Now we would like to directly translate our joint probability distribution for the nominal variables into a neural classifier design for the quantitative features and measurements…. As an example, we can take the vectors Cj that partition our quantitative space and use them directly as instars to prototype elements in a Counterprop NN. Thus, the prototypes are completely specified by the binning scheme and the actual range of variable values observed. The outstar weights from each prototype element may be directly assigned, using the conditional probabilities of the classes in that region developed in the EDA., wherein a neural network is structured on the basis of a probability computed from pairs of independent features (signals) in a time series through a statistical dynamic analysis process which determines a joint probability distribution involving each pair of independent features and the dependent (target/output) variable that is used to determine a probabilistic degree to which the dependent variable (i.e., that the events in the time series associated with the two independent features predict the dependent variable) is explained by the two independent features (chi-squared test, information theoretic analysis), wherein this probability is used to partially connect the neural network in a pre-structuring function on the basis of a certain number of pairs of features pre-specified for inclusion in that topology such that the probabilistic threshold that determines the connectivity corresponds to the reduction in uncertainty (table 1) computed from the joint and conditional probabilities (an information theoretic computation of a probabilistic confidence level or a chi-squared test) such that the particular 
probabilistic characterization of the least important pair over the specified number of pairs determines the threshold.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Andre and Yanqing to incorporate the teachings of Lendaris to partially connect a sequence of layers in a neural network in which the connectivity from the input layer to the hidden layer is determined by the probability of occurrence of events in a pair of the different sources of signals exceeding a threshold.  The modification would have been obvious because one of ordinary skill would have been motivated to improve training and generalization of a neural network by pre-structuring it according to the statistical significance of pair-wise associations of input features into that neural network, including when those features are derived from a time series, thereby mitigating the exponential scaling of computation over the number of features (Lendaris, [Abstract, p. 402, Section 1, p. 412, Section 9]).
However, Andre, Yanqing, and Lendaris do not teach …subsequent… wherein the probability of subsequent occurrence of the events in the pair of the different sources of signals is a function of a frequency of the subsequent occurrence of the events in the signals collected over period.  In other words, although Lendaris teaches the determination of neural network connectivity according to a statistical analysis of (pairwise) features obtained from a time series, he does not clearly disclose that this statistical analysis is focused on temporal causality (i.e., that one feature at a given time is analyzed statistically vis a vis another feature at a later time). Andre and Yanqing do not determine neural connectivity according to a probability. Lendaris discloses neural connectivity according to statistical analysis but uses an information theoretic method in lieu of a statistically robust sample set and does not characterize the pertinent statistics according to a cross correlation between temporally offset features. 
However, Conant, in the analogous environment of designing structured neural networks, teaches wherein a pair of nodes … corresponding to a pair of different sources of signals are connected … when a probability of subsequent occurrence of the events in the pair of the different sources of signals is above a threshold, such that the … is a partially connected …, ([p. 100, “Introduction”, p. 102, “Combinatorial considerations in PRA and DA”, p. 111, “DEDUCE”] In the use of DA for analysis of dynamic structure, one starts with N conditional distributions p_j: {p_j = p(j | SV), j = 1, 2, ..., N} each one involving the N variables of the system plus one variable j delayed in time, so that p_j embodies the relationship between variable j at one time and all N variables of the system at one time unit earlier. It is assumed that the set of these distributions implicitly characterizes the dynamic behavior of the system; DA is the tool which extracts a description of that dynamic behavior from the conditional distributions, or in practice from a time record of actual behavior which implicitly represents them. The end result of DA for dynamic analysis is a set of N "dependencies" denoted by a variable, an arrow, and a set of variables, {j ← D(j), D(j) ⊂ SV, j = 1, ..., N}, The first or "data" constraint arises from the fact that since the analysis of a system is based upon observed probabilities (or frequencies) of the variables and sets of variables, then these probabilities must be supported by enough data to make them credibly representative of the inherent behavior of the system., Its result is a dependency set for each of the N variables, revealing the supposedly causal structure of the system by showing for each variable the other variables which serve as its predictors. In addition to the sets D(j), EDA also reveals for each j the strength of the predictive relationship, measured by T(j: D(j)). 
If this transmission equals H(j), the entropy of j, then j is completely determined by the variables in D(j), and at the opposite extreme if T(j: D(j)) is only marginally sufficient to meet the chi-squared criterion for significance then the predictive relationship is weak., wherein dynamic analysis of a multivariate time series for the purpose of determining causal dependencies between features is based on the determination of the conditional probabilities that relate a feature (or set of features) at a particular time to a subsequent or a previous time in which these probabilities characterize the co-occurrence of events at temporal offsets over a period of time (e.g., over a year) and are used to define the connections between the causally correlated features/signals.) wherein the probability of subsequent occurrence of the events in the pair of the different sources of signals is a function of a frequency of the subsequent occurrence of the events in the signals collected over period; ([p. 100, “Introduction”, p. 102, “Combinatorial considerations in PRA and DA”, p. 111, “DEDUCE”] In the use of DA for analysis of dynamic structure, one starts with N conditional distributions p_j: {p_j = p(j | SV), j = 1, 2, ..., N} each one involving the N variables of the system plus one variable j delayed in time, so that p_j embodies the relationship between variable j at one time and all N variables of the system at one time unit earlier. It is assumed that the set of these distributions implicitly characterizes the dynamic behavior of the system; DA is the tool which extracts a description of that dynamic behavior from the conditional distributions, or in practice from a time record of actual behavior which implicitly represents them. The end result of DA for dynamic analysis is a set of N "dependencies" denoted by a variable, an arrow, and a set of variables, {j ← D(j), D(j) ⊂ SV, j = 1, ..., N}, The first or "data" constraint arises from the fact that since the analysis of a system is based upon observed probabilities (or frequencies) of the variables and sets of variables, then these probabilities must be supported by enough data to make them credibly representative of the inherent behavior of the system., Its result is a dependency set for each of the N variables, revealing the supposedly causal structure of the system by showing for each variable the other variables which serve as its predictors. In addition to the sets D(j), EDA also reveals for each j the strength of the predictive relationship, measured by T(j: D(j)). If this transmission equals H(j), the entropy of j, then j is completely determined by the variables in D(j), and at the opposite extreme if T(j: D(j)) is only marginally sufficient to meet the chi-squared criterion for significance then the predictive relationship is weak., wherein dynamic analysis of a multivariate time series for the purpose of determining causal dependencies between features is based on the determination of the conditional probabilities that relate a feature (or set of features) at a particular time to a subsequent or a previous time in which these probabilities characterize the frequency of co-occurrence of events at temporal offsets over a period of time (e.g., over a year).) 
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Andre, Yanqing, and Lendaris to incorporate the teachings of Conant to partially connect a sequence of layers in a neural network in which the connectivity from the input layer to the hidden layer is determined by the probability of the subsequent occurrence of events in a pair of the different sources of signals according to a frequency of the subsequent occurrence of the events collected over a period.  The modification would have been obvious because one of ordinary skill would have been motivated to improve the efficiency of performing dependency analysis of multi-variate time series by overcoming combinatorial constraints through dynamic analysis of features which identifies important causal relationships through features using statistical techniques with various levels of statistical support (Conant, [Abstract, pp. 102-104, “Combinatorial Considerations in PRA and DA”, p. 123, “Conclusion”]).
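For illustration only, the limitation discussed above (a probability of subsequent occurrence that is a function of a frequency of subsequent occurrence, compared against a threshold to decide connectivity) can be sketched as follows. This is a minimal sketch under stated assumptions and does not reproduce the method of the application or of any cited reference. The event encoding (a time-ordered list of source indices), the "immediately following event" notion of subsequence, the row normalization, and the choice to always keep a connection between nodes of the same source are all assumptions made for the example.

```python
def subsequent_occurrence_probabilities(events, n_sources):
    """Estimate P(next event is in source j | current event is in source i).

    `events` is a time-ordered list of source indices, one entry per event
    observed over the collection period (a hypothetical input format).
    """
    # Frequency of "an event in source j directly follows one in source i".
    counts = [[0] * n_sources for _ in range(n_sources)]
    for current, following in zip(events, events[1:]):
        counts[current][following] += 1
    # Normalize each row so the frequencies become conditional probabilities.
    probabilities = []
    for row in counts:
        total = sum(row)
        probabilities.append([c / total if total else 0.0 for c in row])
    return probabilities


def connectivity_mask(probabilities, threshold):
    """Connect a pair of different sources only above the threshold.

    Same-source connections (i == j) are kept unconditionally; that choice
    is an assumption of this sketch.
    """
    n = len(probabilities)
    return [[i == j or probabilities[i][j] > threshold for j in range(n)]
            for i in range(n)]


events = [0, 1, 0, 1, 2, 0, 1]  # toy event log over three sources
probs = subsequent_occurrence_probabilities(events, n_sources=3)
mask = connectivity_mask(probs, threshold=0.6)
```

With this toy log, an event in source 1 always follows an event in source 0, so that pair is connected, while the pairs leaving source 1 each have probability 0.5 and fall below the 0.6 threshold.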

In regards to claim 2, the rejection of claim 1 is incorporated and Andre further teaches wherein a number of nodes in the input layer equals a multiple of a number of the sources of signals in the system, and a number of nodes in the first hidden layer following the input layer equals the number of the sources of signals, wherein the input layer is partially connected to the first hidden layer based on … occurrence of the events in different sources of signals. ([p. 954, Section 4.2, Figure 5] Based on these equations, it can be seen that a one-layer network structure is sufficient to represent the model of R_ohm. More layers would be possible, but are not necessary to describe the dependencies in consequence of the model structure itself. Now, the SNN is adjusting the parameters in such a way that the functions yield to the target values….The physical relationship between input and output data is established through the weight adaptation of the parameters k, n as well as p0–p4 during the training of Eq. (5)., wherein, for SNN1 (see Figure 5), the number of nodes in the input layer is 6 while the number of nodes in the first hidden layer is 3 (fohm(T), fohm(SOC), the product operation with imeas), such that the number of nodes in the hidden layer (3) is equal to the number of distinct signals that are input into the input layer (T, SOC, and imeas), and wherein the connectivity between the input layer and the hidden layer is partial based on the deterministic association between those features (i.e., according to the known a priori system dynamics that relates feature-based events).) 
However, Andre and Yanqing do not explicitly teach … probabilities of subsequent… The connectivity in Andre is determined by a priori information about system dynamics rather than by statistics. Yanqing does not make use of partial connectivity.
However, Lendaris, in the analogous environment of designing structured neural networks, teaches …wherein the input layer is partially connected to the first hidden layer based on probabilities of … occurrence of the events in different sources of signals; ([p. 406, Section 5, p. 407, Section 5.1, p. 409, Section 8.2, p. 412, Section 8.3.3,
Table 1] An attribute of the EDA that turns out being especially useful in our NN design context is that in the process of determining the structural information (which is based on "binning" the continuous data to develop categorical data upon which the EDA operates), it also provides substantial information about the probability densities represented by the data. The latter may be incorporated in the NN design; a key benefit of such transition to the NN with this information is that the NN, in contrast to EDA, operates with the full metric (as opposed to categorical) problem domain., The dynamic algorithm consists of a pair of heuristics labeled H2 and H3, together with some minimal form of reconstructability analysis. For each dependent variable in the analysis, H2 calculates the three-way transmission between the dependent variable and every possible pair of independent variables in the analysis. These values provide a measure of how much the knowledge of independent variables reduces uncertainty about the dependent variable. The pairs of independent variables are sorted based on these transmission values, and those which pass a chi-squared significance test are put into the set of candidate variables. The maximum size of the candidate set is a user-determined parameter in this algorithm. When more than this number of significant variables are found, the most significant variables are saved and the others are discarded, The H2 analysis amounts to sequentially constructing approximate joint probability distributions for two independent variables (features or measurements) and the dependent variable (classifier output) based on the training data. 
These distributions can be decomposed into conditional distributions for each class (a step we will take in the next section), and allow us to calculate the degree to which "class" is explained by each possible pair of features or measurements., Now we would like to directly translate our joint probability distribution for the nominal variables into a neural classifier design for the quantitative features and measurements…. As an example, we can take the vectors Cj that partition our quantitative space and use them directly as instars to prototype elements in a Counterprop NN. Thus, the prototypes are completely specified by the binning scheme and the actual range of variable values observed. The outstar weights from each prototype element may be directly assigned, using the conditional probabilities of the classes in that region developed in the EDA., wherein a neural network is structured on the basis of a probability computed from pairs of independent features (signals) in a time series through a statistical dynamic analysis process which determines a joint probability distribution involving each pair of independent features and the dependent (target/output) variable that is used to determine a probabilistic degree to which the dependent variable (i.e., that the events in the time series associated with the two independent features predict the dependent variable) is explained by the two independent features (chi-squared test, information theoretic analysis), wherein this probability is used to partially connect the neural network in a pre-structuring function on the basis of a certain number of pairs of features pre-specified for inclusion in that topology such that the probabilistic threshold that determines the connectivity corresponds to the reduction in uncertainty (table 1) computed from the joint and conditional probabilities (an information theoretic computation of a probabilistic confidence level or a chi-squared test) such that the particular 
probabilistic characterization of the least important pair over the specified number of pairs determines the threshold.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Andre and Yanqing to incorporate the teachings of Lendaris to partially connect a sequence of layers in a neural network in which the connectivity from the input layer to the hidden layer is determined by the probability of occurrence of events in a pair of the different sources of signals exceeding a threshold in which the number of nodes in the input layer equals a multiple of the number of nodes in the first hidden layer.  The modification would have been obvious because one of ordinary skill would have been motivated to improve training and generalization of a neural network by pre-structuring it according to the statistical significance of pair-wise associations of input features into that neural network, including when those features are derived from a time series, thereby mitigating the exponential scaling of computation over the number of features for a given network topology having a specified number of input nodes and specified number of hidden nodes (Lendaris, [Abstract, p. 402, Section 1, p. 411, Section 8.3.1, p. 412, Section 9]).
However, Andre, Yanqing, and Lendaris do not teach …subsequent… In other words, although Lendaris teaches the determination of neural network connectivity according to a statistical analysis of (pairwise) features obtained from a time series, he does not disclose that this statistical analysis is focused on temporal causality (i.e., that one feature at a given time is analyzed statistically vis a vis another feature at a later time).
However, Conant, in the analogous environment of designing structured neural networks, teaches wherein the input … is partially connected … based on probabilities of subsequent occurrence of the events in different sources of signals; ([p. 100, “Introduction”, p. 102, “Combinatorial considerations in PRA and DA”, p. 111, “DEDUCE”] In the use of DA for analysis of dynamic structure, one starts with N conditional distributions p_j: {p_j = p(j | SV), j = 1, 2, ..., N} each one involving the N variables of the system plus one variable j delayed in time, so that p_j embodies the relationship between variable j at one time and all N variables of the system at one time unit earlier. It is assumed that the set of these distributions implicitly characterizes the dynamic behavior of the system; DA is the tool which extracts a description of that dynamic behavior from the conditional distributions, or in practice from a time record of actual behavior which implicitly represents them. The end result of DA for dynamic analysis is a set of N "dependencies" denoted by a variable, an arrow, and a set of variables, {j ← D(j), D(j) ⊂ SV, j = 1, ..., N}, The first or "data" constraint arises from the fact that since the analysis of a system is based upon observed probabilities (or frequencies) of the variables and sets of variables, then these probabilities must be supported by enough data to make them credibly representative of the inherent behavior of the system., Its result is a dependency set for each of the N variables, revealing the supposedly causal structure of the system by showing for each variable the other variables which serve as its predictors. In addition to the sets D(j), EDA also reveals for each j the strength of the predictive relationship, measured by T(j: D(j)). 
If this transmission equals H(j), the entropy of j, then j is completely determined by the variables in D(j), and at the opposite extreme if T(j: D(j)) is only marginally sufficient to meet the chi-squared criterion for significance then the predictive relationship is weak., wherein dynamic analysis of a multivariate time series for the purpose of determining causal dependencies between features is based on the determination of the conditional probabilities that relate a feature (or set of features) at a particular time to a subsequent or a previous time in which these probabilities characterize the co-occurrence of events at temporal offsets over a period of time (e.g., over a year) and are used to define the connections between the causally correlated features/signals.) 
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Andre, Yanqing, and Lendaris to incorporate the teachings of Conant to partially connect a sequence of layers in a neural network in which the connectivity from the input layer to the hidden layer is determined by the probability of the subsequent occurrence of events in a pair of the different sources of signals.  The modification would have been obvious because one of ordinary skill would have been motivated to improve the efficiency of performing dependency analysis of multi-variate time series by overcoming combinatorial constraints through dynamic analysis of features which identifies important causal relationships through features using statistical techniques with various levels of statistical support (Conant, [Abstract, pp. 102-104, “Combinatorial Considerations in PRA and DA”, p. 123, “Conclusion”]).
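For illustration only, the layer sizing discussed for claim 2 (an input layer whose node count is a multiple of the number of signal sources, a first hidden layer with one node per source, and partial connectivity between them) can be sketched as a masked layer. This is a minimal sketch under stated assumptions: the 6-input, 3-hidden sizing follows the node counts the Action recites for SNN1, but the mask values, the unit weights, and both function names are hypothetical and are not taken from any cited reference.

```python
def expand_mask(source_mask, multiple):
    """Repeat each source's row so every source owns `multiple` input nodes.

    `source_mask[i][j]` says whether source i may feed hidden node j.
    The result has len(source_mask) * multiple rows (one per input node).
    """
    n = len(source_mask)
    return [[source_mask[i // multiple][j] for j in range(n)]
            for i in range(n * multiple)]


def masked_forward(inputs, weights, mask):
    """Weighted sum into each hidden node, skipping masked-out connections."""
    n_hidden = len(mask[0])
    return [sum(x * w[j] for x, w, m in zip(inputs, weights, mask) if m[j])
            for j in range(n_hidden)]


# Hypothetical source-level connectivity for three sources.
source_mask = [[True, True, False],
               [False, True, False],
               [True, False, True]]
layer_mask = expand_mask(source_mask, multiple=2)   # 6 input nodes x 3 hidden
weights = [[1.0] * 3 for _ in range(6)]             # unit weights for clarity
hidden = masked_forward([1, 1, 1, 1, 1, 1], weights, layer_mask)
```

Because the weights and inputs are all ones here, each hidden value simply counts the input nodes connected to that hidden node, which makes the effect of the mask easy to inspect.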

In regards to claim 7, the rejection of claim 1 is incorporated, and Andre further teaches neural network trainer configured to evaluate the signals from the source of signals collected over a period of time … to form the neural network according to the connectivity structure of the neural network, such that a number of nodes in the input layer equals a first multiple of a number of the source of signals in the system, and a number of nodes in the first hidden layer following the input layer equals a second multiple of the number of the sources of signals, wherein the input layer is partially connected to the first hidden layer according to the connectivity structure; and to train the neural network using the signals collected over the period of time. ([p. 954, Section 4.2, p. 958, Section 6.1.2, Figure 5] The physical relationship between input and output data is established through the weight adaptation of the parameters k, n as well as p0–p4 during the training of Eq. (5)., All data were measured directly in the vehicle and contain all required signals like current, voltage and temperature. Most of the sets have a length of about five to seven hours and are therefore very suitable for a training or validation., wherein the pre-structured neural network SNN1 is trained using data collected over a specified period of time in which the structure of that neural network is determined according to a priori knowledge of system dynamics such that the neural topology exhibits a number of nodes in the input layer (6) which is a multiple of the number of source signals for that neural network (3 – T, SOC, imeas) and such that the number of nodes in the first hidden layer (3) is also a multiple of the number of source signals (i.e., multiple = 1).)  
As noted above, “neural network trainer” in the claims is being interpreted as a generic placeholder without the recitation of sufficient accompanying structure to perform the function; a review of the specification shows that the following appears to be the corresponding structure described in the specification: [0093, 0099] “Fig. 10A shows a block diagram of a method used by a neural network trainer 933 to train the neural network 931 according to one embodiment. In this embodiment, the structure 932 of the neural network is determined from the probabilities of the subsequent occurrence of events, which are in turn functions of the frequencies of subsequent occurrence of events., Next, the embodiments train 1050 the neural network 1045 using the signals 1055 collected over the period of time. The signals 1055 can be the same or different from the signals 1005. The training 1050 optimizes parameters of the neural network 1045. The training can use different methods to optimize the weights of the network such as stochastic gradient descent.”
However, Andre and Yanqing do not explicitly disclose to determine frequencies of subsequent occurrence of events within the period of time for different combinations of pairs of sources of signals; to determine probabilities of the subsequent occurrence of events for different combinations of the pairs of sources of signals based on the frequencies of subsequent occurrence of events within the period of time; to compare the probabilities of the subsequent occurrence of events for different combinations of pairs of sources of signals with the threshold to determine a connectivity structure of the neural network. The connectivity in Andre is determined by a priori information about system dynamics rather than by statistics. Yanqing does not make use of partial connectivity.
However, Lendaris, in the analogous environment of designing structured neural networks, teaches neural network trainer configured to evaluate the signals from the source of signals collected over a period of time to determine frequencies of subsequent occurrence of events within the period of time for different combinations of pairs of sources of signals; to determine probabilities of the subsequent occurrence of events for different combinations of the pairs of sources of signals based on the frequencies of subsequent occurrence of events within the period of time; to compare the probabilities of the subsequent occurrence of events for different combinations of pairs of sources of signals with the threshold to determine a connectivity structure of the neural network; to form the neural network according to the connectivity structure of the neural network, …wherein the input layer is partially connected to the first hidden layer according to the connectivity structure; and to train the neural network using the signals collected over the period of time ([p. 408, Section 8.1, p. 406, Section 4.3, p. 406, Section 5, p. 407, Section 5.1, p. 409, Section 8.2, pp. 411-412, Section 8.3.2, p. 412, Section 8.3.3, Table 1] During the 1974 project, 57 "features" were empirically extracted from the FDP data points [15]., Keeping in mind that the likely candidate NN structure to use in the absence of a priori information is the fully connected (structure-zero) NN, these results further press home the idea of obtaining as much a priori information as possible for use in selecting the NN structure before training. 
Similar positive results are demonstrated for the other research predictions suggested earlier., An attribute of the EDA that turns out being especially useful in our NN design context is that in the process of determining the structural information (which is based on "binning" the continuous data to develop categorical data upon which the EDA operates), it also provides substantial information about the probability densities represented by the data. The latter may be incorporated in the NN design; a key benefit of such transition to the NN with this information is that the NN, in contrast to EDA, operates with the full metric (as opposed to categorical) problem domain., The dynamic algorithm consists of a pair of heuristics labeled H2 and H3, together with some minimal form of reconstructability analysis. For each dependent variable in the analysis, H2 calculates the three-way transmission between the dependent variable and every possible pair of independent variables in the analysis. These values provide a measure of how much the knowledge of independent variables reduces uncertainty about the dependent variable. The pairs of independent variables are sorted based on these transmission values, and those which pass a chi-squared significance test are put into the set of candidate variables. The maximum size of the candidate set is a user-determined parameter in this algorithm. When more than this number of significant variables are found, the most significant variables are saved and the others are discarded, The H2 analysis amounts to sequentially constructing approximate joint probability distributions for two independent variables (features or measurements) and the dependent variable (classifier output) based on the training data. 
These distributions can be decomposed into conditional distributions for each class (a step we will take in the next section), and allow us to calculate the degree to which "class" is explained by each possible pair of features or measurements., Based on the EDA, we can make two initial simplifications to this implementation. First, reduce the number of inputs as above for the MLPs. Second, select the number of elements to allow in the prototype layer based on the number of non-zero cells in the frequency table calculated for the dependency set during EDA… This significant reduction in prototype elements still allowed us to train perfectly and produced a generalization rate of 93%.,  Now we would like to directly translate our joint probability distribution for the nominal variables into a neural classifier design for the quantitative features and measurements…. As an example, we can take the vectors Cj that partition our quantitative space and use them directly as instars to prototype elements in a Counterprop NN. Thus, the prototypes are completely specified by the binning scheme and the actual range of variable values observed. 
The outstar weights from each prototype element may be directly assigned, using the conditional probabilities of the classes in that region developed in the EDA., wherein a neural network is structured on the basis of a probability computed from pairs of independent features (signals) in a time series (over a period of time – namely, the year 1974) through a statistical dynamic analysis process which determines a joint probability distribution involving each pair of independent features and the dependent (target/output) variable that is used to determine a probabilistic degree to which the dependent variable (i.e., that the events in the time series associated with the two independent features predict the dependent variable) is explained by the two independent features (chi-squared test, information theoretic analysis), wherein this probability is used to partially connect the neural network in a pre-structuring function on the basis of a certain number of pairs of features pre-specified for inclusion in that topology such that the probabilistic threshold that determines the connectivity corresponds to the reduction in uncertainty (table 1) computed from the joint and conditional probabilities (an information theoretic computation of a probabilistic confidence level or a chi-squared test) such that the particular probabilistic characterization of the least important pair over the specified number of pairs determines the threshold, and wherein the neural network that is thereby pre-structured is trained.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Andre and Yanqing to incorporate the teachings of Lendaris to partially connect a sequence of layers in a neural network in which the connectivity from the input layer to the hidden layer is determined by the probability of occurrence of events in a pair of the different sources of signals (collected over a period of time), exceeding a threshold and train the neural network structured according to that partial connectivity.  The modification would have been obvious because one of ordinary skill would have been motivated to improve training and generalization of a neural network by pre-structuring it according to the statistical significance of pair-wise associations of input features into that neural network, including when those features are derived from a time series, thereby mitigating the exponential scaling of computation over the number of features for a given network topology having a specified number of input nodes and specified number of hidden nodes (Lendaris, [Abstract, p. 402, Section 1, p. 411, Section 8.3.1, p. 412, Section 9]).
However, Andre, Yanqing, and Lendaris do not teach … to determine frequencies of subsequent occurrence of events within the period of time for different combinations of pairs of sources of signals; … subsequent … based on the frequencies of subsequent occurrence of events within the period of time; … subsequent … In other words, although Lendaris teaches the determination of neural network connectivity according to a statistical analysis of (pairwise) features obtained from a time series, he does not disclose that this statistical analysis is focused on temporal causality (i.e., that one feature at a given time is analyzed statistically vis a vis another feature at a later time).
However, Conant, in the analogous environment of designing structured neural networks, teaches … evaluate the signals from the source of signals collected over a period of time to determine frequencies of subsequent occurrence of events within the period of time for different combinations of pairs of sources of signals; to determine probabilities of the subsequent occurrence of events for different combinations of the pairs of sources of signals based on the frequencies of subsequent occurrence of events within the period of time; to compare the probabilities of the subsequent occurrence of events for different combinations of pairs of sources of signals with the threshold to determine a connectivity structure …; ([p. 100, “Introduction”, p. 102, “Combinatorial considerations in PRA and DA”, p. 111, “DEDUCE”, p. 112, “The use of intermediate results from dependency analysis”, p. 115, “Results”] In the use of DA for analysis of dynamic structure, one starts with N conditional distributions p_j: {p_j = p(j | SV), j = 1, 2, ..., N} each one involving the N variables of the system plus one variable j delayed in time, so that p_j embodies the relationship between variable j at one time and all N variables of the system at one time unit earlier. It is assumed that the set of these distributions implicitly characterizes the dynamic behavior of the system; DA is the tool which extracts a description of that dynamic behavior from the conditional distributions, or in practice from a time record of actual behavior which implicitly represents them. The end result of DA for dynamic analysis is a set of N "dependencies" denoted by a variable, an arrow, and a set of variables, {j ← D(j), D(j) ⊂ SV, j = 1, ..., N}., The first or "data" constraint arises from the fact that since the analysis of a system is based upon observed probabilities (or frequencies) of the variables and sets of variables, then these probabilities must be supported by enough data to make them credibly representative of the inherent behavior of the system., Its result is a dependency set for each of the N variables, revealing the supposedly causal structure of the system by showing for each variable the other variables which serve as its predictors. In addition to the sets D(j), EDA also reveals for each j the strength of the predictive relationship, measured by T(j: D(j)). If this transmission equals H(j), the entropy of j, then j is completely determined by the variables in D(j), and at the opposite extreme if T(j: D(j)) is only marginally sufficient to meet the chi-squared criterion for significance then the predictive relationship is weak., Consequently DA not only reveals the dependency set D(j) as a simple set of variables, but constructs it by adding variables in decreasing order of significance and finding the marginal significance of each variable as it is added. Naturally the last variable to be added to the set is usually the least significant member, although this is not mathematically necessary. As a consequence it is possible to see, from the intermediate DA results, which variables might be suspect due to a marginal significance close to the preset threshold., The errors made by EDA were almost entirely those in which extra variables were included. A close analysis of the experiments showed that when this occurred it was always with a chi-squared significance close to the margin of acceptability. 
Consequently these errors seem to be a consequence of the arbitrary setting of the probability threshold for DA (P=0.99) …, wherein dynamic analysis of a multivariate time series for the purpose of determining causal dependencies between features is based on the determination of the conditional probabilities that relate a feature (or set of features) at a particular time to a subsequent or a previous time, in which these probabilities characterize the frequency of co-occurrence of events at temporal offsets over a period of time (e.g., over a year), and in which the important dependencies/connectivities between causally related pairs of features are determined through the evaluation of a statistical probability metric relative to a threshold.) 
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Andre, Yanqing, and Lendaris to incorporate the teachings of Conant to partially connect a sequence of layers in a neural network in which the connectivity from the input layer to the hidden layer is determined by the probability of the subsequent occurrence of events in a pair of the different sources of signals according to a frequency of the subsequent occurrence of the events collected over a period of time with the selected connectivities determined by a probability that exceeds a threshold.  The modification would have been obvious because one of ordinary skill would have been motivated to improve the efficiency of performing dependency analysis of multi-variate time series by overcoming combinatorial constraints through dynamic analysis of features which identifies important causal relationships and connections through features using statistical techniques with various levels of statistical support (Conant, [Abstract, pp. 102-104, “Combinatorial Considerations in PRA and DA”, p. 123, “Conclusion”]).
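For illustration only, the recited sequence of steps (count subsequent occurrences per pair of sources, convert the counts to probabilities, compare against a threshold) can be sketched as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation and not an algorithm taken from any cited reference: the events are assumed to be given as a time-ordered list of source indices, "subsequent" is read as "immediately following", the counts are assumed to be normalized by the total number of counted pairs, and the function names and the threshold value are hypothetical.

```python
import numpy as np

def subsequent_occurrence_counts(events, num_sources):
    """Count how often an event of source i is immediately followed by an event of source j.

    `events` is assumed to be a time-ordered sequence of source indices.
    """
    counts = np.zeros((num_sources, num_sources), dtype=int)
    for src_i, src_j in zip(events[:-1], events[1:]):
        counts[src_i, src_j] += 1
    return counts

def subsequent_occurrence_probabilities(counts):
    """Normalize counts by the total number of counted pairs (an assumed normalization)."""
    total = counts.sum()
    if total == 0:
        return np.zeros(counts.shape, dtype=float)
    return counts / total

def connectivity_structure(probabilities, threshold):
    """Keep a connection i -> j only where the probability exceeds the threshold."""
    return probabilities > threshold

# Hypothetical event log over three sources (indices 0, 1, 2).
events = [0, 1, 0, 1, 2, 0, 1]
counts = subsequent_occurrence_counts(events, num_sources=3)
probs = subsequent_occurrence_probabilities(counts)
mask = connectivity_structure(probs, threshold=0.25)
```

In this toy log, source 0 is followed by source 1 in three of the six counted pairs, so only the 0 -> 1 connection survives a threshold of 0.25.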

In regards to claim 8, the rejection of claim 1 is incorporated and Andre further teaches further comprising: a neural network trainer configured to evaluate the signals from the source of signals collected over a period of time … to form the neural network according to the connectivity structure of the neural network, such that a number of nodes in the input layer equals a first multiple of a number of the source of signals in the system, and a number of nodes in the first hidden layer following the input layer equals a second multiple of the number of the sources of signals, wherein the input layer is partially connected to the first hidden layer according to the connectivity structure; and to train the neural network using the signals collected over the period of time. ([p. 954, Section 4.2, p. 958, Section 6.1.2, Figure 5] The physical relationship between input and output data is established through the weight adaptation of the parameters k, n as well as p0–p4 during the training of Eq. (5)., All data were measured directly in the vehicle and contain all required signals like current, voltage and temperature. Most of the sets have a length of about five to seven hours and are therefore very suitable for a training or validation., wherein the pre-structured neural network SNN1 is trained using data collected over a specified period of time in which the structure of that neural network is determined according to a priori knowledge of system dynamics such that the neural topology exhibits a number of nodes in the input layer (6) which is a multiple of the number of source signals for that neural network (3 – T, SOC, imeas) and such that the number of nodes in the first hidden layer (3) is also a multiple of the number of source signals (i.e., multiple = 1).) 
However, Andre and Yanqing do not explicitly disclose to determine frequencies of subsequent occurrence of events within the period of time for different combinations of pairs of sources of signals; to compare the frequencies of the subsequent occurrence of events for different combinations of pairs of sources of signals with the threshold to determine a connectivity structure of the neural network. In Andre, the connectivity is determined by a priori information about system dynamics rather than by statistics. Yanqing does not make use of partial connectivity.
However, Lendaris, in the analogous environment of designing structured neural networks, teaches a neural network trainer configured to evaluate the signals from the source of signals collected over a period of time … to compare … the subsequent occurrence of events for different combinations of pairs of sources of signals with the threshold to determine a connectivity structure of the neural network; to form the neural network according to the connectivity structure of the neural network, such that a number of nodes in the input layer equals a first multiple of a number of the source of signals in the system, and a number of nodes in the first hidden layer following the input layer equals a second multiple of the number of the sources of signals, wherein the input layer is partially connected to the first hidden layer according to the connectivity structure; and to train the neural network using the signals collected over the period of time. ([p. 408, Section 8.1, p. 406, Section 4.3, p. 406, Section 5, p. 407, Section 5.1, p. 409, Section 8.2, pp. 411-412, Section 8.3.2, p. 412, Section 8.3.3, Table 1] During the 1974 project, 57 "features" were empirically extracted from the FDP data points [15]., Keeping in mind that the likely candidate NN structure to use in the absence of a priori information is the fully connected (structure-zero) NN, these results further press home the idea of obtaining as much a priori information as possible for use in selecting the NN structure before training. 
Similar positive results are demonstrated for the other research predictions suggested earlier., An attribute of the EDA that turns out being especially useful in our NN design context is that in the process of determining the structural information (which is based on "binning" the continuous data to develop categorical data upon which the EDA operates), it also provides substantial information about the probability densities represented by the data. The latter may be incorporated in the NN design; a key benefit of such transition to the NN with this information is that the NN, in contrast to EDA, operates with the full metric (as opposed to categorical) problem domain., The dynamic algorithm consists of a pair of heuristics labeled H2 and H3, together with some minimal form of reconstructability analysis. For each dependent variable in the analysis, H2 calculates the three-way transmission between the dependent variable and every possible pair of independent variables in the analysis. These values provide a measure of how much the knowledge of independent variables reduces uncertainty about the dependent variable. The pairs of independent variables are sorted based on these transmission values, and those which pass a chi-squared significance test are put into the set of candidate variables. The maximum size of the candidate set is a user-determined parameter in this algorithm. When more than this number of significant variables are found, the most significant variables are saved and the others are discarded., The H2 analysis amounts to sequentially constructing approximate joint probability distributions for two independent variables (features or measurements) and the dependent variable (classifier output) based on the training data. 
These distributions can be decomposed into conditional distributions for each class (a step we will take in the next section), and allow us to calculate the degree to which "class" is explained by each possible pair of features or measurements., Based on the EDA, we can make two initial simplifications to this implementation. First, reduce the number of inputs as above for the MLPs. Second, select the number of elements to allow in the prototype layer based on the number of non-zero cells in the frequency table calculated for the dependency set during EDA… This significant reduction in prototype elements still allowed us to train perfectly and produced a generalization rate of 93%.,  Now we would like to directly translate our joint probability distribution for the nominal variables into a neural classifier design for the quantitative features and measurements…. As an example, we can take the vectors Cj that partition our quantitative space and use them directly as instars to prototype elements in a Counterprop NN. Thus, the prototypes are completely specified by the binning scheme and the actual range of variable values observed. 
The outstar weights from each prototype element may be directly assigned, using the conditional probabilities of the classes in that region developed in the EDA., wherein a neural network is structured on the basis of a probability computed from pairs of independent features (signals) in a time series (over a period of time – namely, the year 1974) through a statistical dynamic analysis process which determines a joint probability distribution involving each pair of independent features and the dependent (target/output) variable that is used to determine a probabilistic degree to which the dependent variable (i.e., that the events in the time series associated with the two independent features predict the dependent variable) is explained by the two independent features (chi-squared test, information theoretic analysis), wherein this probability is used to partially connect the neural network in a pre-structuring function on the basis of a certain number of pairs of features pre-specified for inclusion in that topology such that the probabilistic threshold that determines the connectivity corresponds to the reduction in uncertainty (table 1) computed from the joint and conditional probabilities (an information theoretic computation of a probabilistic confidence level or a chi-squared test) such that the particular probabilistic characterization of the least important pair over the specified number of pairs determines the threshold, and wherein the neural network that is thereby pre-structured is trained.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Andre and Yanqing to incorporate the teachings of Lendaris to partially connect a sequence of layers in a neural network in which the connectivity from the input layer to the hidden layer is determined by the probability of occurrence of events in a pair of the different sources of signals (collected over a period of time), exceeding a threshold and train the neural network structured according to that partial connectivity.  The modification would have been obvious because one of ordinary skill would have been motivated to improve training and generalization of a neural network by pre-structuring it according to the statistical significance of pair-wise associations of input features into that neural network, including when those features are derived from a time series, thereby mitigating the exponential scaling of computation over the number of features for a given network topology having a specified number of input nodes and specified number of hidden nodes (Lendaris, [Abstract, p. 402, Section 1, p. 411, Section 8.3.1, p. 412, Section 9]).
However, Andre, Yanqing, and Lendaris do not teach to determine frequencies of subsequent occurrence of events within the period of time for different combinations of pairs of sources of signals; … the frequencies of… In other words, although Lendaris teaches the determination of neural network connectivity according to a statistical analysis of (pairwise) features obtained from a time series, he does not disclose that this statistical analysis is focused on temporal causality (i.e., that one feature at a given time is analyzed statistically vis a vis another feature at a later time) or, explicitly, that the probabilities correspond to a frequency of occurrence.
However, Conant, in the analogous environment of designing structured neural networks, teaches … evaluate the signals from the source of signals collected over a period of time to determine frequencies of subsequent occurrence of events within the period of time for different combinations of pairs of sources of signals; to compare the frequencies of the subsequent occurrence of events for different combinations of pairs of sources of signals with the threshold to determine a connectivity structure …; ([p. 100, “Introduction”, p. 102, “Combinatorial considerations in PRA and DA”, p. 111, “DEDUCE”, p. 112, “The use of intermediate results from dependency analysis”, p. 115, “Results”] In the use of DA for analysis of dynamic structure, one starts with N conditional distributions p_j: {p_j = p(j | SV), j = 1, 2, ..., N} each one involving the N variables of the system plus one variable j delayed in time, so that p_j embodies the relationship between variable j at one time and all N variables of the system at one time unit earlier. It is assumed that the set of these distributions implicitly characterizes the dynamic behavior of the system; DA is the tool which extracts a description of that dynamic behavior from the conditional distributions, or in practice from a time record of actual behavior which implicitly represents them. The end result of DA for dynamic analysis is a set of N "dependencies" denoted by a variable, an arrow, and a set of variables, {j ← D(j), D(j) ⊂ SV, j = 1, ..., N}., The first or "data" constraint arises from the fact that since the analysis of a system is based upon observed probabilities (or frequencies) of the variables and sets of variables, then these probabilities must be supported by enough data to make them credibly representative of the inherent behavior of the system., Its result is a dependency set for each of the N variables, revealing the supposedly causal structure of the system by showing for each variable the other variables which serve as its predictors. In addition to the sets D(j), EDA also reveals for each j the strength of the predictive relationship, measured by T(j: D(j)). If this transmission equals H(j), the entropy of j, then j is completely determined by the variables in D(j), and at the opposite extreme if T(j: D(j)) is only marginally sufficient to meet the chi-squared criterion for significance then the predictive relationship is weak., Consequently DA not only reveals the dependency set D(j) as a simple set of variables, but constructs it by adding variables in decreasing order of significance and finding the marginal significance of each variable as it is added. Naturally the last variable to be added to the set is usually the least significant member, although this is not mathematically necessary. As a consequence it is possible to see, from the intermediate DA results, which variables might be suspect due to a marginal significance close to the preset threshold., The errors made by EDA were almost entirely those in which extra variables were included. A close analysis of the experiments showed that when this occurred it was always with a chi-squared significance close to the margin of acceptability. 
Consequently these errors seem to be a consequence of the arbitrary setting of the probability threshold for DA (P=0.99) …, wherein dynamic analysis of a multivariate time series for the purpose of determining causal dependencies between features is based on the determination of the conditional probabilities that relate a feature (or set of features) at a particular time to a subsequent or a previous time, in which these probabilities characterize the frequency of co-occurrence of events at temporal offsets over a period of time (e.g., over a year), and in which the important dependencies/connectivities between causally related pairs of features are determined through the evaluation of a statistical probability metric relative to a threshold.) 
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Andre, Yanqing, and Lendaris to incorporate the teachings of Conant to partially connect a sequence of layers in a neural network in which the connectivity from the input layer to the hidden layer is determined by the probability of the subsequent occurrence of events in a pair of the different sources of signals according to a frequency of the subsequent occurrence of the events collected over a period of time with the selected connectivities determined by a probability that exceeds a threshold.  The modification would have been obvious because one of ordinary skill would have been motivated to improve the efficiency of performing dependency analysis of multi-variate time series by overcoming combinatorial constraints through dynamic analysis of features which identifies important causal relationships and connections through features using statistical techniques with various levels of statistical support (Conant, [Abstract, pp. 102-104, “Combinatorial Considerations in PRA and DA”, p. 123, “Conclusion”]).
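For illustration only, the claim 8 limitation (an input layer and a first hidden layer whose node counts are multiples of the number of sources, partially connected according to a source-level connectivity structure) can be sketched as a masked weight matrix. This is a minimal sketch under stated assumptions, not the method of the application or of any cited reference: the block-wise expansion of the source-level mask, the `tanh` activation, and all function names are hypothetical choices made for the example. The multiples used below (2 and 1) mirror the 6-input, 3-hidden-node topology over 3 source signals noted above for Andre.

```python
import numpy as np

def expand_mask(source_mask, in_multiple, hidden_multiple):
    """Expand an (M x M) source-level mask to an (M*in_multiple x M*hidden_multiple) node-level mask.

    Rows index input-side sources and columns index hidden-side sources (an assumed layout).
    """
    block = np.ones((in_multiple, hidden_multiple), dtype=int)
    return np.kron(np.asarray(source_mask, dtype=int), block).astype(bool)

def masked_forward(x, weights, node_mask):
    """Forward pass through a partially connected layer: masked-out weights contribute nothing."""
    return np.tanh(x @ (weights * node_mask))

# Hypothetical source-level connectivity over M = 3 sources.
source_mask = np.array([[1, 1, 0],
                        [0, 1, 0],
                        [0, 0, 1]], dtype=bool)

node_mask = expand_mask(source_mask, in_multiple=2, hidden_multiple=1)  # shape (6, 3)
weights = np.ones((6, 3))
x = np.ones((1, 6))
out = masked_forward(x, weights, node_mask)
```

During training, the same mask would typically be applied to the weight updates as well so that pruned connections stay at zero; that step is omitted here.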

In regards to claim 11, the rejection of claim 1 is incorporated and Andre, Yanqing, and Lendaris do not teach wherein the subsequent occurrence of the events in the pair of the different sources of signals is a consecutive occurrence of events in a time sequence of all events of the system.  In other words, although Lendaris teaches the determination of neural network connectivity according to a statistical analysis of (pairwise) features obtained from a time series, he does not disclose that this statistical analysis is focused on temporal causality (i.e., that one feature at a given time is analyzed statistically vis a vis another feature at a later time).
However, Conant, in the analogous environment of designing structured neural networks, teaches wherein the subsequent occurrence of the events in the pair of the different sources of signals is a consecutive occurrence of events in a time sequence of all events of the system. ([p. 101, “Introduction”] The dynamic version of DA can be described in the language of Mask Analysis by observing that DA begins with a set of N masks, each of which includes N variables at the reference time plus one variable delayed by one time unit, and for each of these masks DA derives the optimum submask containing the delayed variable plus the smallest acceptable set of variables at the reference time., wherein dynamic analysis of a multivariate time series for the purpose of determining causal dependencies between features is based on the determination of the conditional probabilities that relate a feature (or set of features) at a particular time to an offset (subsequent) time in which, although this method generalizes to any number of delay steps, this method of analysis (especially DA but also in a more general sense Mask Analysis) is focused on temporal dependencies associated with a single delay increment.) 
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Andre, Yanqing, and Lendaris to incorporate the teachings of Conant to partially connect a sequence of layers in a neural network in which the connectivity from the input layer to the hidden layer is determined by the probability of the subsequent occurrence of events in a pair of the different sources of signals according to a frequency of the subsequent occurrence of the events collected over a period of time in which the subsequent occurrence is a consecutive occurrence with the selected connectivities determined by a probability that exceeds a threshold.  The modification would have been obvious because one of ordinary skill would have been motivated to improve the efficiency of performing dependency analysis of multi-variate time series by overcoming combinatorial constraints through dynamic analysis of features which identifies important causal relationships and connections through features using statistical techniques with various levels of statistical support including causal dependencies corresponding to consecutive occurrences (Conant, [Abstract, pp. 102-104, “Combinatorial Considerations in PRA and DA”, p. 123, “Conclusion”]).
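For illustration only, the kind of single-delay dependency measure discussed above (a "transmission" between one variable at the reference time and another variable one time unit later) can be sketched as a lagged mutual information estimated from observed frequencies. This is a minimal sketch under stated assumptions and is not Conant's DA or EDA procedure: the inputs are assumed to be already binned into categorical values, no chi-squared significance test is performed, and the function name is hypothetical.

```python
import numpy as np

def lagged_transmission(x, y, lag=1):
    """Estimate the transmission (mutual information, in bits) between x[t] and y[t + lag].

    Both inputs are assumed to be equal-length categorical (binned) sequences.
    """
    if lag < 1:
        raise ValueError("lag must be at least 1")
    x = np.asarray(x)[:-lag]
    y = np.asarray(y)[lag:]
    x_values, x_idx = np.unique(x, return_inverse=True)
    y_values, y_idx = np.unique(y, return_inverse=True)

    # Joint frequency table of (x[t], y[t + lag]) pairs, normalized to probabilities.
    joint = np.zeros((len(x_values), len(y_values)))
    np.add.at(joint, (x_idx, y_idx), 1)
    joint /= joint.sum()

    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nonzero = joint > 0
    return float(np.sum(joint[nonzero] * np.log2(joint[nonzero] / (px @ py)[nonzero])))

# Hypothetical binned signals: y repeats x with a delay of one time unit.
x = [0, 1, 0, 1, 0, 1, 0, 1]
y = [0, 0, 1, 0, 1, 0, 1, 0]
t_xy = lagged_transmission(x, y, lag=1)
```

When the delayed variable is fully determined by the earlier one, as in this toy input, the estimated transmission equals the entropy of the delayed variable; a pair would be retained only if its measure passes whatever significance threshold is chosen.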

Claim 12 is also rejected because it is just a method implementation of the same subject matter of claim 1, which can be found in Andre, Yanqing, Lendaris, and Conant. It is noted that claim 12 also recites stored instructions for implementing the method, which can also be found in Andre (e.g., [p. 959, Section 6.3] The second one can be evaluated by comparing the developed SNN, consisting of two basic arithmetic functions, with the extended Kalman filter (EKF) design. It is obviously, that the SOH estimator design by SNN has a lower computational complexity. To calculate the Kalman gain for a nth order state space model, a covariance matrix of n × n must be solved at every sample step. This can be evidenced by simulating the 8700 s of the drive profile from Fig. 10 on a commercial computer and measuring the required computation time. Whereas the SNN takes 4.7 s, applying the EKF on the same profile requires 13.5 s for computation.)

Claim 13/12 is also rejected because it is just a method implementation of the same subject matter of claim 2/1, which can be found in Andre, Yanqing, Lendaris, and Conant.

Claim 16/12 is also rejected because it is just a method implementation of the same subject matter of claim 11/1, which can be found in Andre, Yanqing, Lendaris, and Conant.

Claim 17 is also rejected because it is just a computer readable storage medium implementation of the same subject matter of claim 1, which can be found in Andre, Yanqing, Lendaris, and Conant. It is noted that claim 12 also recites stored instructions for implementing the method, which can also be found in Andre (e.g., [p. 959, Section 6.3] The second one can be evaluated by comparing the developed SNN, consisting of two basic arithmetic functions, with the extended Kalman filter (EKF) design. It is obviously, that the SOH estimator design by SNN has a lower computational complexity. To calculate the Kalman gain for a nth order state space model, a covariance matrix of n × n must be solved at every sample step. This can be evidenced by simulating the 8700 s of the drive profile from Fig. 10 on a commercial computer and measuring the required computation time. Whereas the SNN takes 4.7 s, applying the EKF on the same profile requires 13.5 s for computation.)

Claim 18/12 is also rejected because it is just a method implementation of the same subject matter of claim 11/1, which can be found in Andre, Yanqing, Lendaris, and Conant.

Claims 3 and 9 are rejected under 35 U.S.C. 103 as being unpatentable over Andre, in view of Yanqing, in view of Lendaris, in view of Conant, and in further view of Egri et al. (“Cross-correlation based clustering and dimension reduction of multivariate time series”, INES 21st International Conference on Intelligent Engineering Systems, October, 2017, pp. 1-6), hereinafter referred to as Egri.

In regards to claim 3, the rejection of claim 2 is incorporated and Andre, Yanqing, Lendaris, and Conant do not further teach wherein the probability of subsequent occurrence of the events of signal Si followed by events of signal Sj can be defined as
[Equation image: media_image1.png]
where M is number of signals, cij is the number of times events of signal Si followed by events of signal Sj.  Andre and Yanqing do not determine neural connectivity according to a probability. Lendaris discloses neural connectivity according to statistical analysis but uses an information theoretic method in lieu of a statistically robust sample set and does not characterize the pertinent statistics according to a cross correlation between temporally offset features. Conant uses a cross-entropy (and chi-square) statistical metric for dependency analysis which does not have the form of the recited equation.
However, Egri, in the analogous environment of using statistical analysis of multivariate time series for feature reduction, teaches wherein the probability of subsequent occurrence of the events of signal Si followed by events of signal Sj can be defined as
    [equation image: media_image1.png]
where M is number of signals, cij is the number of times events of signal Si followed by events of signal Sj. ([p. 241, Section II, p. 242, Section IIIA, p. 243, Section IIIB] A time series is univariate if d = 1 and multivariate if d ≥ 2. A natural representation of a multivariate time series is a data matrix A ∈ R n×d , where n (the number of rows) denotes the cardinality of the index set, i.e., the number of time stamps and d (the number of columns) refers to the number of variables (also known as attributes, features or sensors),….
    [equation image: media_image2.png]
We note that this similarity measure is a symmetric measure, thus sim(A.,i, A.,j ) = sim(A.,j , A.,i) and its values ranges from 0 (no relation) to 1 (strong relation)…. 
For any (dis)similarity matrix A, a weighted undirected graph G can be constructed: every vertex i ∈ V (G) corresponds to an attribute (i.e. the A.,i column of matrix A) and two vertices i and j are connected with an edge ei,j ∈ E(G) of weight w(ei,j ) := sim(A.,i, A.,j ). In order to uncover the connections in the multi-dimensional time series A, in other words, to find attributes that are similar (i.e., the clusters) we have to reveal the community structure of the graph G. …we only keep the edges with weight above a certain threshold δ (e.g. we remove an ei,j if w(ei,j ) ≤ δ, then we can identify the communities (or clusters) as the connected components of G0 (similarly to the concept of clique cluster from [19])., wherein the cross correlation metric is computed for pair-wise, temporally offset features (signals) in a multi-variate time series such that only graph network connections between these temporally offset features are retained if the similarity function computed from the cross correlation exceeds a threshold, and wherein the similarity function inherently performs the same function as the recited equation because the absolute value of cross correlation (as seen in the similarity metric) between a feature at time t and another feature at time t+tau is an expectation (probability) of co-occurrence of events represented by those two features (particularly for zero mean processes with the signals generated by the presence of a particular event represented bimodally) and because the denominator of the recited equation is independent of i and j (i.e., of the particular two signals chosen for the multi-variate analysis) so that the denominator may be subsumed in the threshold that determines the connectivity without changing the underlying functionality of the claims, and wherein it is also noted that the similarity maps to values between 0 and 1 and may be interpreted as a probability that expresses a strength/likelihood of association between the two features.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Andre, Yanqing, Lendaris, and Conant to incorporate the teachings of Egri to partially connect a sequence of layers in a neural network in which the connectivity from the input layer to the hidden layer is determined by the probability of occurrence of events in a pair of the different sources of signals given by
    [equation image: media_image1.png]
where M is number of signals, cij is the number of times events of signal Si followed by events of signal Sj. The modification would have been obvious because one of ordinary skill would have been motivated to improve dimension reduction and noise tolerance in determining temporal signal dependencies among variables in a multi-variate time series by using a feature similarity measure based on the cross-correlation function that characterizes a statistical/probabilistic strength of the dependencies (Egri, [Abstract, p. 241, Section I, p. 245, Section V, Figure 2, Figure 3]).
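For illustration only, the thresholded-similarity clustering described in the cited Egri passage can be sketched as follows. This is a minimal sketch under stated assumptions: the similarity used here (the largest absolute normalized cross-correlation over a small range of lags) stands in for the formula shown in the Egri equation image, which is not reproduced in this text, and the names `similarity`, `similarity_graph_clusters`, `max_lag`, and `delta` are hypothetical.

```python
import numpy as np

def similarity(x, y, max_lag=5):
    """Assumed similarity: largest |normalized cross-correlation| over lags 0..max_lag.

    Both lag directions are checked so the measure is symmetric, and the
    result lies between 0 (no relation) and 1 (strong relation).
    """
    x = (x - x.mean()) / (x.std() + 1e-12)
    y = (y - y.mean()) / (y.std() + 1e-12)
    n = len(x)
    best = 0.0
    for lag in range(max_lag + 1):
        forward = abs(np.dot(x[: n - lag], y[lag:])) / n   # x leads y by `lag`
        backward = abs(np.dot(y[: n - lag], x[lag:])) / n  # y leads x by `lag`
        best = max(best, forward, backward)
    return min(best, 1.0)

def similarity_graph_clusters(data, delta, max_lag=5):
    """Keep only edges with weight above delta; return connected components.

    `data` is an n x d array (rows are time stamps, columns are signals).
    Each component is returned as a sorted list of column indices.
    """
    d = data.shape[1]
    adjacency = {i: set() for i in range(d)}
    for i in range(d):
        for j in range(i + 1, d):
            if similarity(data[:, i], data[:, j], max_lag) > delta:
                adjacency[i].add(j)
                adjacency[j].add(i)
    seen, components = set(), []
    for start in range(d):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:  # depth-first search over the retained edges
            u = stack.pop()
            component.append(u)
            for v in adjacency[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        components.append(sorted(component))
    return components
```

In this sketch, a signal and a time-shifted copy of it end up in the same cluster, while an unrelated signal remains in a cluster of its own.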



In regards to claim 9, Andre, Yanqing, Lendaris, and Conant do not explicitly teach wherein the trainer forms a signal connection matrix representing the frequencies of the subsequent occurrence of the events. Although Lendaris and Conant teach the evaluation of the dependency between all possible pairs of features/signals, with Conant, in particular, teaching this in the context of a causal dependency relationship, and with Lendaris, in particular, teaching the use of this relationship in forming the pre-structured topology of a neural network, neither one discloses a matrix representation of this dependency.
However, Egri, in the analogous environment of using statistical analysis of multivariate time series for feature reduction, teaches wherein the trainer forms a signal connection matrix representing the frequencies of the subsequent occurrence of the events. ([p. 241, Section II, p. 242, Section IIIA, p. 243, Section IIIB] A time series is univariate if d = 1 and multivariate if d ≥ 2. A natural representation of a multivariate time series is a data matrix A ∈ R n×d , where n (the number of rows) denotes the cardinality of the index set, i.e., the number of time stamps and d (the number of columns) refers to the number of variables (also known as attributes, features or sensors),… In particular, the sample cross-correlations of two times series can be estimated by averaging the product of samples measured from one process and samples measured from the other [31]:…We note that this similarity measure is a symmetric measure, thus sim(A.,i, A.,j ) = sim(A.,j , A.,i) and its values ranges from 0 (no relation) to 1 (strong relation)…., For any (dis)similarity matrix A, a weighted undirected graph G can be constructed: every vertex i ∈ V (G) corresponds to an attribute (i.e. the A.,i column of matrix A) and two vertices i and j are connected with an edge ei,j ∈ E(G) of weight w(ei,j ) := sim(A.,i, A.,j ). In order to uncover the connections in the multi-dimensional time series A, in other words, to find attributes that are similar (i.e., the clusters) we have to reveal the community structure of the graph G. …we only keep the edges with weight above a certain threshold δ (e.g. we remove an ei,j if w(ei,j ) ≤ δ, then we can identify the communities (or clusters) as the connected components of G0 (similarly to the concept of clique cluster from [19])., wherein the cross correlation metric is computed for pair-wise, temporally offset features (signals) in a multi-variate time series (with the cross correlation metric representing the frequency/expectation/probability of co-occurrence of temporally offset events in two signals/features) such that only graph network connections between these temporally offset features are retained if the similarity function computed from the cross correlation exceeds a threshold, wherein the similarity function maps the cross-correlation metric to a strength of connection metric normalized between 0 and 1 which is being interpreted as a probability/likelihood that a given value of a feature at a particular time is associated with a value of a subsequent feature at a later time, i.e., the cross-correlation is a probabilistic expectation measure which characterizes a likelihood of the co-occurrence of events represented by those two features, and wherein the weight of edge e_(i,j) is a representation of the (causal) dependency between any given two features/signals, which is a matrix representation for that dependency with matrix element w(e_i,j).)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Andre, Yanqing, Lendaris, and Conant to incorporate the teachings of Egri to partially connect a sequence of layers in a neural network in which the connectivity from the input layer to the hidden layer is determined by the probability/frequency of occurrence of subsequent events in one signal in a pair of the different sources of signals relative to the other signal in that pair with the resultant connectivity represented by a matrix. The modification would have been obvious because one of ordinary skill would have been motivated to improve dimension reduction and noise tolerance in determining temporal signal dependencies among variables in a multi-variate time series by using a feature similarity measure based on the cross-correlation function that characterizes a statistical/probabilistic strength of the dependencies in a dissimilarity matrix (Egri, [Abstract, p. 241, Section I, p. 245, Section V, Figure 2, Figure 3]).
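For illustration only, a signal connection matrix of the kind recited in claim 9 can be sketched as a table of counts c[i][j] of how often an event of signal Si is immediately followed by an event of signal Sj. This is a minimal sketch under stated assumptions: the event log format, the function names `follow_counts` and `connection_mask`, and the normalization by the total count (chosen to match the remark above that the denominator does not depend on i and j) are all hypothetical, since the recited equation is not reproduced in this text.

```python
import numpy as np

def follow_counts(events, num_signals):
    """Count how often an event of signal i is immediately followed by one of signal j.

    `events` is a time-ordered list of signal indices, one entry per event.
    """
    counts = np.zeros((num_signals, num_signals), dtype=int)
    for current, following in zip(events, events[1:]):
        counts[current, following] += 1
    return counts

def connection_mask(counts, threshold):
    """Retain a connection i -> j only if its relative frequency is above the threshold.

    Assumption: frequencies are normalized by the total number of counted pairs.
    """
    total = counts.sum()
    if total == 0:
        return np.zeros(counts.shape, dtype=bool)
    return (counts / total) > threshold
```

For the event sequence 0, 1, 2, 0, 1, 0, 2 over three signals, the pair (0, 1) occurs twice out of six counted pairs, so it is the only connection retained at a threshold of 0.3.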

Claims 4, 5, 14, and 19 are rejected under 35 U.S.C. 103 as being unpatentable over Andre, in view of Yanqing, in view of Lendaris, in view of Conant, and in further view of Sun et al. (“Compressed time delay neural network for small-footprint keyword spotting”, Interspeech, 2017, pp. 3607-3611), hereinafter referred to as Sun.

In regards to claim 4, the rejection of claim 2 is incorporated and Andre, Yanqing, Lendaris, and Conant do not further teach wherein the neural network is a time delay neural network (TDNN), and wherein the multiple for the number of nodes in the input layer equals a number of time steps in the delay of the TDNN. Andre, Lendaris, and Conant do not teach a TDNN even though Andre teaches the determination and prediction of temporal behavior of the system. Yanqing teaches the use of a single time-delay step in his neural network to form the inputs into the system.
However, Sun, in the analogous environment of designing a TDNN for processing a time series of features for a recognition application, teaches wherein the neural network is a time delay neural network (TDNN), and wherein the multiple for the number of nodes in the input layer equals a number of time steps in the delay of the TDNN. ([p. 2, Section 2, p. 2, Section 2, p. 3, section 3.4, Figure 2] The log mel filter-bank energies (LFBEs) are computed as input acoustic features. We extract 20-dimensional LFBEs over 25ms frames with a 10ms frame shift., By contrast, a TDNN processes the information from the context window in a hierarchical way. Figure 2 shows the architecture of our TDNN based acoustic model for keyword spotting. The input layer of TDNN focuses on modeling a narrow context, while the deeper layers of TDNN work on modeling wider temporal context information. For each hidden layer of TDNN, its parameters are tied across different time stamps, with its lower layers trained to learn translation invariant feature forms., Instead, we apply SVD to approximate all hidden layers of our TDNN network, as well as the input layer. Figure 2 shows bottleneck layers (labeled by ‘BN’) are added to our TDNN model….
The dimensions of the linear bottleneck layers are selected to meet the parameter budget, and they are within a reasonable range to maintain the performance of full-rank TDNN measured by frame accuracy on the cross-validation set., wherein a TDNN is designed with bottleneck layers such that for the first bottleneck layer (hidden layer) (Figure 2) each node of that layer is connected to a sub-sampling component of the signal (each sub-sampling interpreted as corresponding to a distinct signal) which is processed through the same number of time delay steps as each other sub-sampling component of the signal so that the total number of inputs is equal to the number of time steps (5 in Figure 2) multiplied by the number of nodes in the first hidden layer but such that, in general, a dimension of the hidden layer is selected on the basis of the parameter budget (which still accommodates a full-rank TDNN performance level) which does not exclude the case in which the number of nodes in the input layer equals the product of the number of delay steps and the number of nodes in the hidden layer.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Andre, Yanqing, Lendaris, and Conant to incorporate the teachings of Sun to partially connect a sequence of layers in a neural network in which the number of nodes in the input layer equals the number of time delays times the number of nodes in the hidden layer of a TDNN.  The modification would have been obvious because one of ordinary skill would have been motivated to improve the computational efficiency and recognition performance of a TDNN by the inclusion of a bottleneck layer generalization in a recognition process associated with time series features in which the number of nodes in the bottleneck layer is related to number of component signals in the input according to the number of delay steps in the TDNN for a system which predicts a state of the system based on a time series of signal inputs (Sun, [Abstract, p. 3, Section 3.4, p. 4, Section 5, Table 2]).
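For illustration only, the relationship between the number of delay steps and the size of the input layer can be sketched as follows. This is a minimal sketch assuming one input node per signal per delay step, so that the input layer has (number of delay steps) × (number of signals) nodes; the function name `tapped_delay_inputs` and the array layout are hypothetical and are not taken from Sun or from the claims.

```python
import numpy as np

def tapped_delay_inputs(signals, num_steps):
    """Build TDNN-style input vectors from a multivariate time series.

    `signals` has shape (T, M): T time stamps and M signals. Each output row
    concatenates `num_steps` consecutive samples of all M signals, so each
    input vector has num_steps * M entries (one per input node).
    """
    num_samples, _ = signals.shape
    windows = [
        signals[t : t + num_steps].reshape(-1)
        for t in range(num_samples - num_steps + 1)
    ]
    return np.stack(windows)
```

With six time stamps of two signals and three delay steps, this yields four input vectors of 3 × 2 = 6 entries each.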

In regards to claim 5, the rejection of claim 4 is incorporated and Andre, Yanqing, Lendaris, and Conant do not further teach wherein the TDNN is a time delay feedforward neural network trained based on a supervised learning or a time delay auto-encoder neural network trained based on an unsupervised learning. Andre, Lendaris, and Conant do not teach a TDNN although Andre and Lendaris teach a pre-structured feed forward neural network structure. Yanqing teaches the use of a single time-delay step in his (feed forward) neural network to form the inputs into the system.
However, Sun, in the analogous environment of designing a TDNN for processing a time series of features for a recognition application, teaches wherein the TDNN is a time delay feedforward neural network trained based on a supervised learning or a time delay auto-encoder neural network trained based on an unsupervised learning. ([p. 3, Section 3.3, p. 3, section 3.4, p. 3, Section 3.1, Figure 2] Figure 3 shows the architecture for our multi-task training scheme, with an auxiliary training task with the LVCSR targets. As a result, we prepare two sets of targets for TDNN training data. One set is used for keyword spotting, which represents HMM states shown in Figure 1. The other set consists the LVCSR targets., To train TDNN with SVD approximation, we start with training a larger size full-rank TDNN at first. After that, we add linear bottleneck layers initialized by SVD of the full-rank affine matrices to TDNN, one layer at time, starting from the input layer. The dimensions of the linear bottleneck layers are selected to meet the parameter budget, and they are within a reasonable range to maintain the performance of full-rank TDNN measured by frame accuracy on the cross-validation set. One epoch of pre-training is applied at each time when a bottleneck layer is added. Finally the SVD-compressed TDNN is trained with additional epochs for the purpose of fine-turning., We use exponential decaying learning rate scheduling for TDNN training, including all stages of LVCSR TDNN training, full-rank TDNN multi-task training, and SVD-compressed TDNN multi-task fine-tuning. The initial learning rate is set to be 0.008 for both LVCSR TDNN training and full-rank TDNN multi-task training, and 0.000125 for SVD-compressed TDNN multi-task fine-tuning.
The decaying factor is 2 for the first few epochs, and it is reduced to 1.2 for annealing in the remaining epochs., wherein the TDNN has a feed-forward architecture (Figure 2) and is trained using transfer learning and multi-task learning but specifically employs supervised learning because of the indication of a cross-validation test set, the use of two sets of targets (interpreted as labeled data) for the training, and the structure of the training process (decaying factor, learning rates, etc.).) 
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Andre, Yanqing, Lendaris, and Conant to incorporate the teachings of Sun to partially connect a sequence of layers in a neural network in which the number of nodes in the input layer equals the number of time delays times the number of nodes in the hidden layer of a TDNN trained with supervised learning.  The modification would have been obvious because one of ordinary skill would have been motivated to improve the computational efficiency and recognition performance of a TDNN by the inclusion of a bottleneck layer generalization in a recognition process associated with time series features in which the number of nodes in the bottleneck layer is related to number of component signals in the input according to the number of delay steps in the TDNN in which the feed-forward structure is trained through supervised learning techniques including multi-task learning (Sun, [Abstract, p. 3, Section 3.4, p. 4, Section 5, Table 2]).

Claim 14/13 is also rejected because it is merely a method implementation of the same subject matter of claims 4/2 and 5/4, which can be found in Andre, Yanqing, Lendaris, Conant, and Sun.

Claim 19/17 is also rejected because it is merely a computer readable storage medium implementation of the same subject matter of claims 4/2 and 5/4, which can be found in Andre, Yanqing, Lendaris, Conant, and Sun.

Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over Andre, in view of Yanqing, in view of Lendaris, in view of Conant, and in further view of Adali et al. (“NOx and CO Prediction in Fossil Fuel Plants by Time Delay Neural Networks”, Integrated Computer-Aided Engineering, 6(1), 1999, pp. 27-39), hereinafter referred to as Adali.





In regards to claim 10, the rejection of claim 1 is incorporated and Andre, Yanqing, Lendaris, and Conant do not further teach wherein the system is a manufacturing production line including one or combination of a process manufacturing and discrete manufacturing. Andre and Yanqing teach the application of this system to a recognition process in a vehicle system. Lendaris teaches a generalized approach for pre-structuring neural networks that is applied to a recognition problem. Conant also teaches a generalized approach for dependency analysis and applies it to a network of automata. It is noted, however, that the claim recites an intended use which does not have patentable weight.
However, Adali, in the analogous art of using a structured TDNN for detecting system problems, teaches wherein the system is a manufacturing production line including one or combination of a process manufacturing and discrete manufacturing. ([p. 28, “Combustion in fossil fuel plants”, p. 30, “Variable Window TDNN”, p. 33, “Data Set”, Figure 1, Figure 3, Figure 4] Combustion in a fossil fueled generation plant is the process of controlling the combustible of a fuel and the oxygen of the air at such a rate as to produce useful heat energy. The principle combustible constituents are elemental carbon, hydrogen, and their compounds. In the combustion process, the compounds and the elements are burned to carbon dioxide and water vapor…. The main types of fossil fuels are; natural gas, oil, and coal. Each fuel type entering the furnace has a unique set of properties such as temperature, heat capacity, net heating value, and chemical composition., Due to the physical nature of the NOx and CO formation process, there is a certain time delay between the disturbance (system inputs) and the system response. As a natural result of this delay, a standard tapped delay line which windows the input data such that the last m samples of the input are used to predict the output does not fully utilize the available information and includes redundant data….For selection of the time delay values for the variable tapped delay line, we determine the cross-correlation between the inputs and the output. We center the delay line around the point at which maximum cross-correlation occurs, and adjust the window width such that it matches with the spread of the cross-correlation function., The time series data we have used in our experiments are obtained from a coal burning fossil fuel power plant. The furnace is a circulating fluidized bed (CFB) for production of electricity and district heating.
There are seven variables, measured during the system operation, which are sampled with a 1 Hz frequency. These variables are:…, wherein a TDNN is designed with variable delay line time steps determined according to an initial pair-wise cross-correlation analysis to determine the pertinent delays for the causal associations between different features/signals in a multi-variate time series such that this initial dependency analysis pre-structures the TDNN according to a co-occurrence of a signal at a given time and another signal at a later time and such that this system is applied to a process manufacturing system in which a plurality of components (different fuels, and other chemical inputs) are combined to generate/manufacture energy.)
It would have been prima facie obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to have modified Andre, Yanqing, Lendaris, and Conant to incorporate the teachings of Adali to implement the neural network design with pre-structuring according to pairwise temporally offset associations of signals to a point manufacturing system. The modification would have been obvious because one of ordinary skill would have been motivated to improve the accuracy in monitoring and prediction of anomalies in a manufacturing system by using a TDNN neural network that has been structured according to a dependency analysis to determine, before training, the significant time-delays between a pair of signals that are most important for the monitoring and prediction process (Adali, [Abstract, pp. 38-39, Figure 5, Figure 6]).
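For illustration only, the delay selection described in the cited Adali passage (centering the delay line at the lag of maximum cross-correlation between an input and the output) can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `best_delay` and the parameter `max_delay` are hypothetical, and the adjustment of the window width to the spread of the cross-correlation function is omitted.

```python
import numpy as np

def best_delay(u, y, max_delay):
    """Return the lag d in 0..max_delay that maximizes |cross-correlation| of u[t] and y[t + d].

    `u` is an input signal and `y` the output signal, both 1-D arrays of equal length.
    """
    u = u - u.mean()
    y = y - y.mean()
    n = len(u)
    scores = [abs(np.dot(u[: n - d], y[d:])) for d in range(max_delay + 1)]
    return int(np.argmax(scores))
```

When the output is a noisy copy of the input delayed by seven samples, the sketch returns a delay of seven, which would then serve as the center of the delay window.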

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
Shannon et al. (“Directed Extended Dependency Analysis for Data Mining”, Systems Science Faculty Publications and Presentations, Kybernetes, vol. 33, No. 5/6, 2004, pp. 1-7) teach the application of Extended Dependency Analysis for determining significant causal relations between variables using a mask analysis framework that evaluates causal dependences across time.

THIS ACTION IS MADE FINAL.  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action. 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to ROBERT LEWIS KULP whose telephone number is (571)272-7983. The examiner can normally be reached M, Th, F 8-5:30; Tu 8-3.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Miranda Huang, can be reached on 571-2709-7092. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/ROBERT LEWIS KULP/Examiner, Art Unit 2124            

/BRIAN M SMITH/Primary Examiner, Art Unit 2122