DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Response to Amendment
The amendment filed 2021-12-20 has been entered.  The status of the claims is as follows:
Claims 1, 3, 5-13, 15, and 17-20 remain pending in the application.
Claims 1, 3, 5-6, 9-13, 15, and 20 are amended.
Claims 2, 4, 14, and 16 are cancelled.
Response to Arguments
Applicant's arguments in response to rejections under 35 USC 103 have been fully considered but they are not persuasive. 
In Argument (A) on Remarks Pages 14-15, Applicant contends that “Li does not provide any inventive concept or idea of providing a technology for making an early detection of leaks from the old plant pipelines even in the noisy environments to thereby overcome the conventional technology of using the acoustic sensors”, adding the same for the combined arts Vespier and Romero: “Vespier is devoid of any discussion on providing a technology for making an early detection of leaks from the old plant pipelines even in the noisy environments to thereby overcome the conventional technology of using the acoustic sensors” and “Romero is devoid of any discussion on providing a technology for making an early detection of leaks from the old plant pipelines even in the noisy environments to thereby overcome the conventional technology of In re McLaughlin, 443 F.2d 1392, 170 USPQ 209 (CCPA 1971).
In Argument (B) on Remarks Page 15, in response to the previous Office Action stating: without performing such convolution operations. Thus, Vespier does not explicitly disclose extracting data for each trend interval so that sizes of the column vectors for each trend interval are the same. Accordingly, Vespier cannot properly be relied upon disclosing the above combination of features which is currently presented.”  Examiner respectfully points out that the language of the claim does not make any such restriction excluding a convolution operation, and thus Vespier still teaches the limitation as claimed.
In Argument (C) on Remarks Pages 15-16, in response to the previous Office Action stating:  “Since the teacher and student models are both learning, Romero discloses parallel learning”, Applicant argues that “Namely, the plurality of learners are configured to perform parallel learning for the multiple features extracted by the multi- feature extraction unit, in which all of the learners belong to student models. That is, in the embodiment of the present invention, parallel learning are not performed by the teacher and student models but by the student models…Thus, Romero does not explicitly disclose the multi-feature learning unit comprising a plurality of learners for performing parallel learning for the multiple features extracted by the multi-feature extraction unit.”  Examiner respectfully points out that the teacher and one student model learning is still “parallel learning”, and in combination with the multiple features of Li, Romero properly teaches the claimed limitation of “parallel learning for the multiple features”.  As for the amended matter “comprising a plurality of learners”, 
In Arguments related to Claims 3 and 15 on Remarks Pages 16-17, Applicant argues that “The Office action points out that performing the operations of Li several times, in order to train a machine learning model as suggested by Romero, would result in accumulating two-dimensional features, and any three-dimensional volume feature can be constructed by multiple two-dimensional ‘slices’.  However, Li in view of Romero does not explicitly disclose the three-dimensional volume feature of the ambiguity image is generated by accumulating two-dimensional features in a depth direction”.  Examiner reiterates that, while Li does not explicitly disclose a three-dimensional feature generated by accumulating two-dimensional features, repeatedly accumulating the two-dimensional features of Li to perform Romero’s training results in a three-dimensional feature:  the two-dimensional feature is accumulated along a third dimension, in a direction that may be called “depth”, as the three orthogonal directions are often called “length”, “width”, and “depth”.  Thus, this limitation is fairly suggested by the combination of Li in view of Romero.

Claim Interpretation
The following is a quotation of 35 U.S.C. 112(f):
(f) Element in Claim for a Combination. – An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof. 

The following is a quotation of pre-AIA  35 U.S.C. 112, sixth paragraph:
An element in a claim for a combination may be expressed as a means or step for performing a specified function without the recital of structure, material, or acts in support thereof, and such claim shall be construed to cover the corresponding structure, material, or acts described in the specification and equivalents thereof.

This application includes one or more claim limitations that, per MPEP 2181(I), meet Prong A (“The Claim Limitation Uses the Term "Means" or "Step" or a Generic Placeholder (A Term That Is Simply A Substitute for "Means")”) and Prong B (“The Term "Means" or "Step" or the Generic Placeholder Must Be Modified By Functional Language”) for interpretation under 112(f).  Such claim limitation(s) is/are: 
“multi-feature extraction unit for extracting” in Claim 1 and Claim 20
“transfer learning model generation unit for extracting” in Claim 1 and Claim 20
“multi-feature learning unit for receiving” in Claim 1 and Claim 20
“ambiguity feature extractor is configured to convert” in Claim 1 and Claim 20
“multi-trend correlation feature extractor for extracting” in Claim 1 and Claim 20
“means for updating the learning model” in Claim 9 and Claim 10
“multi-feature evaluation unit for finally evaluating” in Claim 11 and Claim 20
“multi-feature combination and optimization unit for repetitively performing” in Claim 12
Evaluating each of these, Examiner has determined the following regarding Prong C  (“The Term "Means" or "Step" or the Generic Placeholder Must Not Be Modified By Sufficient Structure, Material, or Acts for Achieving the Specified Function”):
“multi-feature extraction unit for extracting” in Claim 1 and Claim 20 fails Prong C, as the following acts for “extraction” are recited in the claim:  “the one-dimensional time series data stream for each sensor inputted from the plurality of sensors, wherein the multiple features comprise ambiguity image features that have been ambiguity-transformed from characteristics of the input data and multi-trend correlation image features extracted for each of multiple trend intervals according to a number of packet intervals constituting the data stream for each sensor”.  Thus, this limitation is not being interpreted under 112(f).
“transfer learning model generation unit for extracting” in Claim 1 and Claim 20 fails Prong C as the following acts for “extracting” are recited in the claim:  “wherein the learning model comprises a teacher model for extracting and forwarding information which has finished pre-learning, and a student model for receiving the extracted information”.  Thus, this limitation is not being interpreted under 112(f).
“multi-feature learning unit for receiving” in Claim 1 and Claim 20 fails Prong C, as the following acts for “receiving” are recited in the claim:  “performing parallel learning for the multiple features extracted by the multi-feature extraction unit, so as to calculate and output a loss”. Thus, this limitation is not being interpreted under 112(f).
“ambiguity feature extractor is configured to convert” in Claim 1 and Claim 20 fails Prong C, as the following acts for “convert” are recited in the claim:  “ambiguity transformation using the cross time-frequency spectral transformation and the 2D Fourier transformation”. Thus, this limitation is not being interpreted under 112(f).
“multi-trend correlation feature extractor for extracting” in Claim 1 and Claim 20 fails Prong C, as the following acts for “extracting” are recited in the claim:  “construct column vectors with data extracted during multiple trend intervals consisting of a short-term, a medium-term, and a long-term packet intervals in the data stream for each sensor”. Thus, this limitation is not being interpreted under 112(f).
“means for updating the learning model” in Claim 9 and Claim 10 passes Prong C, as the acts to perform this function are not recited.  Note that when a claim invokes 35 U.S.C. § 112(f) for a computer implemented means-plus-function claim, the specification must disclose the specific algorithm required to transform the general-purpose computing equipment into the required special purpose computer. See MPEP §2181(II)(B). In other words, because 35 U.S.C. §112(f) is invoked through the recitation of "means", the mere insertion of "a processor" does not provide the specific algorithm for updating the learning model.  Thus, this limitation is being interpreted under 112(f).  
“multi-feature evaluation unit for finally evaluating” in Claim 11 and Claim 20 fails Prong C, as the following acts for “evaluating” are recited in the claim:  “receiving results that have been learned from the multi-feature learning unit”. Thus, this limitation is not being interpreted under 112(f).
“multi-feature combination and optimization unit for repetitively performing” in Claim 12 fails Prong C, as the following acts for “repetitively performing” are recited in the claim:  “an optimal combination of the multiple features according to a loss is acquired based on the learning results inputted in the multi-feature evaluation unit.”  Thus, this limitation is not being interpreted under 112(f).
Regarding “means for updating the learning model” in Claim 9 and Claim 10, because this/these claim limitation(s) is/are being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, it/they is/are being interpreted to cover the corresponding structure described in the specification as performing the claimed function, and equivalents thereof.
If applicant does not intend to have this/these limitation(s) interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph, applicant may:  (1) amend the claim limitation(s) to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph (e.g., by reciting sufficient structure to perform the claimed function); or (2) present a sufficient showing that the claim limitation(s) recite(s) sufficient structure to perform the claimed function so as to avoid it/them being interpreted under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph.
Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 9 and 10 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA  35 U.S.C. 112, the applicant), regards as the invention.
Claim limitation “means for updating the learning model” in Claim 9 and Claim 10 invokes 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph. However, the written description fails to disclose the corresponding structure, material, or acts for performing the entire claimed function and to clearly link the structure, material, or acts to the function. While Instant Specification [0046] broadly describes updating the model under certain conditions, no details are given as to which algorithm(s) is/are used to update the model.  Therefore, the claim is indefinite and is rejected under 35 U.S.C. 112(b) or pre-AIA  35 U.S.C. 112, second paragraph.  Examiner suggests amending the claim to state “…processor is caused to further perform 

Applicant may:
(a)        Amend the claim so that the claim limitation will no longer be interpreted as a limitation under 35 U.S.C. 112(f) or pre-AIA  35 U.S.C. 112, sixth paragraph; 
(b)        Amend the written description of the specification such that it expressly recites what structure, material, or acts perform the entire claimed function, without introducing any new matter (35 U.S.C. 132(a)); or 
(c)        Amend the written description of the specification such that it clearly links the structure, material, or acts disclosed therein to the function recited in the claim, without introducing any new matter (35 U.S.C. 132(a)).
If applicant is of the opinion that the written description of the specification already implicitly or inherently discloses the corresponding structure, material, or acts and clearly links them to the function so that one of ordinary skill in the art would recognize what structure, material, or acts perform the claimed function, applicant should clarify the record by either: 
(a)        Amending the written description of the specification such that it expressly recites the corresponding structure, material, or acts for performing the claimed function and clearly links or associates the structure, material, or acts to the claimed function, without introducing any new matter (35 U.S.C. 132(a)); or 
(b)        Stating on the record what the corresponding structure, material, or acts, which are implicitly or inherently set forth in the written description of the specification, perform the claimed function. For more information, see 37 CFR 1.75(d) and MPEP §§ 608.01(o) and 2181.

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1, 3, 5-8, 11, 13, 15, 18, and 20 are rejected under 35 U.S.C. 103 as being unpatentable over Li et al. (“Leak location in gas pipelines using cross-time–frequency spectrum of leakage-induced acoustic vibrations”; hereinafter Li) in view of Vespier et al. (“Mining Characteristic Multi-Scale Motifs in Sensor-Based Time Series”; hereinafter Vespier), Romero et al. (“FitNets: Hints for Thin Deep Nets”; hereinafter Romero), and Cheng et al. (US 2009/0319457 A1; hereinafter Cheng).
As per Claim 1, Li teaches a multi-feature extraction unit for extracting multiple features from the one-dimensional time series data stream for each sensor inputted from the plurality of sensors, wherein the multiple features comprise ambiguity image features that have been ambiguity-transformed from characteristics of the input data (Li, Pg 3890 Section 2, discloses: “The gas-leak-induced acoustic signals are acquired at two spatially separate monitoring points using acoustic sensors or accelerometers mounted on either sides of a suspected leakage shown in Fig. 1”

    [Image: Li, Fig. 1]

Here, Li discloses a data stream (“acoustic signals”) for each sensor inputted from the plurality of sensors (“two spatially separate monitoring points using acoustic sensors”).  Li, Pg 3891 Section 3.1 Line 1, discloses a one-dimensional time series:  “Time–frequency representations (TFRs) characterize signals over a time and frequency plane by mapping a one-dimensional (1-D) signal of time into a two-dimensional (2-D) signal of time and frequency.”  Li, Pg 3891 Section 3.1 Lines 4-6, discloses “Time–frequency analysis (TFA) has received considerable attention as a powerful tool for analyzing time-varying nonstationary signals [29]. Nonstationary signal analysis is one of the main topics in the field of fault diagnosis. The TFA can identify the time-varying features and is an effective tool to extract fault information contained in nonstationary signals”.  Here, Li discloses a multi-feature extraction unit for extracting multiple features, as Li discloses, “identify the time-varying features”, wherein the word “features” is plural, and thus multiple features.  Furthermore, Li, top of Page 3893, discloses:  “It can be seen in Fig. 2 that there are basically six domains, all related to one another: the time (t), frequency (ω), temporal correlation (t,τ), spectral correlation (ω,θ), ambiguity (θ,τ), and time–frequency (t,ω) domains. Each domain possesses unique features and is suitable for representing certain types of signals.”  Thus, Li discloses wherein the multiple features comprise ambiguity features.  Li, Page 3894 Section 4 Para 2, discloses:  “The cross time–frequency distributions have been developed by extension of the Cohen class distributions and attracted increasing attention considering the ability to preserve phase difference between two signals as a function of time and frequency [39], [40], [41], [42]. For two signals denoted by x(t) and y(t), then according to Eq. (17), the CTFS can be obtained by 2-D FT of the product between the cross ambiguity function and the kernel function as”.  Here, Li discloses ambiguity image features that have been ambiguity-transformed (“2-D FT of the product between the cross ambiguity function and the kernel function”, wherein “FT” is “Fourier Transform” and “2-D” represents an “image”) from characteristics of the input data (“two signals denoted by x(t) and y(t)”)).
wherein the multi-feature extraction unit comprises an ambiguity feature extractor configured to convert characteristics in a form of sensor data from the one dimensional time series data stream transmitted from each of the sensors into an image feature through ambiguity transformation using the cross time-frequency spectral transformation and the 2D Fourier transformation (Li, Page 3894 Section 4 Para 2, discloses:  “The cross time–frequency distributions have been developed by extension of the Cohen class distributions and attracted increasing attention considering the ability to preserve phase difference between two signals as a function of time and frequency [39], [40], [41], [42]. For two signals denoted by x(t) and y(t), then according to Eq. (17), the CTFS can be obtained by 2-D FT of the product between the cross ambiguity function and the kernel function as”.  Here, Li discloses an ambiguity feature extractor configured to convert characteristics in a form of sensor data from the data stream transmitted from each of the sensors (“two signals denoted by x(t) and y(t)”) through ambiguity transformation (“cross ambiguity function”) using the cross time-frequency spectral transformation (“CTFS”) and the 2D Fourier transformation (“2-D FT”).  This produces an image feature as described in Li Pg. 3891 Section 3.1 (“Time–frequency representations (TFRs) characterize signals over a time and frequency plane by mapping a one-dimensional (1-D) signal of time into a two-dimensional (2-D) signal of time and frequency”), wherein the image feature is “a two-dimensional (2-D) signal of time and frequency”.  See Li Figure 7).

    [Image: Li, Figure 7]
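For illustration only, the chain relied on above (two 1-D sensor signals, a cross ambiguity function, multiplication by a kernel, and a 2-D Fourier transform yielding a 2-D time-frequency image) can be sketched in discrete form. This is a minimal sketch and not a reproduction of Li's Eq. (17): the circular lags, the FFT sign and scaling conventions, the unscaled Choi-Williams-style kernel, and the names cross_ambiguity, ctfs_image, and sigma are all assumptions introduced here.

```python
import numpy as np

def cross_ambiguity(x, y):
    """Discrete cross ambiguity function A[lag, doppler] of two equal-length signals.

    Assumption: lags are circular, i.e. x wraps around at its ends.
    """
    x = np.asarray(x, dtype=complex)
    y = np.asarray(y, dtype=complex)
    n = len(x)
    # local[lag, t] = x[(t + lag) mod n] * conj(y[t])
    local = np.array([np.roll(x, -lag) * np.conj(y) for lag in range(n)])
    # A Fourier transform over time t produces the Doppler (frequency-shift) axis.
    return np.fft.fft(local, axis=1)

def ctfs_image(x, y, sigma=None):
    """2-D time-frequency image, shape (frequency, time), from two 1-D signals.

    sigma=None uses a unit kernel; otherwise an exponential kernel
    exp(-(lag * doppler)**2 / sigma) is applied (hypothetical, unscaled).
    """
    n = len(x)
    amb = cross_ambiguity(x, y)
    if sigma is None:
        kernel = np.ones((n, n))
    else:
        lag = (np.fft.fftfreq(n) * n)[:, None]  # signed lag index per row
        dop = (np.fft.fftfreq(n) * n)[None, :]  # signed Doppler index per column
        kernel = np.exp(-((lag * dop) ** 2) / sigma)
    smoothed = amb * kernel
    # 2-D Fourier transform of the product: Doppler -> time, then lag -> frequency.
    back_to_time = np.fft.ifft(smoothed, axis=1)
    return np.fft.fft(back_to_time, axis=0)
```

With the unit kernel, the result reduces to a cross Rihaczek-type distribution, one member of the Cohen class; a different kernel choice would give a different member.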

	However, Li does not explicitly teach at least one processor; and a memory having instructions stored thereon, which, when executed by the at least one processor, cause the at least one processor to perform; multi-trend correlation features extracted for each of multiple trend intervals according to a number of packet intervals constituting the data stream for each sensor; a transfer-learning model generation unit for extracting useful multi-feature information from a learning model which has finished pre-learning for the multiple features and for forwarding the extracted multi-feature information to a multi-feature learning unit below, so as to generate a learning model that performs transfer learning for each of the multiple features, wherein the learning model comprises a teacher model for extracting and forwarding information which has finished pre-learning, and a student model for receiving the extracted information; the multi-feature learning unit comprising a plurality of learners for receiving learning variables from the learning model for each of the multiple features and for performing parallel learning for the multiple features extracted by the multi-feature extraction unit, so as to calculate and output a loss; a multi-trend correlation feature extractor is configured to construct column vectors with data extracted during multiple trend intervals consisting of a short-term, a medium-term, and a long-term packet intervals in the data stream for each sensor, and to extract data for each trend interval so that sizes of the column vectors for each trend interval are the same, so as to output the multi-trend correlation image features.
	Vespier teaches multi-trend correlation features extracted for each of multiple trend intervals according to a number of packet intervals constituting the data stream for each sensor (Vespier, Pg. 1 Intro, discloses:  “This paper is concerned with the discovery of temporal patterns in large time series produced from physical sensors. In all but the most trivial applications, such sensor data will reflect the complexity of the physical system under investigation and will show a combination of multiple effects. The systems we aim to investigate here often have two important characteristics: a) multiple phenomena are at play in the sensor signal and typically occur at different time scales, b) each phenomenon will involve recurring events that will show up in the signal as repeating segments of data, often deformed and warped. In this paper, we propose a method that elegantly combines these two characteristics in order to discover recurring events at multiple time scales.”  Here, Vespier discloses multi-trend correlation features extracted, as Vespier discloses “a combination of multiple effects”, which are correlation features as they are correlated with “the discovery of temporal patterns”.  These represent multiple trends (“multiple phenomena”) that occur at multiple trend intervals (“typically occur at different time scales”).  These are according to a number of packet intervals constituting the data stream for each sensor, as Vespier discloses “time series produced from physical sensors”, as each sensor sends a data stream, which consists of sending a piece of data, or a packet, at some given frequency, or interval, which may be called a packet interval.  Thus, each trend interval (“different time scales”) would then comprise a number of packet intervals.)
a multi-trend correlation feature extractor is configured to construct column vectors with data extracted during multiple trend intervals consisting of a short-term, a medium-term, and a long-term packet intervals in the data stream for each sensor, and to extract data for each trend interval so that sizes of the column vectors for each trend interval are the same, so as to output the multi-trend correlation image features.  (Vespier, Page 1 Intro, discloses:  “This paper is concerned with the discovery of temporal patterns in large time series produced from physical sensors. In all but the most trivial applications, such sensor data will reflect the complexity of the physical system under investigation and will show a combination of multiple effects. The systems we aim to investigate here often have two important characteristics: a) multiple phenomena are at play in the sensor signal and typically occur at different time scales, b) each phenomenon will involve recurring events that will show up in the signal as repeating segments of data, often deformed and warped. In this paper, we propose a method that elegantly combines these two characteristics in order to discover recurring events at multiple time scales.”  Here, Vespier discloses multi-trend correlation features extracted, as Vespier discloses “a combination of multiple effects”, which are correlation features as they are correlated with “the discovery of temporal patterns”.  These represent multiple trends (“multiple phenomena”) that occur at multiple trend intervals (“typically occur at different time scales”).  These are according to a number of packet intervals constituting the data stream for each sensor, as Vespier discloses “time series produced from physical sensors”, as each sensor sends a data stream, which consists of sending a piece of data, or a packet, at some given frequency, or interval, which may be called a packet interval.  
Thus, each trend interval (“different time scales”) would then comprise a number of packet intervals.  Note that Vespier’s “different time scales” only need to number 3 at minimum, in order to comprise “short-term”, “medium-term”, and “long-term” scales.  Vespier does not limit the number of time scales to 2, and thus suggests a short-term, a medium-term, and a long-term packet interval.
Vespier discloses in Page 2 Section 2.1:  “A time series of length n is an ordered sequence of values x = x[1], ..., x[n] of finite precision”.  Thus, here Vespier discloses that the data stream for each sensor is a vector. It could just as well be a row vector or a column vector, as a row vector is simply a transposed column vector.  Furthermore, Vespier discloses in Page 4 Section 3.1.1:  “Scale-space images [10] are a widely used scale parameterization technique for one-dimensional signals. We use them to characterize the contribution of the motifs at increasingly higher temporal scales while, at the same time, removing (smoothing out) the effect of the motifs at finer scales. Given a signal x, its scale-space image is the family of sigma-smoothed signals x over the scale parameter Phix defined as follows:

    [Equation image: definition of the scale-space image Phix(sigma) as the convolution of x with the Gaussian kernel gsigma]

where * is the operation of convolution, gsigma is a Gaussian kernel having standard deviation sigma, and Phix(sigma) = x.”  Vespier continues:  “We quantize the scale-space image across the scale dimension by considering a fixed set of scale parameters S and computing Phix(sigma) only for sigma ∈ S.”  Vespier concludes the section with:  “We deal with the multi-scale aspect of the data by identifying motifs in each of the scales in the scale-space image.”
	Here, Vespier discloses to extract data for each trend interval, as Vespier performs the operation of extracting the scale-space image (extract data) and does this for each trend interval (“characterize the contribution of the motifs at increasingly higher temporal scales” … “quantize the scale-space image across the scale dimension by considering a fixed set of scale parameters S and computing Phix(sigma) only for sigma ∈ S”).  Thus, Vespier is calculating the scale-space image to represent different time scales, or intervals.  Also note that Vespier discloses that the operation to produce this scale-space image is a convolution operation.  One of ordinary skill in the art will appreciate that applying a convolution does not change the dimensions of the operand on the left side of the operation.  In other words, in A*B, where A is 3x3 and B is 2x2, the output is 3x3.  Recall from above that the data stream for each sensor is a vector (“A time series of length n”).  We can consider this an nx1 column vector.  As explained above, applying a convolution to represent different intervals will also result in an nx1 vector.  Thus, Vespier discloses construct column vectors with data extracted and sizes of the column vectors for each trend interval are the same, so as to output the multi-trend correlation features.  For reference, another paper by Vespier et al. (“MDL-Based Analysis of Time Series at Multiple Time-Scales”) gives more detail on scale-space images, describing them as similar to the concepts of low pass, high pass, and band pass filtering, which are recited in the Instant Specification [0031].)
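For illustration only, the equal-size reasoning above can be sketched with a Gaussian scale-space computed at three scales. This is a minimal sketch under stated assumptions, not Vespier's implementation: the function names, the 3-sigma kernel truncation, and the example scale values are introduced here. The equal-length result relies on the "same" boundary convention of numpy.convolve together with a kernel no longer than the series; under the "full" convention the output length would be n + k - 1 for a kernel of length k.

```python
import numpy as np

def gaussian_kernel(sigma):
    """Normalized Gaussian kernel truncated at 3 sigma (truncation is an assumption)."""
    radius = int(np.ceil(3 * sigma))
    t = np.arange(-radius, radius + 1)
    g = np.exp(-(t ** 2) / (2.0 * sigma ** 2))
    return g / g.sum()

def scale_space_columns(x, sigmas):
    """Return an (n, len(sigmas)) array; column j is x smoothed at scale sigmas[j].

    Every column has the same length n as the input series.
    """
    x = np.asarray(x, dtype=float)
    cols = []
    for s in sigmas:
        g = gaussian_kernel(s)
        if len(g) > len(x):
            # numpy's "same" mode returns max(len(x), len(g)) samples,
            # so a longer kernel would change the column size.
            raise ValueError("kernel longer than the series")
        cols.append(np.convolve(x, g, mode="same"))
    return np.column_stack(cols)

# Hypothetical short-, medium-, and long-term scales for a series of length 200.
series = np.sin(np.linspace(0.0, 20.0, 200))
columns = scale_space_columns(series, [1.0, 4.0, 16.0])
print(columns.shape)
```

Stacking the equal-length columns side by side gives a 2-D array, which is the sense in which the per-scale vectors could form an image-like feature.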
	Li and Vespier are analogous art because they are both in the field of endeavor of condition monitoring.
	Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to combine time series analysis to find gas leaks of Li with the multiple time scale analysis of Vespier.  One would be motivated to do so in order to avoid missing potentially dangerous or costly trends in different or overlapping time scales (Vespier, Page 1 Intro: “As a motivating example, we consider InfraWatch [1, 4, 9], a Structural Health Monitoring (SHM) project involving huge quantities of sensor data collected at a major Dutch highway bridge. Such data fits our topic well, as it is subject to a number of effects that show both recurring events (traffic, daily temperature cycles) and largely varying time scales. Figure 1 shows 12 days of strain measurements collected at this bridge (some 10 million readings) and the recurring events present in it: 1) individual peaks due to passing vehicles lasting a few seconds (top) and 2) recurring patterns due to changes of the external temperature (bottom). Note that different effects appear in a mixed fashion, and events at different time scales overlap.”)
	However, the combination of Li and Vespier thus far fails to teach at least one processor; and a memory having instructions stored thereon, which, when executed by the at least one processor, cause the at least one processor to perform; a transfer-learning model generation unit for extracting useful multi-feature information from a learning model which has finished pre-learning for the multiple features and for forwarding the extracted multi-feature information to a multi-feature learning unit below, so as to generate a learning model that performs transfer learning for each of the multiple features, wherein the learning model comprises a teacher model for extracting and forwarding information which has finished pre-learning, and a student model for receiving the extracted information; and a multi-feature learning unit comprising a plurality of learners for receiving learning variables from the learning model for each of the multiple features and for performing parallel learning for the multiple features extracted by the multi-feature extraction unit, so as to calculate and output a loss.
	Romero teaches at least one processor; and a memory having instructions stored thereon, which, when executed by the at least one processor, cause the at least one processor to perform (Romero, Page 9 in Acknowledgments, discloses:  “We thank the developers of Theano (Bastien et al., 2012) and Pylearn2 (Goodfellow et al., 2013a) and the computational resources provided by Compute Canada and Calcul Québec”.  Here, “computational resources” implies the use of a processor with memory and instructions stored thereon.)
a transfer-learning model generation unit for extracting useful multi-feature information from a learning model which has finished pre-learning for the multiple features and for forwarding the extracted multi-feature information to a multi-feature learning unit, so as to generate a learning model that performs transfer learning for each of the multiple features, wherein the learning model comprises a teacher model for extracting and forwarding information which has finished pre-learning, and a student model for receiving the extracted information (Recall that Li above disclosed multiple features.  Romero, Page 2 Section 2.1, discloses:  “In order to obtain a faster inference, we explore the recently proposed compression framework (Hinton & Dean, 2014), which trains a student network, from the softened output of an ensemble of wider networks, teacher network.”  Romero, Page 4 Section 2.3, continues:  “We train the FitNet in a stage-wise fashion following the student/teacher paradigm. Figure 1 summarizes the training pipeline. Starting from a trained teacher network and a randomly initialized FitNet”.  Here, Romero discloses a learning model comprising a teacher model which has finished pre-learning (“trained teacher network”) so as to generate a learning model (“student network”).  Romero discloses extracting useful information from the teacher network (“the softened output of an ensemble of wider networks, teacher network”) and for forwarding the extracted information to a student model (“trains a student network, from the softened output of an ensemble of wider networks, teacher network”).  This results in a transfer-learning model generation unit that produces a learning model that performs transfer learning, as the training knowledge is transferred from the teacher network to the student network.)
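For illustration only, the “softened output” of a trained teacher that is used to train a student can be sketched as a temperature-scaled softmax combined with the student's ordinary classification loss. This is a minimal sketch of temperature-softened distillation in its general form, not Romero's code: the function names and the values of tau (temperature) and lam (weight of the soft term) are assumptions introduced here.

```python
import numpy as np

def softened_softmax(logits, tau):
    """Softmax of logits / tau; a larger tau gives a flatter (softer) distribution."""
    z = np.asarray(logits, dtype=float) / tau
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row maximum for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, tau=3.0, lam=0.5):
    """Cross-entropy with the true labels plus lam times the cross-entropy
    between the softened teacher output and the softened student output."""
    p_student = softened_softmax(student_logits, 1.0)
    p_student_soft = softened_softmax(student_logits, tau)
    p_teacher_soft = softened_softmax(teacher_logits, tau)
    rows = np.arange(len(labels))
    hard = -np.log(p_student[rows, labels]).mean()
    soft = -(p_teacher_soft * np.log(p_student_soft)).sum(axis=-1).mean()
    return hard + lam * soft
```

In this form the teacher is fixed (already trained) and only supplies targets; the information transferred to the student is the softened class distribution.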
a multi-feature learning unit for receiving learning variables from the learning model for each of the multiple features and for performing parallel learning for the multiple features extracted by the multi-feature extraction unit, so as to calculate and output a loss (Recall that Li above disclosed multiple features and a multi-feature extraction unit.  Romero, Page 2 Section 2.1, discloses:  “In order to obtain a faster inference, we explore the recently proposed compression framework (Hinton & Dean, 2014), which trains a student network, from the softened output of an ensemble of wider networks, teacher network.”  Romero, Page 3 Section 2.2, discloses:  “Given that the teacher network will usually be wider than the FitNet, the selected hint layer may have more outputs than the guided layer. For that reason, we add a regressor to the guided layer, whose output matches the size of the hint layer. Then, we train the FitNet parameters from the first layer up to the guided layer as well as the regressor parameters by minimizing the following loss function”.  Here, Romero discloses the learning unit (“student network”, also known as the “FitNet”) for receiving learning variables from the learning model (the student network receives learning variables from the teacher network:  “which trains a student network, from the softened output of an ensemble of wider networks, teacher network”).  The learning unit performs learning, so as to calculate and output a loss (“we train the FitNet parameters from the first layer up to the guided layer as well as the regressor parameters by minimizing the following loss function”).  Since the teacher and student models are both learning, Romero discloses parallel learning).
Li, Vespier, and Romero are analogous art because Romero is reasonably pertinent to the problem faced by Li and Vespier, as machine learning may be applied to a complex process such as gas leak detection.  See MPEP 2141.01(a) “Analogous and Nonanalogous Art” Section I last sentence:  “Rather, a reference is analogous art to the claimed invention if: (1) the reference is from the same field of endeavor as the claimed invention (even if it addresses a different problem); or (2) the reference is reasonably pertinent to the problem faced by the inventor (even if it is not in the same field of endeavor as the claimed invention)”.
	Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to combine the gas leak detection of Li and Vespier with the teacher-student learning of Romero.  The combination would result in using machine learning to detect gas leaks, which would save resources by not having to rely on employees to enter hard coded faults, and not suffering losses from faults that were not included in the rule set.  One would be motivated to use Romero’s teacher-student learning to save time and resources by performing the inference phase of machine learning on a smaller model (Romero, Abstract: “While depth tends to improve network performances, it also makes gradient-based training more difficult since deeper networks tend to be more non-linear. The recently proposed knowledge distillation approach is aimed at obtaining small and fast-to-execute models, and it has shown that a student network could imitate the soft output of a larger teacher network or ensemble of networks… This allows one to train deeper students that can generalize better or run faster, a trade-off that is controlled by the chosen student capacity. For example, on CIFAR-10, a deep student network with almost 10.4 times less parameters outperforms a larger, state-of-the-art teacher network.”)
	The combination of Li, Vespier, and Romero would result in the claimed preamble: A machine learning apparatus (Romero) based on multi-feature extraction (Li) and transfer learning (Romero) from one-dimensional time series data streams having time delay transmitted from a plurality of sensors (Li).  Li discloses time delay on Pg 3891 between Eq 1 and 2:  “D is the time delay equivalent to time difference of arrival (TDOA) between the two collected acoustic signals.”
	However, the combination of Li, Vespier, and Romero fails to teach a multi-feature learning unit comprising a plurality of learners.
Cheng teaches a multi-feature learning unit comprising a plurality of learners (Cheng, Para [0035], discloses “Multiple classifiers are constructed based on the same dataset with different features. Each classifier has sufficient discriminative power based on the selected features, although it could still make mistakes on the partially covered instances. However, since the feature sets are disjoint, each classifier tends to make uncorrelated errors which can be eliminated by averaging. The outputs of multiple classifiers are combined by a cascaded feature ensemble.”  Here, Cheng discloses that the models are configured in the same number as the multiple features, as each model is based on a different feature (“Multiple classifiers are constructed based on the same dataset with different features”).  These “multiple classifiers” are a plurality of learners.)

    [Image: media_image4.png]

	Cheng and the combination of Li, Vespier, and Romero are analogous art because they are all in the field of endeavor of machine learning.
	Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to combine the student-teacher learning of the combination of Li, Vespier, and Romero, with the feature-specific models of Cheng.  One would be motivated to do so in order to increase accuracy of learning the combination of the different features, as uncorrelated errors can be averaged out (Cheng, [0035]: “Multiple classifiers are constructed based on the same dataset with different features. Each classifier has sufficient discriminative power based on the selected features, although it could still make mistakes on the partially covered instances. However, since the feature sets are disjoint, each classifier tends to make uncorrelated errors which can be eliminated by averaging. The outputs of multiple classifiers are combined by a cascaded feature ensemble.”)
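For illustration only, the averaging rationale relied upon from Cheng, Para [0035], may be sketched as follows.  This is a minimal sketch and not Cheng's implementation: plain averaging stands in for Cheng's “cascaded feature ensemble”, and the function names, the scores, and the 0.5 threshold are hypothetical assumptions.

```python
from statistics import mean

def ensemble_average(scores):
    """Average the scores produced by several classifiers for one instance."""
    return mean(scores)

def classify(scores, threshold=0.5):
    """Label the instance positive when the averaged score reaches the threshold."""
    return ensemble_average(scores) >= threshold

# Hypothetical scores from three classifiers, each assumed to be trained on a
# different (disjoint) feature set of the same dataset. The third one errs.
scores = [0.9, 0.8, 0.3]

# Votes of the individual classifiers taken alone.
single_votes = [s >= 0.5 for s in scores]

# Decision after averaging: the isolated error is outvoted.
combined = classify(scores)
```

In this sketch the single erroneous score does not change the combined decision, which is the sense in which uncorrelated errors “can be eliminated by averaging”.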

	As per Claim 3, the combination of Li, Vespier, Romero, and Cheng teaches the apparatus of Claim 1.  Li teaches ambiguity features comprise a two-dimensional image.  (Li, Page 3894 Section 4 Para 2, discloses:  “The cross time–frequency distributions have been developed by extension of the Cohen class distributions and attracted increasing attention considering the ability to preserve phase difference between two signals as a function of time and frequency [39], [40], [41], [42]. For two signals denoted by x(t) and y(t), then according to Eq. (17), the CTFS can be obtained by 2-D FT of the product between the cross ambiguity function and the kernel function as”).
	However, Li does not explicitly teach wherein the ambiguity features comprise a three-dimensional volume feature generated by accumulating two-dimensional features in a depth direction.
	Romero, Page 4 Section 2.3, discloses: “We train the FitNet in a stage-wise fashion following the student/teacher paradigm. Figure 1 summarizes the training pipeline. Starting from a trained teacher network and a randomly initialized FitNet (Fig. 1 (a)), we add a regressor parameterized by W_r on top of the FitNet guided layer and train the FitNet parameters W_Guided up to the guided layer to minimize Eq. (3) (see Fig. 1 (b)). Finally, from the pre-trained parameters, we train the parameters of whole FitNet W_S to minimize Eq. (2) (see Fig. 1 (c)). Algorithm 1 details the FitNet training process.” Here, Romero discloses an iterative process in which parameters are trained.  This is expected, as machine learning is achieved by performing operations on several examples in a training set to minimize a loss.
	Performing the operations of Li several times, in order to train a machine learning model as suggested by Romero, would result in accumulating two-dimensional features (the image feature of Li).  One of ordinary skill in mathematics will appreciate that any three-dimensional volume feature can be constructed from multiple two-dimensional “slices”.  It is inherent that accumulating two-dimensional features and stacking them results in a three-dimensional volume feature.  The first two dimensions are, conventionally, called width and height. The third dimension is conventionally called depth, and thus the combination of Li and Romero results in the claimed limitation that ambiguity features comprise a three-dimensional volume feature generated by accumulating two-dimensional features in a depth direction.
	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Romero with the combination of Li, Vespier, and Cheng, for at least the reasons recited in Claim 1.
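For illustration only, the stacking of two-dimensional features along a depth direction discussed in the rejection of Claim 3 may be sketched as follows.  This is a minimal sketch under stated assumptions: the helper names `stack_depth` and `shape` and the sample values are hypothetical and do not appear in Li or Romero.

```python
def stack_depth(slices):
    """Stack equally sized 2-D features into a volume indexed [depth][row][col]."""
    if not slices:
        raise ValueError("need at least one 2-D feature")
    height, width = len(slices[0]), len(slices[0][0])
    for s in slices:
        if len(s) != height or any(len(row) != width for row in s):
            raise ValueError("all 2-D features must share the same shape")
    # Copy each row so the volume does not alias the input features.
    return [[list(row) for row in s] for s in slices]

def shape(volume):
    """Return (depth, height, width) of a stacked volume."""
    return (len(volume), len(volume[0]), len(volume[0][0]))

# Two hypothetical 2 x 3 two-dimensional features (e.g. image-like features).
feature_a = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
feature_b = [[0.7, 0.8, 0.9], [1.0, 1.1, 1.2]]

# Accumulating them in the depth direction yields a 2 x 2 x 3 volume.
volume = stack_depth([feature_a, feature_b])
```

Each accumulated two-dimensional feature becomes one “slice” of the volume, with the slice index serving as the depth dimension.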

As per Claim 5, the combination of Li, Vespier, Romero, and Cheng teaches the apparatus of Claim 1 as well as the teacher model and the student model (see the rejection of Claim 1). Cheng teaches wherein the [student] model is configured in the same number as the multiple features [and useful information of the teacher model that has finished pre-learning is forwarded to a plurality of student models.] (Recall above that Romero teaches teacher and student models.  Cheng, Para [0035], discloses “Multiple classifiers are constructed based on the same dataset with different features. Each classifier has sufficient discriminative power based on the selected features, although it could still make mistakes on the partially covered instances. However, since the feature sets are disjoint, each classifier tends to make uncorrelated errors which can be eliminated by averaging. The outputs of multiple classifiers are combined by a cascaded feature ensemble.”  Here, Cheng discloses that the models are configured in the same number as the multiple features, as each model is based on a different feature (“Multiple classifiers are constructed based on the same dataset with different features”).

    [Image: media_image4.png]

When combined with Romero’s student and teacher models, this would result in useful information of the teacher model that has finished pre-learning is forwarded to a plurality of student models.)
	It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Cheng with the combination of Li, Vespier, and Romero, for at least the reasons recited in Claim 1.

	As per Claim 6, the combination of Li, Vespier, Romero, and Cheng teaches the apparatus of Claim 1.  Romero teaches wherein the student model is configured as a single common model, and the useful information of the teacher model that has finished pre-learning is forwarded to the single common student model so as to be learned.  (Romero, Page 2 Section 2.1, discloses:  “In order to obtain a faster inference, we explore the recently proposed compression framework (Hinton & Dean, 2014), which trains a student network, from the softened output of an ensemble of wider networks, teacher network.”  Romero, Page 4 Section 2.3, continues:  “We train the FitNet in a stage-wise fashion following the student/teacher paradigm. Figure 1 summarizes the training pipeline. Starting from a trained teacher network and a randomly initialized FitNet”.  Here, Romero discloses teacher model (“trained teacher network”) for extracting and forwarding information (“from the softened output of an ensemble of wider networks, teacher network”) which has finished pre-learning (“trained teacher network”) and a student model (“student network”) for receiving the extracted information, and the useful information of the teacher model that has finished pre-learning is forwarded to the single common student model so as to be learned (“trains a student network, from the softened output of an ensemble of wider networks, teacher network”) and for forwarding the extracted information to a learning unit below (“trains a student network, from the softened output of an ensemble of wider networks, teacher network”).  This results in a transfer-learning model generation unit that produces a learning model that performs transfer learning, as the training knowledge is transferred from the teacher network to the student network.  Note that Romero only discloses a single “student network”, and thus discloses wherein the student model is configured as a single common model).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Romero with the combination of Li, Vespier, and Cheng, for at least the reasons recited in Claim 1.

As per Claim 7, the combination of Li, Vespier, Romero, and Cheng teaches the apparatus of Claim 5.  Romero teaches wherein the useful information extracted from the teacher model is a single piece of hint information corresponding to an output of feature maps comprising learning variable information from a learning data input to any layer (Romero, Page 3 Section 2.2, discloses: “In order to help the training of deep FitNets (deeper than their teacher), we introduce hints from the teacher network. A hint is defined as the output of a teacher’s hidden layer responsible for guiding the student’s learning process. Analogously, we choose a hidden layer of the FitNet, the guided layer, to learn from the teacher’s hint layer. We want the guided layer to be able to predict the output of the hint layer. Note that having hints is a form of regularization and thus, the pair hint/guided layer has to be chosen such that the student network is not over-regularized. The deeper we set the guided layer, the less flexibility we give to the network and, therefore, FitNets are more likely to suffer from over-regularization. In our case, we choose the hint to be the middle layer of the teacher network. Similarly, we choose the guided layer to be the middle layer of the student network.”  Here, Romero discloses useful information extracted from the teacher model is a single piece of hint information (“we introduce hints from the teacher network. A hint is defined as the output of a teacher’s hidden layer responsible for guiding the student’s learning process”) corresponding to an output of feature maps comprising learning variable information from a learning data input (“the output of a teacher’s hidden layer”).  Note that a hidden layer is a feature map, as it comprises a weighted combination of features that were input variables to a neural network, and thus the feature map comprises learning variable information from a learning data input.  
The hint is input to any layer (“Analogously, we choose a hidden layer of the FitNet, the guided layer, to learn from the teacher’s hint layer”), and there are various choices for the layer:  “The deeper we set the guided layer, the less flexibility we give to the network and, therefore, FitNets are more likely to suffer from over-regularization. In our case, we choose the hint to be the middle layer of the teacher network.”)
	wherein forwarding of this single piece of hint information is performed such that a loss function for the Euclidean distance between an output result of feature maps at a layer selected from the teacher model and an output result of feature maps at a layer selected from the student model is minimized (Romero above discloses a hint between an output result of feature maps at a layer selected from the teacher model and an output result of feature maps at a layer selected from the student model (“the output of a teacher’s hidden layer”) to (“hidden layer of the FitNet, the guided layer, to learn from the teacher’s hint layer”).  Romero, Page 3 Section 2.2 Para 2, further discloses:  “Given that the teacher network will usually be wider than the FitNet, the selected hint layer may have more outputs than the guided layer. For that reason, we add a regressor to the guided layer, whose output matches the size of the hint layer. Then, we train the FitNet parameters from the first layer up to the guided layer as well as the regressor parameters by minimizing the following loss function:

    $$\mathcal{L}_{HT}(W_{Guided}, W_r) = \frac{1}{2} \left\| u_h(x; W_{Hint}) - r(v_g(x; W_{Guided}); W_r) \right\|^2$$

where u_h and v_g are the teacher/student deep nested functions up to their respective hint/guided layers with parameters W_Hint and W_Guided, r is the regressor function on top of the guided layer with parameters W_r.”  Here, Romero discloses a loss function is minimized (“minimizing the following loss function”).  Also, note that Romero also discloses that a loss function is for the Euclidean distance, as Romero uses the notation ||x|| which indicates Euclidean distance (shortened from ||x||_2 as the 2 subscript is often omitted for the Euclidean norm, as the Euclidean (or L2) norm is the most commonly used norm).  Here, Romero discloses one half of the Euclidean distance, squared.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Romero with the combination of Li, Vespier, and Cheng, for at least the reasons recited in Claim 1.
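For illustration only, the hint loss relied upon from Romero, Page 3 Section 2.2 (one half of the squared Euclidean distance between the teacher's hint-layer output and the regressed output of the student's guided layer), may be sketched as follows.  This is a minimal sketch under stated assumptions: the regressor is modeled as a plain linear map, and the function names and numeric values are hypothetical and are not taken from Romero.

```python
def make_linear_regressor(weights):
    """Build r(.), assumed linear here: one weight row per hint-layer output."""
    def regressor(guided_output):
        return [sum(w * x for w, x in zip(row, guided_output)) for row in weights]
    return regressor

def hint_loss(teacher_hint, student_guided, regressor):
    """Return 0.5 * ||u_h - r(v_g)||^2 for a single example."""
    predicted = regressor(student_guided)
    if len(predicted) != len(teacher_hint):
        raise ValueError("regressor output must match the hint layer size")
    return 0.5 * sum((t - p) ** 2 for t, p in zip(teacher_hint, predicted))

# Hypothetical outputs: the (wider) teacher hint layer has 3 outputs,
# the (thinner) student guided layer has 2 outputs.
teacher_hint = [1.0, 2.0, 3.0]
student_guided = [1.0, 1.0]

# The regressor maps the 2 guided outputs to the size of the hint layer.
regressor = make_linear_regressor([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

loss = hint_loss(teacher_hint, student_guided, regressor)
```

Training under this sketch would adjust the student parameters and the regressor weights so that `loss` decreases, which corresponds to minimizing the Euclidean-distance loss recited in the claim.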

	As per Claim 8, the combination of Li, Vespier, Romero, and Cheng teaches the apparatus of Claim 6.  Romero teaches wherein the useful information extracted from the teacher model is a single piece of hint information corresponding to an output of feature maps comprising learning variable information from a learning data input to any layer (Romero, Page 3 Section 2.2, discloses: “In order to help the training of deep FitNets (deeper than their teacher), we introduce hints from the teacher network. A hint is defined as the output of a teacher’s hidden layer responsible for guiding the student’s learning process. Analogously, we choose a hidden layer of the FitNet, the guided layer, to learn from the teacher’s hint layer. We want the guided layer to be able to predict the output of the hint layer. Note that having hints is a form of regularization and thus, the pair hint/guided layer has to be chosen such that the student network is not over-regularized. The deeper we set the guided layer, the less flexibility we give to the network and, therefore, FitNets are more likely to suffer from over-regularization. In our case, we choose the hint to be the middle layer of the teacher network. Similarly, we choose the guided layer to be the middle layer of the student network.”  Here, Romero discloses useful information extracted from the teacher model is a single piece of hint information (“we introduce hints from the teacher network. A hint is defined as the output of a teacher’s hidden layer responsible for guiding the student’s learning process”) corresponding to an output of feature maps comprising learning variable information from a learning data input (“the output of a teacher’s hidden layer”).  Note that a hidden layer is a feature map, as it comprises a weighted combination of features that were input variables to a neural network, and thus the feature map comprises learning variable information from a learning data input.  
The hint is input to any layer (“Analogously, we choose a hidden layer of the FitNet, the guided layer, to learn from the teacher’s hint layer”), and there are various choices for the layer:  “The deeper we set the guided layer, the less flexibility we give to the network and, therefore, FitNets are more likely to suffer from over-regularization. In our case, we choose the hint to be the middle layer of the teacher network.”)
	wherein forwarding of this single piece of hint information is performed such that a loss function for the Euclidean distance between an output result of feature maps at a layer selected from the teacher model and an output result of feature maps at a layer selected from the student model is minimized (Romero above discloses a hint between an output result of feature maps at a layer selected from the teacher model and an output result of feature maps at a layer selected from the student model (“the output of a teacher’s hidden layer”) to (“hidden layer of the FitNet, the guided layer, to learn from the teacher’s hint layer”).  Romero, Page 3 Section 2.2 Para 2, further discloses:  “Given that the teacher network will usually be wider than the FitNet, the selected hint layer may have more outputs than the guided layer. For that reason, we add a regressor to the guided layer, whose output matches the size of the hint layer. Then, we train the FitNet parameters from the first layer up to the guided layer as well as the regressor parameters by minimizing the following loss function:

    $$\mathcal{L}_{HT}(W_{Guided}, W_r) = \frac{1}{2} \left\| u_h(x; W_{Hint}) - r(v_g(x; W_{Guided}); W_r) \right\|^2$$

where u_h and v_g are the teacher/student deep nested functions up to their respective hint/guided layers with parameters W_Hint and W_Guided, r is the regressor function on top of the guided layer with parameters W_r.”  Here, Romero discloses a loss function is minimized (“minimizing the following loss function”).  Also, note that Romero also discloses that a loss function is for the Euclidean distance, as Romero uses the notation ||x|| which indicates Euclidean distance (shortened from ||x||_2 as the 2 subscript is often omitted for the Euclidean norm, as the Euclidean (or L2) norm is the most commonly used norm).  Here, Romero discloses one half of the Euclidean distance, squared.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Romero with the combination of Li, Vespier, and Cheng, for at least the reasons recited in Claim 1.

	As per Claim 11, the combination of Li, Vespier, Romero, and Cheng teaches the apparatus of Claim 1.  Li teaches wherein the at least one processor is caused to further perform a multi-feature evaluation unit (Li, Pg 3891 Section 3.1 Lines 4-6, discloses “Time–frequency analysis (TFA) has received considerable attention as a powerful tool for analyzing time-varying nonstationary signals [29]. Nonstationary signal analysis is one of the main topics in the field of fault diagnosis. The TFA can identify the time-varying features and is an effective tool to extract fault information contained in nonstationary signals.” Here, Li discloses a multi-feature evaluation unit, as Li discloses, “identify the time-varying features”, wherein the word “features” is plural, and thus multiple features, and this is used to evaluate (“analysis”)).
However, Li does not explicitly teach finally evaluating learning results by receiving results that have been learned from the multi-feature learning.
Romero teaches finally evaluating learning results by receiving results that have been learned from the multi-feature learning (Recall that Li discloses multiple features.  Romero, Page 2 Section 2.1, discloses:  “In order to obtain a faster inference, we explore the recently proposed compression framework (Hinton & Dean, 2014), which trains a student network, from the softened output of an ensemble of wider networks, teacher network.”  Romero, Page 3 Section 2.2, discloses:  “Given that the teacher network will usually be wider than the FitNet, the selected hint layer may have more outputs than the guided layer. For that reason, we add a regressor to the guided layer, whose output matches the size of the hint layer. Then, we train the FitNet parameters from the first layer up to the guided layer as well as the regressor parameters by minimizing the following loss function”.  Here, Romero discloses the learning unit (“student network”, also known as the “FitNet”).  The learning unit performs learning, so as to calculate and output a loss (“we train the FitNet parameters from the first layer up to the guided layer as well as the regressor parameters by minimizing the following loss function”).  Finally, the learning model performs an “inference”, or a result, and thus Romero discloses finally evaluating learning results by receiving results that have been learned from the multi-feature learning.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Romero with the combination of Li, Vespier, and Cheng, for at least the reasons recited in Claim 1.

	As per Claim 13, Claim 13 is a method claim corresponding to apparatus Claim 1.  Claim 13 is rejected for the same reasons as Claim 1.

	As per Claim 15, Claim 15 is a method claim corresponding to apparatus Claim 3.  Claim 15 is rejected for the same reasons as Claim 3.

As per Claim 18, Claim 18 is a method claim corresponding to apparatus Claim 11.  Claim 18 is rejected for the same reasons as Claim 11.

As per Claim 20, Claim 20 is an apparatus claim that is nearly identical to Claim 11, except that it explicitly recites “An apparatus for detecting fine leaks” and “evaluating whether there is a fine leak”.  The combination of Li, Vespier, Romero, and Cheng teaches the apparatus of Claim 11.  Li discloses detecting leaks in the final sentence of Pg. 3901 Conclusion:  “The results demonstrate that the CTFS-based location method is more feasible for improving the leak detection in gas pipelines using the frequency-varying acoustic speed of real-time determination instead of constant speed”.  There is no indication in Li that this system is incapable of detecting “fine” leaks, and thus the system of detecting “leaks” in general can reasonably be interpreted to detect “fine leaks”.

Claims 9-10 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Li, Vespier, Romero, and Cheng in view of Chaoji et al. (US 10,380,498 B1; hereinafter Chaoji).
As per Claim 9, the combination of Li, Vespier, Romero, and Cheng teaches the apparatus of Claim 1.  Romero teaches a learning model generated in the transfer-learning model generation unit.  (Romero, Page 4 Section 2.3, discloses:  “We train the FitNet in a stage-wise fashion following the student/teacher paradigm. Figure 1 summarizes the training pipeline. Starting from a trained teacher network and a randomly initialized FitNet”.  Here, Romero discloses a learning model which has finished pre-learning (“trained teacher network”), so as to generate a learning model (“student network”).  Romero discloses extracting useful information from the teacher network (“the softened output of an ensemble of wider networks, teacher network”) and for forwarding the extracted information to a learning unit below (“trains a student network, from the softened output of an ensemble of wider networks, teacher network”).  This results in a learning model generated in the transfer-learning model generation unit, as the training knowledge is transferred from the teacher network to the student network.)
However, Romero does not teach wherein the at least one processor is caused to further perform a means for updating the learning model generated in the transfer-learning model generation unit.
Chaoji teaches wherein the at least one processor is caused to further perform a means for updating the learning model.  (Chaoji, Top of Col 7, discloses:  “The Model Factory 102 may also constantly monitor the performance of the deployed ML model 114 over time. For instance, the Model Factory 102 may receive all or part of the model input data 116 and monitor the input data 116 for changes in the data over a predetermined period of time. Additionally, the Model Factory 102 may receive all or part of the model output data 118 and similarly evaluate the data for changes over a predetermined period of time. If the Model Factory 102 determines that the model performance has deteriorated by monitoring a significant change in the distribution of input and/or output data over the predetermined period of time, the Model Factory 102 may automatically initiate retraining of the deployed ML model 114. The Model Factory 102 may then retrain and redeploy the retrained ML model.”  Here, Chaoji discloses a means “Model Factory 102” for updating the learning model (“may then retrain and redeploy the retrained ML model”).)
Chaoji and the combination of Li, Vespier, Romero, and Cheng are analogous art because they are all in the field of endeavor of machine learning.
	Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to combine the student-teacher learning of the combination of Li, Vespier, Romero, and Cheng, with the model updating of Chaoji.  One would be motivated to do so in order to avoid deterioration of results as input to the model changes (Chaoji, Background: “Furthermore, manual interaction with the ML model is typically required throughout the life of the ML model due to changes in input data. For instance, model performance may deteriorate over time if the distribution of the input data changes significantly. Therefore, model performance may require continuous monitoring and retraining of the ML model using the same or similar manual, iterative model building process.”)

As per Claim 10, the combination of Li, Vespier, Romero, Cheng, and Chaoji teaches the apparatus of Claim 9.  Chaoji teaches wherein the means for updating the learning model is performed when in any one case among: if there is a change in a distribution of the data collected, and if a distribution of the data collected departs from a range defined by the user.  (Chaoji, Top of Col 7, discloses:  “The Model Factory 102 may also constantly monitor the performance of the deployed ML model 114 over time. For instance, the Model Factory 102 may receive all or part of the model input data 116 and monitor the input data 116 for changes in the data over a predetermined period of time. Additionally, the Model Factory 102 may receive all or part of the model output data 118 and similarly evaluate the data for changes over a predetermined period of time. If the Model Factory 102 determines that the model performance has deteriorated by monitoring a significant change in the distribution of input and/or output data over the predetermined period of time, the Model Factory 102 may automatically initiate retraining of the deployed ML model 114. The Model Factory 102 may then retrain and redeploy the retrained ML model.”  Here, Chaoji discloses that a means “Model Factory 102” for updating the learning model (“may then retrain and redeploy the retrained ML model”) is performed when there is a change in a distribution of the data collected (“If the Model Factory 102 determines that the model performance has deteriorated by monitoring a significant change in the distribution of input and/or output data over the predetermined period of time”)).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Chaoji with the combination of Li, Vespier, Romero, and Cheng, for at least the reasons recited in Claim 9.
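For illustration only, the condition relied upon from Chaoji, Top of Col 7 (initiating retraining upon a significant change in the distribution of the data over a predetermined period), may be sketched as follows.  This is a minimal sketch under stated assumptions: Chaoji does not specify the statistic used, so a shift in the mean is assumed here as a stand-in, and the function name, the sample windows, and the allowed range are hypothetical.

```python
from statistics import mean

def needs_retraining(reference, recent, allowed_range):
    """Return True when the mean shift between windows leaves the allowed range.

    reference:     data collected when the model was trained
    recent:        data collected over the latest monitoring period
    allowed_range: (low, high) bounds on the mean shift, assumed user-defined
    """
    low, high = allowed_range
    shift = mean(recent) - mean(reference)
    return not (low <= shift <= high)

# Hypothetical sensor readings.
reference = [10.0, 10.0, 10.0, 10.0]
stable = [10.0, 10.5, 9.5, 10.0]    # same mean as the reference window
drifted = [12.0, 12.0, 12.0, 12.0]  # mean shifted by 2.0

allowed = (-0.5, 0.5)

retrain_on_stable = needs_retraining(reference, stable, allowed)
retrain_on_drifted = needs_retraining(reference, drifted, allowed)
```

Under these assumptions, only the drifted window would trigger an update of the learning model.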

As per Claim 17, the combination of Li, Vespier, Romero, Cheng, and Chaoji teaches the method of Claim 13.  Romero teaches learning models generated in the transfer-learning model generation step.  (Romero, Page 4 Section 2.3, discloses:  “We train the FitNet in a stage-wise fashion following the student/teacher paradigm. Figure 1 summarizes the training pipeline. Starting from a trained teacher network and a randomly initialized FitNet”.  Here, Romero discloses a learning model which has finished pre-learning (“trained teacher network”), so as to generate a learning model (“student network”).  Romero discloses extracting useful information from the teacher network (“the softened output of an ensemble of wider networks, teacher network”) and for forwarding the extracted information to a learning unit below (“trains a student network, from the softened output of an ensemble of wider networks, teacher network”).  This results in learning models generated in the transfer-learning model generation step, as the training knowledge is transferred from the teacher network to the student network.)
However, Romero does not teach further comprising a step of periodically updating the learning models.
Chaoji teaches further comprising a step of periodically updating the learning models. (Chaoji, Top of Col 7, discloses:  “The Model Factory 102 may also constantly monitor the performance of the deployed ML model 114 over time. For instance, the Model Factory 102 may receive all or part of the model input data 116 and monitor the input data 116 for changes in the data over a predetermined period of time. Additionally, the Model Factory 102 may receive all or part of the model output data 118 and similarly evaluate the data for changes over a predetermined period of time. If the Model Factory 102 determines that the model performance has deteriorated by monitoring a significant change in the distribution of input and/or output data over the predetermined period of time, the Model Factory 102 may automatically initiate retraining of the deployed ML model 114. The Model Factory 102 may then retrain and redeploy the retrained ML model.”  Here, Chaoji discloses updating the learning models (“may then retrain and redeploy the retrained ML model”) periodically (a “period of time”, in “a significant change in the distribution of input and/or output data over the predetermined period of time”)).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine Chaoji with the combination of Li, Vespier, Romero, and Cheng, for at least the reasons recited in Claim 1.

Claims 12 and 19 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Li, Vespier, Romero, and Cheng in view of Yang (“The Optimization of NN Classification: Based on Feature Selection with Genetic Algorithm & Hidden Neuron Pruning”).
As per Claim 12, the combination of Li, Vespier, and Romero teaches the apparatus of claim 11.  Li teaches multi-feature evaluation unit (Li, Pg 3891 Section 3.1 Lines 4-6, discloses “Time–frequency analysis (TFA) has received considerable attention as a powerful tool for analyzing time-varying nonstationary signals [29]. Nonstationary signal analysis is one of the main topics in the field of fault diagnosis. The TFA can identify the time-varying features and is an effective tool to extract fault information contained in nonstationary signals” Here, Li discloses a multi-feature evaluation unit, as Li discloses, “identify the time-varying features”, wherein the word “features” is plural, and thus multiple features, and this is used to evaluate (“analysis”)).
However, Li does not teach wherein the at least one processor is caused to further perform a multi-feature combination and optimization unit for repetitively performing combination of the multiple features until an optimal combination of the multiple features according to a loss is acquired based on the learning results inputted in the multi-feature evaluation unit.
Yang teaches wherein the at least one processor is caused to further perform a multi-feature combination and optimization unit for repetitively performing combination of the multiple features until an optimal combination of the multiple features according to a loss is acquired.  (Yang, Pg 2 Last Paragraph, discloses:  “To implement GA, we require a set of random numbers as “chromosome (also called DNA)” – usually built with binary numbers – to mask each subset. Only the feature whose counter point is “1” in chromosome could be kept to next step.”  Here, Yang discloses combination of the multiple features, which Yang calls a “chromosome”; the chromosome holds a binary value of 0 or 1 for each feature, and thus represents a combination of multiple features.  Yang’s GA is a “genetic algorithm”, and the process described is similar to the description in the Instant Specification [0054]:  “In an embodiment, a global optimization technique such as a genetic algorithm may be used for optimization of the multi-feature combination. More specifically, a single genome can be constructed by combining an object that combines binary information of multiple features as shown in FIG. 11”.  Furthermore, Yang Pg. 2 Section 1.1.2 concludes with:  “A chromosome with better fitness (higher or lower in this own evaluation system) could be easier to be continued as parent, and then reproduce the child with better fitness”.  Here, Yang discloses an optimal combination of the multiple features (“A chromosome with better fitness”). Yang, Page 4 Section 2.3 Para 3, discloses:  “For each individual chromosome in each generation, train a temporary neural network with “Epoch” times iteration independently. After masking training input set with chromosome, input the rest feature into temporary NN as patterns. After enough iterations, to avoid the overfitting or coincidence of dataset, we should test each neural network with validation set instead of using testing set (i.e. using test set to process this step is same as “cheating”) and then extract the “fitness” – which is usually decided by what kind of loss function (MSE or Cross Entropy) the neural network applied or only the accuracy.(This report applies accuracy as the result of fitness.)”  Here, Yang discloses that the optimal combination of the multiple features is evaluated according to a loss (“’fitness’ – which is usually decided by what kind of loss function (MSE or Cross Entropy) the neural network applied”).  Yang also discloses repetitively performing combination of the multiple features, as Yang discloses “iterations”, disclosing:  “For each individual chromosome in each generation, train a temporary neural network with “Epoch” times iteration independently. After masking training input set with chromosome, input the rest feature into temporary NN as patterns. After enough iterations…”).  Therefore, Yang discloses a multi-feature combination and optimization unit for repetitively performing combination of the multiple features until an optimal combination of the multiple features according to a loss is acquired.)
Yang and the combination of Li, Vespier, Romero, and Cheng are analogous art because they are all in the field of endeavor of machine learning.
	Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention, to combine the student-teacher learning of the combination of Li, Vespier, Romero, and Cheng, with the genetic algorithm of Yang.  One would be motivated to save resources by automatically producing a model with better results by maximizing the fitness of the model (Yang Pg. 2 Section 1.1.2: “Compared with some other heuristic rules, Genetic Algorithms could estimate a plenty of solutions.  Since the process of information in neural network (NN) is difficult to understand, it is important to generate solutions automatically rather than only engineering by human brain. Thus, people could design a better neural network with assistance of GA… A chromosome with better fitness (higher or lower in this own evaluation system) could be easier to be continued as parent, and then reproduce the child with better fitness.”)
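For illustration only, the chromosome masking and fitness-driven selection quoted from Yang above can be sketched as follows. This is a minimal sketch under stated assumptions, not Yang's implementation: `toy_fitness` is a hypothetical stand-in for the validation-set fitness that Yang obtains by training a temporary neural network per chromosome, and the population size, mutation rate, single-point crossover, and keep-the-top-half selection are arbitrary choices made for the example.

```python
import random


def mask_features(row, chromosome):
    """Keep only the features whose chromosome bit is 1."""
    return [value for value, bit in zip(row, chromosome) if bit == 1]


def toy_fitness(chromosome):
    """Hypothetical fitness: features 0 and 2 are assumed useful, and every
    other retained feature costs a small penalty."""
    useful = {0, 2}
    score = sum(1.0 for i, bit in enumerate(chromosome) if bit and i in useful)
    penalty = sum(0.1 for i, bit in enumerate(chromosome) if bit and i not in useful)
    return score - penalty


def evolve(num_features, fitness, generations=30, population_size=20,
           mutation_rate=0.05, seed=0):
    """Repeatedly recombine binary feature masks and return the fittest one seen."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(num_features)]
                  for _ in range(population_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        # Fitter chromosomes are more likely to continue as parents.
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: population_size // 2]
        children = []
        while len(children) < population_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randint(1, num_features - 1)  # single-point crossover
            child = mother[:cut] + father[cut:]
            # Flip each bit with a small probability.
            child = [1 - bit if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        population = children
        candidate = max(population, key=fitness)
        if fitness(candidate) > fitness(best):
            best = candidate
    return best
```

In this sketch, `evolve(6, toy_fitness)` returns a six-bit mask, and `mask_features` applies that mask to one input row before the row is passed to a model.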

As per Claim 19, Claim 19 is a method claim corresponding to apparatus Claim 12.  Claim 19 is rejected for the same reasons as Claim 12.

Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
Any inquiry concerning this communication or earlier communications from the examiner should be directed to LEONARD SIEGER whose telephone number is (571)272-9710. The examiner can normally be reached Mon-Fri 8:00 AM-5:00 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Ann Lo, can be reached at 571-272-9767. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/L.A.S./Examiner, Art Unit 2126
/NICHOLAS KLICOS/Primary Examiner, Art Unit 2145