DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Drawings
The drawings are objected to because there is a blank sheet with no figure for page 3. 
Corrected drawing sheets in compliance with 37 CFR 1.121(d) are required in reply to the Office action to avoid abandonment of the application. Any amended replacement drawing sheet should include all of the figures appearing on the immediate prior version of the sheet, even if only one figure is being amended. The figure or figure number of an amended drawing should not be labeled as “amended.” If a drawing figure is to be canceled, the appropriate figure must be removed from the replacement sheet, and where necessary, the remaining figures must be renumbered and appropriate changes made to the brief description of the several views of the drawings for consistency. Additional replacement sheets may be necessary to show the renumbering of the remaining figures. Each drawing sheet submitted after the filing date of an application must be labeled in the top margin as either “Replacement Sheet” or “New Sheet” pursuant to 37 CFR 1.121(d). If the changes are not accepted by the examiner, the applicant will be notified and informed of any required corrective action in the next Office action. The objection to the drawings will not be held in abeyance.


Specification
The disclosure is objected to because of the following informalities: 
Para. [0002] says, “on the other hand, machine learning researchers have also researches this area, and get a consensus that emotion…” change “get” to “have”.
Para. [0014] says, “and can real time recognizing emotions over speech”; this can be rewritten as “can recognize emotions in speech in real-time” or the like.
There appears to be double spacing at the beginning of sentences in paras. 0021-0022, 0026-0027, 0030, 0032, 0034, 0036-0038, 0040-0041, 0043-0045. 
Appropriate correction is required.

Claim Objections
Claims 1 and 36 are objected to because of the following informalities: 
Line 11 of claim 1 says, “matrix based a length threshold of the feature…”; add “on” before “a”, as done correctly on line 15 of claim 21. Apply the same edit to claim 36.
Appropriate correction is required.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA ), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 5, 16, and 17 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA ),
second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA  35 U.S.C. 112, the applicant), regards as the invention.

Regarding claim 5, the term “100-400khz” is a relative term which renders the claim
indefinite. The term “100-400khz” is not defined by the claim, the specification does not provide a standard for ascertaining the requisite degree, and one of ordinary skill in the art would not be reasonably apprised of the scope of the invention. As written, one approach would be to interpret the term as 100khz-400khz; however, it is unclear whether that is appropriate given the art. For example, with regard to speech and human hearing, the specified range would not make sense, as 100hz – 4khz is appropriate for the art based on the teachings of Loizou, Mimicking the Human Ear; please refer to the rejection of claim 5 under 35 U.S.C. 103 below for more details. For examination purposes and based on ranges in the art directed to human hearing, 100-400khz will be interpreted as 100hz-4khz.

Claim 16 recites the limitation "wherein said training the machine learning model
comprises" in lines 1-2 of claim 16.  There is insufficient antecedent basis for this limitation in the claim. Claim 16 depends on claim 8, which depends on claim 1, and neither recites said training the machine learning model. Claims 12, 15, and 17 do give an indication of this matter; therefore, the intended dependency of claim 16 is unclear and the limitation lacks sufficient antecedent basis. 

Claim 17 recites the limitation "wherein said optimizing a plurality of model hyper
parameters further comprises" in lines 1-2 of claim 17.  There is insufficient antecedent basis for this limitation in the claim. Claim 17 depends on claim 9, which depends on claim 1, and neither recites said optimizing a plurality of model hyper parameters. Claim 16 does give an indication of this matter; therefore, the intended dependency of claim 17 is unclear and the limitation lacks sufficient antecedent basis. 

Claims 27-28, 31-33, and 35 are rejected under 112(b) as claims 27-28, 31-32, and 35
are dependent on canceled claim 13, where claim 33 is dependent on claim 32 which is rejected under 112(b) for the above reason. It appears that the claims correspond to independent claim 21 directed to an apparatus claim; therefore, claims are interpreted to be dependent on apparatus claim 21 hereinafter in the Non-Final Office Action for examination purposes. 

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA  35
U.S.C. 102 and 103 (or as subject to pre-AIA  35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.  

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness
rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


The factual inquiries for establishing a background for determining obviousness under
35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

This application currently names joint inventors. In considering patentability of the
claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claims 1-2, 9-10, 12, 15, 17-18, 21, 27-28, 31, 32, and 36 are rejected under 35 U.S.C.
103 as being unpatentable over Howard (US Pub. No. 2019/0074028 A1) in view of Fabian Mörchen, MusicMiner: Visualizing timbre distances of music as topographical maps hereinafter Fabian.
Regarding claim 1, Howard teaches a method for emotion recognition from speech,
comprising: 
receiving an audio signal (Para. 48, solutions regarding mental health problems may
include tools that are capable of automatically evaluating a voice signal, such as in call to a suicide hotline, to determine suicidal risk through speech analysis. Embodiments may provide real-time feedback to counsellors and rapid situation assessment to help the decision-making process i.e. receives audio signal for evaluation); 
performing data cleaning on the received audio signal (Para. 50, To analyze and generate feedback from both voices, the individual in crisis and the counsellor, pre-processing 104 may include the separation of the two speaker voices, as well as the corrections needed for its implementation, namely time delay and gain difference between the two recorded channels. Pre-emphasis may also be performed here as it is used in nearly all speech analysis systems, where para. 7 indicates that pre-emphasis is filtering on the audio signal to generate a pre-emphasis filtered signal); 
slicing the cleaned audio signal into at least one segment (Para. 73, Signal 402 may be input to feature extraction 404, in which features of the signal may be extracted. The objective of feature extraction is to acquire a set of discriminative speech indicators, which differentiate between speech and silence i.e. feature extracting with voice activity detection separates e.g. slices the cleaned audio signals; furthermore, para. 82 discusses short-time analysis framing i.e. the speech signal may be partitioned into short-time frames to achieve local stationarity. Each frame may be considered individually and characterized by a distinct feature vector); 
performing feature extraction on the at least one segment to extract a plurality of Mel frequency cepstral coefficients from the at least one segment (Paras. 93-94, feature extraction, the features may be divided according to the domain in which they are computed, either time (energy, zero crossing rate, etc.) or frequency (fundamental frequency, Mel Frequency Cepstrum Coefficients (MFCC)), i.e. the MFCCs are extracted after the framing is conducted); 
performing feature padding to pad the plurality of Mel frequency cepstral coefficients into a feature matrix based a length threshold of the feature matrix (Para. 154, The choice of the frame length and frame period may be determined as described above. A frame of 25 ms with 10 ms between each frame is widely used and embodiments may use such. The signal may be first padded with the necessary amount of zeros to obtain frames having all an equal number of samples independently of the recording duration; furthermore, para. 83, indicates frame lengths are commonly between 20 and 60 ms and may be set to 25 ms i.e. threshold of length for feature matrix); and 
performing machine learning inference on the feature matrix to recognize the emotion indicated in the audio signal (Para. 177, The Mel Frequency Cepstral Coefficients (MFCCs) are well-known acoustic feature representations, which have proven their efficiency in similar tasks, such as emotion analysis and severity depression estimation, where the classification module may be implemented using TensorFlow i.e. machine learning and deep learning in the recent years comes in parallel with several improvements, see para. 142 as machine learning is performed on the feature matrix of the extracted features to recognize the emotion indicated in the audio signal).
However, Howard fails to explicitly disclose: 
And a plurality of Bark frequency cepstral coefficients
In a related field of endeavor (e.g. clustering with the use of feature extraction, see abstract), Fabian discusses the Centroid and Flux were calculated in 5 variants using the raw Spectrum and four different frequency scalings (Bark, ERB, Mel, Octave), see para. 1 of pg. 9. The time series of the MFCC vectors per frame provide a de-correlated description of the short time spectra. But the Mel scale is not the only psychoacoustic frequency scale. We created variants of the MFCC using the Bark [Zwicker and Stevens, 1957], Equivalent Rectangular Bandwidth (ERB) [Moore and Glasberg, 1996], and Octave scales. The corresponding features are called BFCC, EFCC, and OFCC, respectively. The log transformed magnitudes of all frequency bands are used as additional low level features i.e. MFCCs and BFCCs features are extracted, see para. 4 of pg. 9 and para. 3 of pg. 10.
Modifying Howard to use the techniques disclosed by Fabian discloses:
performing feature extraction on the at least one segment to extract a plurality of Mel frequency cepstral coefficients and a plurality of Bark frequency cepstral coefficients from the at least one segment (e.g. Howard’s emotion recognition method now also including the feature of performing feature extraction on the at least one segment to extract a plurality of Mel frequency cepstral coefficients (Howard, see above) and Bark frequency cepstral coefficients as taught by Fabian, see para. 4 of pg. 9 and para. 3 of pg. 10).
It would have been obvious to one of ordinary skill in the art at the time the invention
was filed to apply the teachings of Fabian to the method of Howard. Doing so would
have been predictable to one of ordinary skill in the art given the similar nature between the
two disclosures, extracting features from voice signals. Bark Frequency Cepstral Coefficients (BFCCs) are regarded as variants of MFCCs and mainly just differ in the applied psychoacoustic scale, see para. 1 of pg. 9. Further, doing so would have provided the users of Howard, with the added benefits of enhancing the ability to cluster different genres by the different bands that are registered for the features as bark recognized the best in hip-hop while acoustic and metal correlated best with mel, see paras. 1-3 of pg. 26; furthermore, para. 2 of pg. 42 discusses the feature selection methods used evaluates features independently. But variables that seem useless on their own can actually increase classification performance when used in combination with others.
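For illustration only, the relationship between the Mel and Bark variants discussed above can be sketched as follows. The sketch assumes the commonly used approximation mel = 2595·log10(1 + f/700) and Traunmüller's approximation of the Bark scale; neither formula is quoted from Howard or Fabian, and all function names and the 100 Hz to 4 kHz example range are hypothetical. It shows only that filter-band centres equally spaced on either psychoacoustic scale can be obtained by swapping the frequency-warping function, which is consistent with BFCCs being regarded as variants of MFCCs.

```python
import math


def hz_to_mel(f_hz):
    """Common mel approximation (assumed here, not taken from the cited art)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def hz_to_bark(f_hz):
    """Traunmueller's Bark approximation (assumed here, not taken from the cited art)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53


def bark_to_hz(bark):
    """Algebraic inverse of hz_to_bark."""
    return 1960.0 * (bark + 0.53) / (26.28 - bark)


def band_centres(n_bands, f_lo, f_hi, to_scale, from_scale):
    """Return n_bands centre frequencies (Hz) equally spaced on the warped scale."""
    lo, hi = to_scale(f_lo), to_scale(f_hi)
    step = (hi - lo) / (n_bands + 1)
    return [from_scale(lo + step * (i + 1)) for i in range(n_bands)]


# Same routine, different warping function: only the scale changes.
mel_centres = band_centres(5, 100.0, 4000.0, hz_to_mel, mel_to_hz)
bark_centres = band_centres(5, 100.0, 4000.0, hz_to_bark, bark_to_hz)
```

Under these assumptions, the remaining cepstral steps (log of band energies followed by a decorrelating transform) would be the same for both variants.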

Regarding claim 2, Howard in view of Fabian teaches the method of claim 1 (see claim 1 above), in addition, Howard teaches:
wherein said performing data cleaning on the received audio signal further comprises at least one of the following: 
removing noise of the audio signal (Para. 77, Human hearing is most sensitive in a frequency range between 500 and 4000 Hz and SPL going from 35 to 80 dB. The different sound pressure levels of human hearing are shown in Table 1. In embodiments, the value may be set to for example, 55 dB, which should catch the talking while removing the noisy parts, as shown in FIG. 5. For example, the sound level of a channel 1 is shown relative to the threshold value of 55 dB i.e. removing periods of above or below a designated threshold where the samples are considered noise in regards to focus area of human hearing and SPL); 
removing silence in the beginning and end of the audio signal based on a silence threshold (Para. 77, the passage cited above also meets this limitation, as silence may be present at the beginning and end relative to the threshold set; furthermore, para. 73 indicates Signal 402 may be input to feature extraction 404, in which features of the signal may be extracted. The objective of feature extraction is to acquire a set of discriminative speech indicators, which differentiate between speech and silence as cepstral coefficients may also be used as VAD features i.e. thresholding with voice activity detection may differentiate between speech periods and silences as to remove noisy periods i.e. silences and speech signals over the designated threshold); and 
removing sound clips in the audio signal shorter than a predefined threshold (Para. 77, the passage cited above also meets this limitation, as it removes the noisy parts i.e. removing sound clips which are described as being above and below the threshold i.e. silences and background noise; in addition, shorter than a predefined threshold would be below as shorter means smaller).

Regarding claim 9, Howard in view of Fabian teaches the method of claim 1 (see claim 1 above), in addition, Howard teaches:
 wherein said performing feature padding further comprises: 
determining whether the length of the feature matrix reaches the length threshold
(Para. 154, The choice of the frame length and frame period may be determined as described above. A frame of 25 ms with 10 ms between each frame is widely used and embodiments may use such. The signal may be first padded i.e. to determine if padding is needed a determination whether the length of the feature matrix reaches the length threshold is conducted); 
when the length of the feature matrix does not reach the length threshold, calculating
the amount of data needs to be added to the feature matrix to reach the length threshold (Para. 154, The signal may be first padded with the necessary amount of zeros to obtain frames having all an equal number of samples independently of the recording duration i.e. this indicates that a calculation of amount of data to be added to feature matrix to reach the length threshold is conducted as it says necessary amount of zeros); and 
based on the calculated data amount, padding features extracted from a following
segment into the feature matrix to spread the feature matrix (Para. 154, To achieve local stationarity of the speech properties, the signal may be split into overlapping frames and the feature extraction may be performed on each frame i.e. overlapping frames are following segments padded into the feature matrix to spread the feature matrix and meet the length threshold; furthermore, As shown in FIG. 7, the time between two successive frames is represented by the frame period, T.sub.frame 706. The percentage of overlap between consecutive frames is derived from Equation 15. A typical frame period is of 10 ms, leading to an overlap of 60% in our case. The overall process is illustrated in FIG. 7, see para. 83).
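For illustration only, the framing arithmetic referenced above (a 25 ms frame with a 10 ms frame period, giving a 60% overlap, and zero padding so that all frames have an equal number of samples) can be sketched as follows. The 16 kHz sampling rate, the function names, and the list-based representation are assumptions made for this sketch and are not taken from Howard.

```python
import math


def overlap_fraction(frame_ms, period_ms):
    """Fraction of each frame shared with the next frame."""
    return (frame_ms - period_ms) / frame_ms


def zeros_needed(n_samples, frame_len, hop):
    """Number of trailing zeros so the signal splits into whole frames."""
    if n_samples <= frame_len:
        return frame_len - n_samples
    n_frames = 1 + math.ceil((n_samples - frame_len) / hop)
    return (n_frames - 1) * hop + frame_len - n_samples


def frame_signal(signal, frame_len, hop):
    """Zero-pad the signal, then split it into overlapping equal-length frames."""
    padded = list(signal) + [0.0] * zeros_needed(len(signal), frame_len, hop)
    return [padded[i:i + frame_len]
            for i in range(0, len(padded) - frame_len + 1, hop)]


# Assumed 16 kHz sampling rate: 25 ms -> 400 samples, 10 ms -> 160 samples.
frames = frame_signal([1.0] * 1000, frame_len=400, hop=160)
```

With these assumed values, a 1000-sample signal needs 40 zeros and yields five frames of 400 samples each.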

Regarding claim 10, Howard in view of Fabian teaches the method of claim 1 (see claim 1 above), in addition, Howard teaches:
wherein said performing feature padding further comprises: 
determining whether the length of the feature matrix reaches the length threshold
(Para. 154, The choice of the frame length and frame period may be determined as described above. A frame of 25 ms with 10 ms between each frame is widely used and embodiments may use such. The signal may be first padded i.e. to determine if padding is needed a determination whether the length of the feature matrix reaches the length threshold is conducted); 
when the length of the feature matrix does not reach the length threshold, calculating
the amount of data needs to be added to the feature matrix to reach the length threshold (Para. 154, The signal may be first padded with the necessary amount of zeros to obtain frames having all an equal number of samples independently of the recording duration i.e. this indicates that a calculation of amount of data to be added to feature matrix to reach the length threshold is conducted as it says necessary amount of zeros); and 
based on the calculated data amount, reproducing the available features in the feature
matrix to spread the feature matrix (Para. 117, The solution relates to the method of padding. The idea is to repeat the border values or extend the feature vector with zeros in order to always be able to compute the difference i.e. padding is referred to as adding zeros; however, it is discussed where it may be extended to repeat the border values as to reproduce and meet the length threshold by the calculated data amount, necessary amount to meet the threshold).
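For illustration only, the two padding options mentioned in the cited passage (extending with zeros, or repeating the border values) can be sketched as follows for a feature matrix stored as a list of per-frame feature vectors. The function name, the `mode` parameter, and the choice to pad only at the end are assumptions made for this sketch and are not taken from Howard.

```python
def pad_features(frames, target_len, mode="zeros"):
    """Pad a list of feature vectors up to target_len rows.

    mode="zeros" appends all-zero rows; mode="edge" repeats the last row.
    A matrix that already reaches target_len is returned unchanged (no truncation).
    """
    if not frames:
        raise ValueError("at least one feature vector is required")
    rows = [list(f) for f in frames]
    missing = target_len - len(rows)  # amount of data still needed
    if missing <= 0:
        return rows
    if mode == "zeros":
        filler = [0.0] * len(rows[0])
    elif mode == "edge":
        filler = rows[-1]
    else:
        raise ValueError("mode must be 'zeros' or 'edge'")
    return rows + [list(filler) for _ in range(missing)]


zero_padded = pad_features([[1, 2], [3, 4]], 4, mode="zeros")
edge_padded = pad_features([[1, 2], [3, 4]], 4, mode="edge")
```

In both modes the amount of added data is first computed from the difference between the target length and the current length, and only then is the filler appended.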

Regarding claim 12, Howard in view of Fabian teaches the method of claim 1 (see claim 1 above), in addition, Howard teaches:
wherein said performing machine learning inference on the feature matrix further comprises normalizing and scaling the feature matrix (Para. 103, The LLD is also normalized, to acquire a descriptor which is independent of the frame length i.e. the LLD may be a MFCC for example as stated in para. 95; furthermore, para. 76 indicates scaling of the feature matrix In embodiments. the sound pressure level may be chosen as a feature vector and a decibel threshold as decision rule. Once the signal of the channel containing only the counsellor's voice goes beyond the pre-defined threshold, the mixed voices channel may be cut. The sound pressure level is expressed in dB and is computed by taking the logarithm in base 10 of the ratio between the sound pressure (signal amplitude) and a reference value (p.sub.ref=2.Math.10.sup.−5, the lowest human hearable sound), the whole finally multiplied by 20, see Equation 10. No conceptual differences are induced by this scaling and it is meant to place the hearing threshold at 0 dB; furthermore, the MFCCs used in the feature matrix are being scaled according to the Mel-scale as expressed in para. 105).
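For illustration only, the sound pressure level computation described in the cited passage (20 times the base-10 logarithm of the ratio between the sound pressure and the reference value of 2·10^-5) can be sketched as follows. The function names are hypothetical, and the 55 dB default is the example threshold value mentioned in para. 77 as cited for claim 2; the sketch is not a reproduction of Howard's Equation 10.

```python
import math

P_REF = 2e-5  # reference sound pressure, the lowest human-hearable sound


def spl_db(pressure):
    """Sound pressure level in dB relative to P_REF."""
    if pressure <= 0:
        raise ValueError("pressure must be positive")
    return 20.0 * math.log10(pressure / P_REF)


def speech_mask(pressures, threshold_db=55.0):
    """Mark samples whose level reaches the decibel threshold."""
    return [p != 0 and spl_db(abs(p)) >= threshold_db for p in pressures]


mask = speech_mask([2e-5, 2e-2, 0.0])
```

Under this scaling the reference pressure itself maps to 0 dB, so the hearing threshold sits at 0 dB, and each tenfold increase in pressure adds 20 dB.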

Regarding claim 15, Howard in view of Fabian teaches the method of claim 1 (see claim 1 above), in addition, Howard teaches:
further comprising training a machine learning model to perform the machine learning inference (Para. 52, Model training and classification 108 may include training of a classification model using a selected dataset, and later, classification of input audio signals to generate predicted label output 110 where the model may be a neural network as specified by para. 74).

Regarding claim 17, Howard in view of Fabian teaches the method of claim 9 (see claim 9 above), in addition, Howard teaches in view of the 112(b) rejection above:
wherein said optimizing a plurality of model hyper parameters (Para. 167, The
AdamOptimizer is an improvement of the classical Stochastic Gradient Descent. It takes into consideration the moving averages of the parameters (momentum), which helps to dynamically adjust the hyperparameters i.e. AdamOptimizer is optimizing hyperparameters) further comprises:
generating the plurality of hyper parameters (Para. 167, The AdamOptimizer is an
improvement of the classical Stochastic Gradient Descent. It takes into consideration the moving averages of the parameters (momentum), which helps to dynamically adjust the hyperparameters i.e. AdamOptimizer is optimizing hyperparameters by generating them as they are being adjusted); 
training the machine learning model on sample data with the plurality of hyper
parameters (Para. 168, the model is trained multiple times with a diverse feature set extracted with the new implementation. All of the hyperparameters are similar for each run. Keeping them constant throughout the trials should help to define and show the relevance of the selected features); and 
finding the best machine learning model during training the machine learning model
(Para. 168, the model is trained multiple times with a diverse feature set extracted with the new implementation. All of the hyperparameters are similar for each run. Keeping them constant throughout the trials should help to define and show the relevance of the selected features; furthermore, para. 167 indicates that adjustments are made to the hyperparameters as to optimize i.e. find the best machine learning model during training).

Regarding claim 18, Howard in view of Fabian teaches the method of claim 9 (see claim 9 above), in addition, Howard teaches:
wherein the model hyper parameters are model shapes (Para. 140 reflects that a hyperparameter that shapes the model is the Connectionist Temporal Classification (CTC) objective, such as where a BLSTM-RNN is trained with a CTC objective function; this relates to the model architecture in that it takes shape into account, where the network weights are then updated by computing the error rate from the predicted and true character sequences; furthermore, para. 129 discusses the use of the TensorFlow threading and queues architecture, by which improvement of the model performance can be achieved and which shapes the model; moreover, para. 167 discusses the AdamOptimizer, which shapes the model as well).

Regarding claim 21, the claim is directed to an apparatus corresponding to the method presented in claim 1 and is rejected under the same grounds stated above regarding claim 1. In addition, Howard teaches:
a processor (Para. 184, the present systems and methods may include implementation on a system or systems that provide multi-processor, multi-tasking, multi-process, and/or multi-thread computing, as well as implementation on systems that provide only single processor, single thread computing. Multi-processor computing involves performing computing using more than one processor); and
a memory (Para. 186, The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing i.e. examples of non-transitory computer-readable storage medium given);
wherein computer programmable instructions for implementing a method for emotion recognition from speech are stored in the memory, and the processor is configured to perform the computer programmable instructions to (Para. 185, The present invention may be a system, a method, and/or a computer program product at any possible technical detail level of integration. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention. The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device):

Regarding claim 27, the claim is directed to an apparatus corresponding to the method presented in claim 9 and is rejected under the same grounds stated above regarding claim 9.

Regarding claim 28, the claim is directed to an apparatus corresponding to the method presented in claim 10 and is rejected under the same grounds stated above regarding claim 10.

Regarding claim 31, Howard in view of Fabian teaches the apparatus of claim 21 (see claim 21 above), in addition, Howard teaches:
wherein said performing machine learning inference on the feature matrix further comprises feeding the feature matrix into a machine learning model (Para. 134, Before feeding the Neural Network, the input speech data are converted into Mel-Frequency Cepstral Coefficients (MFCC). It is this matrix of feature vectors that is given as input to the model i.e. machine learning model inferences on feature matrix fed into the model where para. 7 indicates that neural networks are used).

Regarding claim 32, the claim is directed to an apparatus corresponding to the method presented in claim 15 and is rejected under the same grounds stated above regarding claim 15.

Regarding claim 36, the claim is directed to a non-transitory computer-readable storage medium corresponding to the method presented in claim 1 and the apparatus presented in claim 21, and is rejected under the same grounds stated above regarding claims 1 and 21.

Claim 5 is rejected under 35 U.S.C. 103 as being unpatentable over Howard in view of
Fabian and further in view of Philipos C. Loizou, Mimicking the Human Ear, hereinafter Loizou, in view of 112(b) rejection stated above.
Regarding claim 5, Howard in view of Fabian teaches the method of claim 1 (see claim 1 above), in addition, Howard teaches:
wherein said performing data cleaning on the received audio signal further comprises performing band-pass filtering on the received audio signal to control the frequency of the audio signal (Para. 104, Mel filtering. Human auditory perception is far from being linear, meaning that it is not equally responsive to all frequency components. Human hearing is more sensitive to lower frequencies, especially below 1000 Hz. Mel-filtering is like the human ear. It behaves as a series of band-pass filters, intensifying certain frequency bands. Those filters are irregularly distributed over the frequency range, with a greater concentration on the low rather than the high frequency side. Multiple algorithms have been implemented to represent the most realistic and relevant way in which human auditory system works, it performing band-pass filtering on the received audio signal to control the frequency of the audio signal according to human hearing to represent the most realistic and relevant way in which human auditory system works).
However, Howard fails to explicitly disclose:
to be 100-400kHz.
In a related field of endeavor (e.g. analyzing speech signal and auditory system, see background) Loizou teaches the Vienna/3M device in which it uses a frequency-equalization filter i.e. band-pass filter between 100-4000hz, please refer to 112(b) rejection stated above. It states that the device ensures that all frequencies in the range of 100hz to 4khz, which are very important for understanding speech, are audible to the patients, see paras. 1-2 on pg. 109 under Vienna/3M Device header. 
Modifying Howard in view of Fabian to use the techniques disclosed by Loizou discloses:
wherein said performing data cleaning on the received audio signal further comprises performing band-pass filtering on the received audio signal to control the frequency of the audio signal to be 100-400kHz (e.g. Howard’s emotion recognition method in view of Fabian where it conducts a band-pass filter to control the frequency of the audio signal now also including the feature where it controls the frequency to be 100hz – 4khz as taught by Loizou, see paras. 1-2 on pg. 109 under Vienna/3M Device header).
It would have been obvious to one of ordinary skill in the art at the time the invention
was filed to apply the teachings of Loizou to the method of Howard in view of Fabian. Doing so would have been predictable to one of ordinary skill in the art given the similar nature between the disclosures, speech signals and human auditory system. Further, doing so would have provided the users of Howard in view of Fabian, with the added benefits of having a filter with ranges that are very important for understanding speech that are audible to patients; furthermore, it helps to preserve fine temporal variations in the speech signal, as recognized by Loizou, see paras. 1-2 on pg. 109 under Vienna/3M Device header and second para. of pg. 110.
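For illustration only, band-pass filtering of an audio signal to approximately the 100 Hz to 4 kHz range discussed above can be sketched with a single second-order (biquad) section. The coefficient formulas follow the widely circulated Audio EQ Cookbook band-pass form with 0 dB peak gain; the 16 kHz sampling rate, the function names, and the use of one stage with a geometric-mean centre frequency are assumptions made for this sketch, and none of it is taken from Howard, Fabian, or Loizou.

```python
import cmath
import math


def bandpass_biquad(f_lo, f_hi, fs):
    """Return normalized (b, a) coefficients for a one-stage band-pass filter."""
    f0 = math.sqrt(f_lo * f_hi)       # geometric-mean centre frequency
    q = f0 / (f_hi - f_lo)            # quality factor from the bandwidth
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = [alpha / a0, 0.0, -alpha / a0]
    a = [1.0, -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0]
    return b, a


def gain(b, a, f, fs):
    """Magnitude of the filter's frequency response at frequency f (Hz)."""
    z1 = cmath.exp(-1j * 2.0 * math.pi * f / fs)
    num = b[0] + b[1] * z1 + b[2] * z1 * z1
    den = a[0] + a[1] * z1 + a[2] * z1 * z1
    return abs(num / den)


def apply_filter(b, a, samples):
    """Filter samples with the direct-form I difference equation."""
    out, x1, x2, y1, y2 = [], 0.0, 0.0, 0.0, 0.0
    for x in samples:
        y = b[0] * x + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out


FS = 16000.0  # assumed sampling rate
b, a = bandpass_biquad(100.0, 4000.0, FS)
impulse_response = apply_filter(b, a, [1.0, 0.0, 0.0, 0.0])
```

A single stage of this kind passes the centre of the band at unity gain and attenuates frequencies well outside it; because of the bilinear-transform warping, the exact band edges differ slightly from the nominal 100 Hz and 4 kHz, and a steeper response would require cascading additional stages.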


Claims 8, 16, and 33 are rejected under 35 U.S.C. 103 as being unpatentable over
Howard in view of Fabian further in view of Sieracki (US Pat. No. 9,691,395 B1).
Regarding claim 8, Howard in view of Fabian teaches the method of claim 1 (see claim 1 above).
However, Howard fails to explicitly disclose:
wherein the length threshold is not less than 1 second.
In a related field of endeavor (e.g. classification of speech segments, see abstract), Sieracki discloses, The partitioning of each speech segment into sub-segments involves a tradeoff between providing more feature vectors for each speech segment and maintaining large enough sub-segments to capture characteristic signature aspects of a speaker's vocalization. FIG. 3b shows graphic plots of both total classification accuracy and worst case accuracy (in percentage) vs. sub-segment time size (in seconds). As judged by both total accuracy and worst case accuracy for individual speakers, the resulting plots reveal for this dataset that a sub-segment size of approximately 3 seconds, see lines 38-50 on col. 27. 
Modifying Howard in view of Fabian to include the features disclosed by Sieracki discloses:
wherein the length threshold is not less than 1 second (e.g. Howard’s emotion recognition method in view of Fabian now also including the feature wherein the length threshold is not less than 1 second as taught by Sieracki, see lines 38-50 on col. 27).
It would have been obvious to one of ordinary skill in the art at the time the invention
was filed to apply the teachings of Sieracki to the method of Howard in view of Fabian. Doing so would have been predictable to one of ordinary skill in the art given the similar nature between the disclosures, extracting features from voice signals. Further, doing so would have provided the users of Howard in view of Fabian, with the added benefits of optimal classification performance for each feature vector. In fact, increasing the segmentation to include 3 second segments that overlap by ½ second (i.e., 13 feature vectors per segment) led to elimination of the WG misclassification shown in the second table of FIG. 3a, which yielded a 98.75% accuracy rate as recognized by Sieracki, see lines 50-53 on col. 27.

Regarding claim 16, Howard in view of Fabian and further in view of Sieracki teaches the method of claim 8 (see claim 8 above). In addition, in view of the 112(b) rejection above, Howard teaches:
wherein said training the machine learning model (Para. 52, Model training and classification 108 may include training of a classification model using a selected dataset, and later, classification of input audio signals to generate predicted label output 110, where the model may be a neural network as specified by para. 74) comprises: 
optimizing a plurality of model hyper parameters (Para. 167, The AdamOptimizer is an improvement of the classical Stochastic Gradient Descent. It takes into consideration the moving averages of the parameters (momentum), which helps to dynamically adjust the hyperparameters; i.e., the AdamOptimizer is optimizing hyperparameters); 
selecting a set of model hyper parameters from the optimized model hyper parameters (Para. 167, as it is adjusting the hyperparameters, selection is being done from the AdamOptimizer to perform with the corresponding adjustments); and 
measuring the performance of the machine learning model with the selected set of model hyper parameters (Para. 170, However, it is relevant to remind that the validation is only performed on a single sample and the cost can easily change according to the complexity of this sample. The average time per epoch and thus the total process is a little bit longer with the full version of the feature set, which absolutely makes sense, since more computations are required for the additional feature representations. However, the time difference of the total process remains quite small. Furthermore, comparing the learning curves shown in FIG. 26, the Label Error Rate (LER), described above, used as an accuracy reference; i.e., as the epochs are performed with adjusted optimized hyperparameters, the model is evaluated according to the Label Error Rate (LER), used as an accuracy reference).
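
The three recited steps (optimizing hyperparameters, selecting a set, and measuring performance) can be pictured with a generic search loop. This is a minimal sketch under stated assumptions: the toy 1-D nearest-neighbour classifier, the data, the candidate values of `k`, and all helper names are hypothetical, and the error rate is a plain fraction of mislabeled samples rather than the Label Error Rate defined in Howard. It does not represent Howard's AdamOptimizer or the applicant's method.

```python
from typing import List, Tuple

Sample = Tuple[float, int]  # (feature value, binary label)


def knn_predict(train: List[Sample], x: float, k: int) -> int:
    """Majority vote among the k training points closest to x."""
    nearest = sorted(train, key=lambda s: abs(s[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0


def error_rate(train: List[Sample], data: List[Sample], k: int) -> float:
    """Fraction of samples in `data` that the classifier mislabels."""
    wrong = sum(knn_predict(train, x, k) != y for x, y in data)
    return wrong / len(data)


def select_k(train: List[Sample], validation: List[Sample],
             candidates: List[int]) -> int:
    """Evaluate every candidate k and keep the one with the lowest validation error."""
    return min(candidates, key=lambda k: error_rate(train, validation, k))


# Hypothetical data; the point at 0.2 is deliberately mislabeled noise.
train = [(0.0, 0), (0.1, 0), (0.2, 1), (0.3, 0), (0.7, 1), (0.8, 1), (0.9, 1)]
validation = [(0.18, 0), (0.27, 0), (0.75, 1), (0.85, 1)]
test = [(0.05, 0), (0.22, 0), (0.72, 1), (0.95, 1)]

best_k = select_k(train, validation, [1, 3, 5])  # optimize and select
test_error = error_rate(train, test, best_k)     # measure performance
```

Keeping the validation data used for selection separate from the test data used for measurement is the design choice that makes the final figure an honest performance estimate.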

Claim 33 is directed to an apparatus claim corresponding to the method claim presented in claim 16 and is rejected under the same grounds stated above regarding claim 16.

Claims 19 and 35 are rejected under 35 U.S.C. 103 as being unpatentable over Howard in view of Fabian further in view of Amini et al. (US Pat. No. 9,812,151 B1) hereinafter Amini.
Regarding claim 19, Howard in view of Fabian teaches the method of claim 1 (see claim 1 above), in addition, Howard teaches:
Determining an emotional state based on inferences on the feature matrix, see abstract; furthermore, where machine learning is used as such through neural networks, see para. 7. Moreover, para. 46 specifies, Depression and emotions are inherently associated. Some robust correlations have been found between the behavior of depressed persons and the three affective dimensions, which are arousal, dominance and valence. However, it remains relatively difficult to quantitatively evaluate human emotions.
However, Howard fails to explicitly disclose:
wherein said performing machine learning inference on the feature matrix further comprises generating an emotion score for at least one of arousal, temper and valence.
In a related field of endeavor (e.g. determining a user’s emotion, mood, and/or personality based on verbal expressions, see abstract), Amini discloses classifying the user utterance into one or more classes involves classifying the user utterance as negative, neutral and/or positive. In various embodiments, the one or more classes includes additional classes to, for example, provide a higher level of fidelity in selecting classes (e.g., very positive, very negative). In some embodiments, the classifications negative, neutral and/or positive are each assigned a polarity value. The polarity value can be float values in the range of −0.75, 0.0, and +0.75. In some embodiments, the polarity value is interpreted as a sentiment score, which can be used as the pleasure value of the PAD model. In some embodiments, the polarity value is aggregated with the pleasure value of the sentiment scores (e.g., the word and/or sentence sentiment scores as described above). In various embodiments, the classification value is aggregated with the pleasure value, the arousal value and/or the dominance value of the PAD model values, see lines 1-5 on col. 8; furthermore, lines 46-49 on col. 7, classifying the user utterance into one or more classes can involve deep learning and Recurrent Neural Network (RNN) methods (e.g., as discussed in further detail below)…
Modifying Howard in view of Fabian to use the techniques disclosed by Amini discloses:
wherein said performing machine learning inference on the feature matrix further comprises generating an emotion score for at least one of arousal, temper and valence (e.g. Howard’s emotion recognition method in view of Fabian now also including the feature where an emotion score for at least one of arousal, temper, and valence is generated as taught by Amini, see lines 46-49 on col. 7 and lines 1-5 on col. 8).
It would have been obvious to one of ordinary skill in the art at the time the invention was filed to apply the teachings of Amini to the method of Howard in view of Fabian. Doing so would have been predictable to one of ordinary skill in the art given the similar nature of the disclosures, both of which determine emotion from speech. Further, doing so would have provided the users of Howard in view of Fabian with the added benefits of better understanding by the virtual agent of user utterances based on affective context (e.g., emotion, mood, personality and/or satisfaction of the user) as recognized by Amini, see lines 22-24 on col. 2. Furthermore, as recognized by Amini, A higher number of dialogue performance metrics used in the OCC cognitive appraisal model, a higher accuracy can result with regard to the prediction of the user's emotion and a number of emotions in the possible emotions list can be smaller, see lines 40-45 on col. 8.
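
The polarity-to-score mapping quoted from Amini can be sketched as follows. The three polarity values (−0.75, 0.0, +0.75) come from the quoted passage; the `PAD` container, the function name, and the use of a simple average as the aggregation rule are assumptions for illustration, since the quoted passage does not state how the values are aggregated.

```python
from dataclasses import dataclass

# Polarity values per class, as given in the quoted passage.
POLARITY = {"negative": -0.75, "neutral": 0.0, "positive": 0.75}


@dataclass
class PAD:
    """Pleasure, arousal, and dominance values of the PAD model."""
    pleasure: float
    arousal: float
    dominance: float


def aggregate_polarity(state: PAD, utterance_class: str) -> PAD:
    """Fold the class polarity into the pleasure value (assumed rule: plain average)."""
    polarity = POLARITY[utterance_class]
    return PAD((state.pleasure + polarity) / 2.0, state.arousal, state.dominance)
```

For example, a prior state with pleasure 0.25 combined with a "positive" utterance gives a pleasure value of 0.5 under this assumed rule, with arousal and dominance left unchanged.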

Claim 35 is directed to an apparatus claim corresponding to the method claim presented in claim 19 and is rejected under the same grounds stated above regarding claim 19.

Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure.
Yanmin Qian, VERY DEEP CONVOLUTIONAL NEURAL NETWORKS FOR ROBUST SPEECH RECOGNITION, describes, appropriate input padding and input feature map selection strategies are developed. In addition, an adaptation framework using joint training of very deep CNN with auxiliary features i-vector and fMLLR features is developed. These modifications give substantial word error rate reductions over the standard CNN used as baseline, see abstract. Section 4.3 on pg. 484 describes that padding in feature maps lets the neural network make better use of the border information of the feature maps, which is beneficial for the final performance across various strategies.

Che Guan, Very Short-Term Load Forecasting: Wavelet Neural Networks With Data Pre-Filtering, teaches, to further reduce distortion effects, padding strategies (e.g., zero-padding, periodic extension, and symmetrization) are tested. According to the test in Example 2 in Section V, symmetrization, a boundary replication which pads the loads by adding points symmetric to the original, is demonstrated to be the best strategy. This also corresponds with the conclusion on [31, p. 263]. These parameters are determined through training, validation, and test processes in a three-way data split, see paras. 1-2 on pg. 35. Lengths of the features requiring padding may be referred to in para. 4 of pg. 34.
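
The three padding strategies named above (zero-padding, periodic extension, and symmetrization) can be compared with a short sketch. The function name, the `mode` strings, and the choice to pad only on the right-hand side are assumptions for illustration; they are not drawn from Guan or from the application.

```python
from typing import List


def pad_right(values: List[float], target_len: int, mode: str = "zero") -> List[float]:
    """Extend `values` on the right to `target_len` using the named strategy."""
    if not values:
        raise ValueError("cannot pad an empty sequence")
    n = len(values)
    missing = target_len - n
    if missing <= 0:
        return list(values[:target_len])  # already long enough: crop instead
    if mode == "zero":
        extra = [0.0] * missing
    elif mode == "periodic":
        # repeat the sequence from its start
        extra = [values[i % n] for i in range(missing)]
    elif mode == "symmetric":
        # mirror about the right edge, repeating the edge sample
        mirrored = list(values[::-1]) + list(values)
        extra = [mirrored[i % (2 * n)] for i in range(missing)]
    else:
        raise ValueError(f"unknown padding mode: {mode}")
    return list(values) + extra
```

Padding `[1, 2, 3]` to length 7 gives `[1, 2, 3, 0, 0, 0, 0]` with zero-padding, `[1, 2, 3, 1, 2, 3, 1]` with periodic extension, and `[1, 2, 3, 3, 2, 1, 1]` with symmetrization.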

Dalibor Mitrovic, Features for Content-Based Audio Retrieval, teaches automatic speech recognition with the use of Fourier transform (see first two paragraphs under 5.5.1 on pg. 50). Fabian discusses aggregation of data including binning of frequencies, e.g. spectral binning into Bark- and Mel bands, see third paragraph of pg. 18. Variations of MFCCs. In the course of time several variations of MFCCs have been proposed. They mainly differ in the applied psychoacoustic scale. Instead of the Mel-scale, variations employ the Bark- [32], ERB- [33], and octave-scale [131]. A typical variation of MFCCs are Bark-frequency cepstral coefficients (BFCCs). However, cepstral coefficients based on the Mel-scale are the most popular variant used today, even if there is no theoretical reason that the Mel-scale is superior to the other scales, see second paragraph of pg. 51. Bark Frequency Cepstral Coefficients (BFCCs) are regarded as typical variations of MFCCs and mainly just differ in the applied psychoacoustic scale. Added benefits include enhancing the ability to simulate the human auditory system and improving performance in noisy environments, as recognized by Li, see paragraph 6 on pg. 51.
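
To make the difference in psychoacoustic scale concrete, the sketch below converts a frequency in Hz to the Mel and Bark scales. The two formulas are widely used published approximations (the 2595·log10 form for Mel and Traunmüller's formula for Bark); they are offered as an assumption about which variants might be used and are not taken from any reference cited in this action.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """Mel scale, common 2595 * log10(1 + f/700) approximation."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def hz_to_bark(f_hz: float) -> float:
    """Bark scale, Traunmueller approximation."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53
```

Under these formulas, 1000 Hz maps to roughly 1000 mel and roughly 8.5 Bark; cepstral coefficients built on either scale differ mainly in how the filter-bank bands are spaced along the frequency axis.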

Liu (CN 108305642 A) teaches, an emotion information of determining method and device, wherein, the method comprises: obtaining the target audio from the target audio identifying the first text information, target audio with audio characteristics, first text information with the text features based on the first text information with the text features and target audio has speech characteristic determination target emotion information of the target audio. The invention solves the technical problem of emotion information in relevant technology cannot accurately identify the speaker, see abstract.

Heigold (US Pub. No. 2017/0069327 A1) teaches, systems, methods, devices, and other techniques related to speaker verification, including (i) training a neural network for a speaker verification model, (ii) enrolling users at a client device, and (iii) verifying identities of users based on characteristics of the users' voices. Some implementations include a computer-implemented method. The method can include receiving, at a computing device, data that characterizes an utterance of a user of the computing device. A speaker representation can be generated, at the computing device, for the utterance using a neural network on the computing device. The neural network can be trained based on a plurality of training samples that each: (i) include data that characterizes a first utterance and data that characterizes one or more second utterances, and (ii) are labeled as a matching speaker’s sample or a non-matching speakers’ sample, see abstract. Para. 66 discusses the use of fixed lengths for utterances and the use of cropping or padding to a fixed length.

Hetherington (US Pub. No. 2014/0376742 A1) teaches, A subband filter may process the received microphone signal 118 to extract frequency information. The subband filter may be accomplished by various methods, such as a Fast Fourier Transform (FFT), critical filter bank, octave filter band, or one-third octave filter bank. Alternatively, the subband analysis may include a time-based filter bank. The time-based filter bank may be composed of a bank of overlapping bandpass filters, where the center frequencies have non-linear spacing such as octave, 3rd octave, bark, mel, or other spacing techniques. The one or more energy levels may be calculated for each frequency bin or band of the subband filter. The resulting balance gains may be filtered, or smoothed, over time and/or frequency, see para. 36.



Park (CN 1979491 A) teaches, a method of allowing multi-media player to analyze the character of music documents to classify music documents and its system. Said method of classifying music documents comprises: executing pre-treatment to execute decoding and normalizing at least one part of input music documents; extracting one or multiple characters from the pre-treated data; and confirming the mood of input music documents through using extracted character. Specifically, using information of the Bark frequency method may include tone to replace the Mel frequency method so as to greatly improve performance. In addition, using BFCC versions very greatly improves the correctness of the classification.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to JONATHAN E AMAYA HERNANDEZ whose telephone number is (571)272-2484. The examiner can normally be reached Monday - Friday 7:30 am - 3:30 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Flanders, can be reached on 571-272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.

/J.E.A./             Examiner, Art Unit 2655                   

/ANDREW C FLANDERS/             Supervisory Patent Examiner, Art Unit 2655