DETAILED ACTION
This office action is in response to Applicant’s submission filed on 10/8/2021. Claims 1-18 are pending in the application. As such, claims 1-18 were examined.

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Priority
Acknowledgment is made of applicant’s claim for foreign priority under 35 U.S.C. 119(a)-(d). The certified copy has been filed in parent Application No. CN201710725390.2, filed on 8/22/2017.

Information Disclosure Statement
The information disclosure statements (IDS) submitted on 10/8/2021 and 9/9/2022 have been considered by the examiner.

Double Patenting
The nonstatutory double patenting rejection is based on a judicially created doctrine grounded in public policy (a policy reflected in the statute) so as to prevent the unjustified or improper timewise extension of the “right to exclude” granted by a patent and to prevent possible harassment by multiple assignees. A nonstatutory double patenting rejection is appropriate where the conflicting claims are not identical, but at least one examined application claim is not patentably distinct from the reference claim(s) because the examined application claim is either anticipated by, or would have been obvious over, the reference claim(s). See, e.g., In re Berg, 140 F.3d 1428, 46 USPQ2d 1226 (Fed. Cir. 1998); In re Goodman, 11 F.3d 1046, 29 USPQ2d 2010 (Fed. Cir. 1993); In re Longi, 759 F.2d 887, 225 USPQ 645 (Fed. Cir. 1985); In re Van Ornum, 686 F.2d 937, 214 USPQ 761 (CCPA 1982); In re Vogel, 422 F.2d 438, 164 USPQ 619 (CCPA 1970); In re Thorington, 418 F.2d 528, 163 USPQ 644 (CCPA 1969).
A timely filed terminal disclaimer in compliance with 37 CFR 1.321(c) or 1.321(d) may be used to overcome an actual or provisional rejection based on nonstatutory double patenting provided the reference application or patent either is shown to be commonly owned with the examined application, or claims an invention made as a result of activities undertaken within the scope of a joint research agreement. See MPEP § 717.02 for applications subject to examination under the first inventor to file provisions of the AIA as explained in MPEP § 2159. See MPEP §§ 706.02(l)(1) - 706.02(l)(3) for applications not subject to examination under the first inventor to file provisions of the AIA. A terminal disclaimer must be signed in compliance with 37 CFR 1.321(b).
The USPTO Internet website contains terminal disclaimer forms which may be used. Please visit www.uspto.gov/patent/patents-forms. The filing date of the application in which the form is filed determines what form (e.g., PTO/SB/25, PTO/SB/26, PTO/AIA/25, or PTO/AIA/26) should be used. A web-based eTerminal Disclaimer may be filled out completely online using web-screens. An eTerminal Disclaimer that meets all requirements is auto-processed and approved immediately upon submission. For more information about eTerminal Disclaimers, refer to www.uspto.gov/patents/process/file/efs/guidance/eTD-info-I.jsp.
Claims 1-5 are rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1, 4 and 12 of U.S. Patent No. US11189302B2. Claims 7-11 and 13-17 mirror claims 1-5 and, as such, are likewise rejected on the ground of nonstatutory double patenting. The claims of the issued patent are narrower in scope than those of the instant application. Therefore, the claims of the issued patent anticipate the claims of the instant application. Please see the mapping below of claims 1-5 of the instant application against the issued patent, as well as the claim mappings for the individual claims.


| Element | Instant Application 17497298 | Element | Issued patent US11189302B2 |
| 1 | A speech emotion detection method performed by a device, comprising: | 1 | A speech emotion detection method, comprising: |
| | | 1.a | obtaining, by a processor, to-be-detected speech data; |
| 1.aa | extracting speech features of to-be-detected speech data to form a speech feature matrix corresponding to the to-be-detected speech data; | 1.b | generating speech frames based on framing processing and the to-be-detected speech data; |
| | | 1.c | extracting speech features corresponding to the speech frames to form a speech feature matrix corresponding to the to-be-detected speech data; |
| 1.bb | inputting the speech feature matrix to an emotion state probability detection model, the emotion state probability detection model being trained based on a recurrent neural network (RNN) model; | 1.d | inputting the speech feature matrix to an emotion state probability detection model, the emotion state probability detection model being trained based on a deep neural network (DNN) model; |
| 1.cc | based on the speech feature matrix and the RNN model, generating an emotion state probability matrix corresponding to the to-be-detected speech data; | 1.e | based on the speech feature matrix and the emotion state probability detection model, generating an emotion state probability matrix corresponding to the to-be-detected speech data by: |
| | | 1.f | obtaining an input layer node sequence according to the speech feature matrix; |
| | | 1.g | projecting the input layer node sequence to obtain a hidden layer node sequence corresponding to a first hidden layer; |
| | | 1.h | executing non-linear mapping logic based on a first set of parameters for the first hidden layer, the first set of parameters comprising the hidden layer node sequence for the first hidden layer, and weights and deviations of neuron nodes corresponding to the first hidden layer; |
| | | 1.i | obtaining, in response to executing of the non-linear mapping logic based on the first set of parameters, a hidden layer node sequence for a second hidden layer; |
| | | 1.j | successively obtaining, until identifying an output layer, hidden layer node sequences for subsequent hidden layers, respectively, |
| | | 1.k | in response to repeated corresponding executions of the non-linear mapping logic based on respective sets of parameters for the subsequent hidden layers, each of the respective sets of parameters comprising a hidden layer node sequence for a previous corresponding hidden layer, and weights and deviations corresponding to neuron nodes for the previous corresponding hidden layer; and |
| | | 1.l | obtaining an emotion state probability matrix that corresponds to the to-be-detected speech data and that is output by the output layer; |
| | | 1.m | inputting the emotion state probability matrix and the speech feature matrix to an emotion state transition model; |
| 1.dd | generating, based on the emotion state probability matrix, the speech feature matrix, and an emotion state transition model, an emotion state sequence corresponding to the to-be-detected speech data; and | 1.n | generating, based on the emotion state probability matrix, the speech feature matrix, and the emotional state transition model, an emotion state sequence corresponding to the to-be-detected speech data; and |
| 1.ee | determining, based on the emotion state sequence, an emotion state corresponding to the to-be-detected speech data. | 1.o | determining, based on the emotion state sequence, an emotion state corresponding to the to-be-detected speech data. |

| 2 | The method of claim 1, wherein generating the emotion state probability matrix corresponding to the to-be-detected speech data comprises: | | |
| 2.aa | obtaining an input layer node sequence according to the speech feature matrix; | 1.f | obtaining an input layer node sequence according to the speech feature matrix; |
| 2.bb | projecting the input layer node sequence to obtain a hidden layer node sequence for a first hidden layer; and | 1.g | projecting the input layer node sequence to obtain a hidden layer node sequence corresponding to a first hidden layer; |
| 2.cc | obtaining a hidden layer node sequence for a next hidden layer; | 1.i | obtaining, in response to executing of the non-linear mapping logic based on the first set of parameters, a hidden layer node sequence for a second hidden layer; |
| 2.dd | successively obtaining, until identifying an output layer, hidden layer node sequences for subsequent hidden layers, respectively; and | 1.j | successively obtaining, until identifying an output layer, hidden layer node sequences for subsequent hidden layers, respectively, |
| 2.ee | obtaining an emotion state probability matrix that corresponds to the to-be-detected speech data and that is output by the output layer. | 1.l | obtaining an emotion state probability matrix that corresponds to the to-be-detected speech data and that is output by the output layer; |

| 3 | The method of claim 1, wherein determining, based on the emotion state sequence, the emotion state corresponding to the to-be-detected speech data further comprises: | | |
| 3.aa | extracting non-silent speech sub-segments from silent frames of the to-be-detected speech data; and | 4.b | segmenting the to-be-detected speech data according to the silent frame to obtain non-silent speech sub-segments; and |
| 3.bb | determining, based on the emotion state sequence corresponding to the non-silent speech sub-segments, emotion states corresponding to the non-silent speech sub-segments. | 4.c | determining, based on the emotion state sequences corresponding to the non-silent speech sub-segments, emotion states corresponding to the non-silent speech sub-segments. |

| 4 | The method of claim 3, | | |
| 4.aa | wherein the emotion state sequence comprises a silent state, and wherein each of the silent frames corresponds to the silent state. | 4.a | detecting a silent frame in the to-be-detected speech data based on a silent state comprised in the emotion state sequence; |

| 5 | The method of claim 1, further comprising: | | |
| 5.aa | extracting training speech features corresponding to the training speech frames to form a training speech feature matrix; | 5.c | extracting training speech features corresponding to the training speech frames to form a training speech feature matrix; |
| 5.bb | obtaining standard emotion state labels corresponding to training speech frames; | 5.d | obtaining a standard emotion state label corresponding to the training speech frame, wherein the standard emotion state label comprises a silent label; |
| 5.cc | training the emotion state probability detection model based on the training speech feature matrix being an input of the emotion state probability detection model and standard emotion state labels corresponding to the training speech features being a predetermined output of the emotion state probability detection model; | 5.e | training the emotion state probability detection model based on the training speech feature matrix being an input of the emotion state probability detection model and standard emotion state labels corresponding to the training speech features being a predetermined output of the emotion state probability detection model; |
| 5.dd | determining an error measurement satisfies a predetermined condition, the error measurement based on a probability for the emotion state and a predetermined probability for the standard emotion state labels; and | 5.f | determining an error measurement satisfies a predetermined condition, the error measurement based on a probability for the emotion state and a predetermined probability for the standard emotion state label; and |
| 5.ee | completing training for the emotion state probability detection model in response to satisfaction of the predetermined condition. | 5.g | completing training for the emotion state probability detection model in response to satisfaction of the predetermined condition. |


Claims 6, 12, and 18 are rejected on the ground of nonstatutory double patenting as being unpatentable over claims 1, 7 and 13 of U.S. Patent No. US11189302B2, and in further view of Irie (US20090265170A1).

As to claims 6, 12, and 18, the issued patent teaches all of the limitations of claims 1, 7, and 13, respectively, as set forth above. However, the issued patent does not disclose wherein the emotion state transition model is trained by using a Hidden Markov Model (HMM) model. Irie, however, does teach this limitation (Irie, Par. 0063:” Then, in sub-step S114, the first statistical model used for calculation of the audio feature appearance probability and the second statistical model used for calculation of the emotional state transition probability are constructed by learning.”, and Par. 0066:”The conditional probability distribution pA[xt|Et] may be created for each possible value of Et using a probability model, such as a normal distribution, a mixed normal distribution and a hidden Markov model [HMM] of the appearance probability of xt. Furthermore, the conditional probability distribution may be created using a different probability model, such as a normal distribution, a multinomial distribution and a mixture thereof, depending on the type of the audio feature. A parameter of the probability model is estimated from the learning audio signal data by a conventional learning method, thereby completing the first statistical model.”, and Par. 0109:” Then, in step S150, based on the audio feature appearance probability and the emotional state transition probability calculated in steps S130 and S140, the emotional state probability is calculated.”, and Par. 0111:” The set of the two statistical models pA[xt|Et] and pB[Et|Et−1] has a structure collectively referred to as generalized state space model and has a causal structure similar to that of the left-to-right hidden Markov model [HMM] often used for audio recognition [the emotional states Et−1 and Et represented by reference symbol St1 and the audio features xt−1 and xt represented by reference symbol St2 shown in FIG. 5, for example].”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify US11189302B2 in view of Irie so that the emotion state transition model is trained by using a Hidden Markov Model (HMM) model, in order to calculate a conditional probability distribution related to the frame-based emotional state of a given utterance, as evidenced by Irie (see Par. 0065).
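For illustration only, the two-model structure quoted from Irie above (an appearance model pA[xt|Et] and a transition model pB[Et|Et−1]) can be sketched as a generic forward-filtering recursion. This is a minimal sketch under stated assumptions and is not Irie's formulas [5] and [6], which are not reproduced in this action. The function name `filter_emotion_probabilities`, the two-state toy values, and the uniform prior are all hypothetical.

```python
def filter_emotion_probabilities(appearance, transition, prior):
    """Generic forward filter (assumed form, not taken from Irie).

    appearance[t][j] stands in for p(x_t | E_t = j)
    transition[i][j] stands in for p(E_t = j | E_{t-1} = i)
    prior[j]         stands in for the initial state probability
    Returns posteriors[t][j], an estimate of p(E_t = j | x_1..x_t).
    """
    posteriors = []
    previous = prior
    for likelihoods in appearance:
        n = len(previous)
        # Predict: propagate the previous posterior through the transition model.
        predicted = [
            sum(previous[i] * transition[i][j] for i in range(n))
            for j in range(n)
        ]
        # Update: weight the prediction by the appearance probability, then normalize.
        unnormalized = [likelihoods[j] * predicted[j] for j in range(n)]
        total = sum(unnormalized)
        posterior = [u / total for u in unnormalized]
        posteriors.append(posterior)
        previous = posterior
    return posteriors


# Hypothetical toy values: two emotional states, three frames.
toy_prior = [0.5, 0.5]
toy_transition = [[0.9, 0.1], [0.2, 0.8]]
toy_appearance = [[0.7, 0.1], [0.6, 0.2], [0.1, 0.9]]
toy_posteriors = filter_emotion_probabilities(toy_appearance, toy_transition, toy_prior)
```

Each row of `toy_posteriors` is a probability distribution over the emotional states for one frame, which is the role the emotional state probability plays in the passages cited above.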

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.


Claims 1-4, 6, 7-10, 12, 13-16, and 18 are rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter without significantly more. The claims as a whole, considering all claim elements both individually and in combination, do not amount to significantly more than an abstract idea.
	
	Independent claims 1, 7, and 13 recite: “extracting speech features of to-be-detected speech data to form a speech feature matrix corresponding to the to-be-detected speech data; inputting the speech feature matrix to an emotion state probability detection model, the emotion state probability detection model being trained based on a recurrent neural network (RNN) model; based on the speech feature matrix and the RNN model, generating an emotion state probability matrix corresponding to the to-be-detected speech data; generating, based on the emotion state probability matrix, the speech feature matrix, and an emotion state transition model, an emotion state sequence corresponding to the to-be-detected speech data; and determining, based on the emotion state sequence, an emotion state corresponding to the to-be-detected speech data.”
	The limitations of “extracting speech features”, “inputting the speech feature matrix to an emotion state probability detection model”, generating the “speech feature matrix” and the “emotion state probability matrix”, and “determining an emotion state”, as drafted, cover mathematical (computational) activities; as such, they all point to an abstract idea.
	Extracting speech features is merely a mathematical process to obtain a matrix of numbers that represents speech. Inputting such a matrix into a model, which is nothing more than a black box containing an algorithm of some sort that operates on the input and provides an output representing the probability of an emotion state, is likewise a mathematical operation. Generation of such a matrix is furthermore a mathematical concept, since a matrix is a table of numbers and number manipulation is a mathematical concept. In summary, all of the above-mentioned items are mathematical concepts based on arithmetic calculations.
This judicial exception is not integrated into a practical application. In particular, claim 13 recites a non-transitory storage medium for storing computer readable instructions, the computer readable instructions, when executed by a processor in a device, causing the processor to perform a method for speech emotion detection as per the independent claim. See, for example, Par. 0131: “In an embodiment, the speech emotion detection apparatus provided in this application may be implemented as a form of a computer program. The computer program may be run in the computer device shown in FIG. 16. The non-volatile storage medium of the computer device may store the program logical components forming the speech emotion detection apparatus, for example, the obtaining logical component 1402, the extraction logical component 1404, the output logical component 1406, the emotion state sequence determining logical component 1408, and the emotion state determining logical component 1410 in FIG. 14. The program logical components may cause the computer device to perform the steps in the speech emotion detection method of the embodiments of this application described in this specification. The processor of the computer device can invoke the program logical components of the speech emotion detection apparatus that are stored in the non-volatile storage medium of the computer device, to run corresponding readable instructions, to implement the functions corresponding to the logical components of the speech emotion detection apparatus in this specification. For example, the computer device may obtain to-be-detected speech data by using the obtaining logical component 1402 in the speech emotion detection apparatus shown in FIG. 14; perform framing processing on the to-be-detected speech data to obtain speech frames, and extract speech features corresponding to the speech frames to form a speech feature matrix by using the extraction logical component 1404; input the speech feature matrix to a trained emotion state probability detection model, and output an emotion state probability matrix corresponding to the to-be-detected speech data by using the output logical component 1406; input the emotion state probability matrix and the speech feature matrix to a trained emotion state transition model, to obtain a corresponding emotion state sequence by using the emotion state sequence determining logical component 1408, where the trained emotion state transition model includes a trained emotion state transition probability parameter; and determine according to the emotion state sequence, an emotion state corresponding to the to-be-detected speech data by using the emotion state determining logical component 1410.” These additional elements do not integrate the abstract idea into a practical application because they do not impose any meaningful limits on practicing the abstract idea (see MPEP 2106.05(f), 2106.04(d)). The claim is directed to an abstract idea.
Furthermore, the claim does not include additional elements that are sufficient to amount to significantly more than the judicial exception. As discussed above with respect to the integration of the abstract idea into a practical application, the additional element of using a computer is, due to its lack of specificity, considered a generic computer (or processor). See Par. 0028 of the Applicant’s Specification: “In some embodiments of this application, the terminal 11 may refer to a smart device having a data computing and processing function, including but not limited to, a smartphone (installed with a communications logical component), a palmtop computer, a tablet computer, a personal computer, and the like. The device terminal 11 is installed with an operating system, including but not limited to, an Android operating system, a Symbian operating system, a Windows mobile operating system, an Apple iPhone OS operating system, and the like. The device terminal 11 is installed with various application clients, such as an application client that may acquire speech data.” Mere instructions to apply an exception using a generic computer component cannot provide an inventive concept. Moreover, the limitations in the claims noted above, taken individually or as an ordered set, do not amount to significantly more than the judicial exception. As such, they are directed to an abstract idea as discussed, namely a mathematical concept. Thus, neither the additional elements nor the limitations, taken individually or as an ordered set, amount to significantly more than the judicial exception. The claims are not patent eligible.
	
Claims 2, 8, and 14 are directed toward a mathematical concept. These claims recite: wherein generating the emotion state probability matrix corresponding to the to-be-detected speech data comprises: obtaining an input layer node sequence according to the speech feature matrix; projecting the input layer node sequence to obtain a hidden layer node sequence for a first hidden layer; and obtaining a hidden layer node sequence for a next hidden layer; successively obtaining, until identifying an output layer, hidden layer node sequences for subsequent hidden layers, respectively; and obtaining an emotion state probability matrix that corresponds to the to-be-detected speech data and that is output by the output layer. Generating the emotion state probability matrix based on an input layer node sequence, and processing that sequence through multiple layers of a neural network with defined connectivity, is a mathematical concept which can be carried out with a generic computer. The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. The claims are directed toward an abstract idea. The claims are not patent eligible.
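For illustration only, the layered computation recited in claims 2, 8, and 14 can be sketched as a sequence of weighted sums followed by a non-linear mapping. This is a minimal sketch under stated assumptions, not the Applicant's implementation: the sigmoid non-linearity, the softmax output layer, the function names, and all toy weights are hypothetical, since the claims recite only a non-linear mapping with weights and deviations.

```python
import math


def sigmoid(z):
    # Assumed non-linearity; the claims do not name a specific function.
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    # Assumed output normalization so that each row sums to 1.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def affine(vector, weights, biases):
    # One weighted sum plus deviation (bias) per neuron node.
    return [
        sum(w * x for w, x in zip(row, vector)) + b
        for row, b in zip(weights, biases)
    ]


def forward(feature_matrix, hidden_layers, output_layer):
    """Map each frame of the feature matrix to a row of state probabilities."""
    probability_matrix = []
    for frame in feature_matrix:
        nodes = frame  # input layer node sequence
        for weights, biases in hidden_layers:
            # Hidden layer node sequence for the next hidden layer.
            nodes = [sigmoid(z) for z in affine(nodes, weights, biases)]
        out_weights, out_biases = output_layer
        probability_matrix.append(softmax(affine(nodes, out_weights, out_biases)))
    return probability_matrix


# Hypothetical toy model: 2 features per frame, hidden layers of 3 and 2 nodes, 3 states.
toy_features = [[0.2, -0.4], [1.0, 0.5]]
toy_hidden = [
    ([[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]], [0.0, 0.1, -0.1]),
    ([[0.3, -0.1, 0.2], [-0.2, 0.4, 0.1]], [0.05, -0.05]),
]
toy_output = ([[0.5, -0.3], [-0.4, 0.2], [0.1, 0.1]], [0.0, 0.0, 0.0])
probability_matrix = forward(toy_features, toy_hidden, toy_output)
```

Under these assumptions, the result has one row per speech frame and one column per emotion state, and each row sums to 1.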

	Claims 3, 9, and 15 are directed toward a mathematical concept. These claims recite: wherein determining, based on the emotion state sequence, the emotion state corresponding to the to-be-detected speech data further comprises: extracting non-silent speech sub-segments from silent frames of the to-be-detected speech data; and determining, based on the emotion state sequence corresponding to the non-silent speech sub-segments, emotion states corresponding to the non-silent speech sub-segments. All of the activities mentioned above are considered mathematical concepts, and the algorithm performing them directs the claims toward an abstract concept. The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. The claims are directed toward an abstract idea. The claims are not patent eligible.

	Claims 4, 10, and 16 are directed toward a mathematical concept. These claims recite: wherein the emotion state sequence comprises a silent state, and wherein each of the silent frames corresponds to the silent state. Calculating an emotion state sequence based on another state and frame is also a mathematical algorithm which is carried out by a generic computer. The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. The claims are directed toward an abstract idea. The claims are not patent eligible.

Claims 6, 12, and 18 are directed toward a mathematical concept. These claims recite: wherein the emotion state transition model is trained by using a Hidden Markov Model (HMM) model. An emotion state transition model based on an HMM model relies on a predetermined relationship between the various nodes of that model; as such, it is a mathematical concept, and the algorithm performing it directs the claims toward an abstract concept. The claims do not include additional elements that are sufficient to amount to significantly more than the judicial exception. The claims are directed toward an abstract idea. The claims are not patent eligible.
	
Therefore, claims 1-4, 6, 7-10, 12, 13-16, and 18 are not patent eligible under 35 USC 101.


Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1, 3, 4, 6, 7, 9, 10, 12, 13, 15, 16, and 18 are rejected under 35 U.S.C. 103 as being unpatentable over Irie et al. (US20090265170A1) (herein “Irie”), and in further view of Lee et al. (Jinkyu Lee and Ivan Tashev; “High-level Feature Representation using Recurrent Neural Network for Speech Emotion Recognition”, Interspeech 2015; Sept 2015; pp. 1537 – 1540) (herein “Lee”).

Regarding claims 1, 7, and 13, Irie teaches [A speech emotion detection method performed by a device, comprising- claim 1], [A speech emotion detection system, comprising a hardware processor, the hardware processor configured to- claim 7], and [A non-transitory storage medium for storing computer readable instructions, the computer readable instructions, when executed by a processor in a device, causing the processor to- claim 13] (Irie, Par. 0190:” A content containing audio signal data that is externally input to an input part 210 shown in FIG. 12 in the form of a digital signal is temporarily stored in a hard disk drive 222 under the control of a central processing unit [CPU] 221, which is a controlling part.”, and Par. 0194:” The CPU 221 shown in FIG. 12 can execute a program that describes the processing functions of the audio feature extracting part 820, the audio feature appearance probability calculating part 830, the emotional state transition probability calculating part 840, the emotional state probability calculating part 850, the emotional state determining part 860 and the content summarizing part 870 of the emotion detecting apparatus 800 shown in FIG. 11 and implement the functions. The program is stored in the hard disk drive 222, for example, and a required program and data is loaded into a random access memory [RAM] 224 for execution. The loaded program is executed by the CPU 221.”).
extracting speech features of to-be-detected speech data to form a speech feature matrix corresponding to the to-be-detected speech data; (Irie, Par. 0018: “... a sequence of a temporal variation characteristic of the power and a temporal variation characteristic of a speech rate from the audio signal data for each analysis frame as an audio feature vector [matrix] and stores the audio feature vector in a storage part”).
generating, based on the emotion state probability matrix, the speech feature matrix, and an emotion state transition model, an emotion state sequence corresponding to the to-be-detected speech data; and (Irie, Par. 0186:” The emotional state probability calculating part 850 calculates the emotional state probability p[Et|{xt}] according to the formulas [5] and [6] based on the appearance probability p[xt|Et] calculated by the audio [speech] feature appearance probability calculating part 830 and the transition probability p[Et|Et−1] calculated by the emotional state transition probability calculating part 840.”, and Par. 0020:”calculates the probability of temporal transition of sequences of one or more emotional states as the emotional state transition probability using a second statistical model;”).
determining, based on the emotion state sequence, an emotion state corresponding to the to-be-detected speech data. (Irie, Par. 0054:” ... the sequence of all the emotional states arranged in descending order of emotional state probability may be determined. The determination may be performed for each section composed of one or more frames ...”, and Par. 0196:” An output part 240 has an additional function of extracting a part in an emotional state of the audio signal data of the input content and outputting a summarized content generated based on the extracted part under the control of a program executed by the CPU 221.”)
inputting the speech feature matrix to an emotion state probability detection model, the emotion state probability detection model being trained [[based on a recurrent neural network (RNN) model]]; (Irie, Par. 0049: “First, step S110 [a statistical model constructing step] is a step that is performed before actual determination of an emotional state in the emotion detecting method according to this embodiment, in which two statistical models used for calculating the emotional state probability [referred to as first statistical model and second statistical model] are constructed. Entities of the statistical models include parameters, such as functions and statistical quantities used for the statistical calculations, described in the form of programs.”, and Par. 0051: “Then, in step S130 [audio feature appearance [matrix] probability calculating step], based on the audio feature vectors [speech feature matrix] calculated and stored in the storage part in step S120, the probability of appearance [matrix] of an audio feature vector corresponding to an emotional state is calculated for each frame using the first statistical model previously constructed in step S110, and the result of the calculation is regarded as the audio feature appearance [matrix] probability.”)
based on the speech feature matrix [[and the RNN model]], generating an emotion state probability matrix corresponding to the to-be-detected speech data; (Irie, Par. 0109:” Then, in step S150, based on the audio feature appearance [matrix] probability and the emotional state transition probability calculated in steps S130 and S140, the emotional state probability [emotion state probability matrix] is calculated.”, and Par. 0118:” According to the method described above, the probability of the emotional state Et is determined by calculation based on the audio feature vector sequence {xt} up to the time t, and therefore, the processing can be performed in real time. On the other hand, if the real time processing is not required, in order to achieve more robust detection, the probability p(Et|{xT}) of the emotional state sequence Et in the case where the audio feature vector sequence {xT} up to the time T [>t] may be obtained is calculated, and the calculated probability may be regarded as the emotional state probability [emotion state probability matrix].”)
Irie fails to explicitly disclose that the emotion state probability detection model is trained based on an RNN model. However, Lee teaches emotional state recognition based on RNN and DNN models (Lee, Figure 1:” depicts a diagram of the conventional speech emotion recognition system based on DNN and ELM. This high-level block diagram contains high-level feature representation which is composed of frame-level feature extraction coupled to a Deep Neural Network [DNN] and further is fed to utterance-level classification which outputs emotional state.” And section 4, 2nd Par.:” For low-level acoustic features, we extract 32 features for every frame: F0 [pitch], voice probability, zero-crossing rate, 12-dimensional Mel-frequency cepstral coefficients [MFCC] with log energy, and their first time derivatives. In the DNN-based framework, we used as a baseline, those 32-dimensional vectors are expanded to 800-dimensional vectors using the context window with the size of 250ms. The network contains 3 hidden layers and each hidden layer has 256 nodes, and the weights were trained by back-propagation algorithm using stochastic gradient descent with mini-batch of 128 samples. In the RNN-based system, the 32-dimensional vectors are directly used for input. The network contains 2 hidden layers with 128 BLSTM cells [64 forward nodes and 64 backward nodes]. Later experiments showed that the performance did not improve with higher number of hidden layers and nodes in both DNN-based and RNN-based systems.”) Note: Lee teaches a method by which a speech feature matrix is input to an RNN/DNN, which then generates an emotion state probability matrix.
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Irie in view of Lee to employ RNN and DNN models, in order to provide an efficient learning approach that accounts for long contextual effects in emotional speech and the uncertainty of emotional labels, as evidenced by Lee (See Conclusion).
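For illustration only, the expansion of 32-dimensional frame features into 800-dimensional DNN inputs described in the Lee passage quoted above is consistent with stacking 25 neighboring frames (25 x 32 = 800). The sketch below assumes a 10 ms frame shift, so that a 250 ms context window spans 25 frames, and assumes that edge frames are padded by repetition; neither detail is stated in the quoted text, and the function name `stack_context` is hypothetical.

```python
def stack_context(frames, window=25):
    """Concatenate each frame with its neighbors inside a centered window.

    frames: list of equal-length feature lists (e.g., 32 values per frame).
    window: frames per context window (assumed 25 = 250 ms / 10 ms shift).
    Edge frames are padded by repeating the first/last frame (an assumption).
    """
    half = window // 2
    last = len(frames) - 1
    stacked = []
    for t in range(len(frames)):
        context = []
        for offset in range(-half, window - half):
            idx = min(max(t + offset, 0), last)  # clamp to the valid range
            context.extend(frames[idx])
        stacked.append(context)
    return stacked


# Example: 40 frames of 32 low-level features each (dummy values).
frames = [[float(t)] * 32 for t in range(40)]
expanded = stack_context(frames)
```

Under these assumptions, each of the 40 input frames yields one 800-dimensional vector, whereas the RNN-based system described in the same passage would consume the 32-dimensional vectors directly.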
 
Regarding claims 3, 9, and 15, Irie teaches wherein determining, based on the emotion state sequence, the emotion state corresponding to the to-be-detected speech data further comprises: extracting non-silent speech sub-segments from silent frames of the to-be-detected speech data; and (Irie, Par. 0135:” the ratio of the speech period is close to the ratio of the speech period [sound period] to the non-speech period [silent period] in a general case.”)
determining, based on the emotion state sequence corresponding to the non-silent speech sub-segments, emotion states corresponding to the non-silent speech sub-segments. (Irie, Par. 0033:” if the input audio signal data is divided into audio sub-paragraphs each including successive speech sections, and a content summary is extracted based on the emotional level of each audio sub-paragraph”, and Par. 0054:”… the emotional state that provides the maximum emotional state probability for each frame may be determined, a predetermined number of emotional states may be determined in descending order of emotional state probability from the emotional state that provides the maximum emotional state probability, or the sequence of all the emotional states arranged in descending order of emotional state probability may be determined. The determination may be performed for each section composed of one or more frames, such as an audio sub-paragraph and an audio paragraph…”)

Regarding claims 4, 10, and 16, Lee further teaches wherein the emotion state sequence comprises a silent state, and wherein each of the silent frames corresponds to the silent state. (Lee, Section 3.2:” To represent the uncertainty of emotional labels, in this paper we adopt an additional class for non-emotional frames – Null [silent]. Then, we represent the emotional label as a random variable between two states, one is the given emotion class and the other one is the additional class Null [silent]. Based on this assumption, we design a new training criterion for RNN to maximize the sum of log-probabilities of all possible sequences over the training data. Basically, there are 2^T possible sequences, where T is the number of frames in the given utterance. Among them, some sequences can be reasonable, but majority of sequences are not meaningful. For example, it is obvious that silence regions do not contain any emotional information. Thus, it is better to reduce the number of the possible sequences using a prior knowledge. First we divide each utterance into small segments with voiced region, then we assume that the label sequences of each segment follows the Markov chain shown in Figure 3. It means that the sequence from each segment starts from the Null [silent] state and goes through the relevant emotional state and finally goes back to the Null [silent] state. Then, we concatenated the label sequences of each segment to generate the sequences for the entire utterance. To be applicable for continuous emotion recognition, the last state of the current segment is merged with the first state of the next segment.”)
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Irie in view of Lee such that the emotion state sequence comprises a silent state and each of the silent frames corresponds to the silent state, in order to provide an efficient learning approach that accounts for long contextual effects in emotional speech and the uncertainty of emotional labels, as evidenced by Lee (See Conclusion).
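For illustration only, the reduction in candidate label sequences described in the Lee passage quoted above can be sketched as follows. The sketch assumes that, within one voiced segment, each of the three states in the Null, emotion, Null chain lasts at least one frame; that assumption and the function name `segment_label_sequences` are hypothetical and are not taken from Lee.

```python
def segment_label_sequences(num_frames, emotion, null="Null"):
    """Enumerate frame-label sequences of the form Null+ emotion+ Null+.

    Assumes each of the three states lasts at least one frame, which is an
    illustrative reading of the Null -> emotion -> Null chain, not a detail
    confirmed by the quoted passage.
    """
    sequences = []
    for lead in range(1, num_frames - 1):
        for mid in range(1, num_frames - lead):
            tail = num_frames - lead - mid
            sequences.append([null] * lead + [emotion] * mid + [null] * tail)
    return sequences


n = 6
constrained = segment_label_sequences(n, "happy")
unconstrained_count = 2 ** n  # every frame is either the emotion or Null
```

Under these assumptions, a 6-frame segment admits far fewer chain-consistent sequences than the unconstrained two-state labeling, which is the kind of pruning by prior knowledge that the quoted passage describes.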

Regarding claims 6, 12, and 18, Irie teaches wherein the emotion state transition model is trained by using a Hidden Markov Model (HMM) model. (Irie, Par. 0063:” Then, in sub-step S114, the first statistical model used for calculation of the audio feature appearance probability and the second statistical model used for calculation of the emotional state transition probability are constructed by learning.”, and Par. 0066:”The conditional probability distribution pA[xt|Et] may be created for each possible value of Et using a probability model, such as a normal distribution, a mixed normal distribution and a hidden Markov model [HMM] of the appearance probability of xt. Furthermore, the conditional probability distribution may be created using a different probability model, such as a normal distribution, a multinomial distribution and a mixture thereof, depending on the type of the audio feature. A parameter of the probability model is estimated from the learning audio signal data by a conventional learning method, thereby completing the first statistical model.”, and Par. 0109:” Then, in step S150, based on the audio feature appearance probability and the emotional state transition probability calculated in steps S130 and S140, the emotional state probability is calculated.”, and Par. 0111:” The set of the two statistical models pA[xt|Et] and pB[Et|Et−1] has a structure collectively referred to as generalized state space model and has a causal structure similar to that of the left-to-right hidden Markov model [HMM] often used for audio recognition [the emotional states Et−1 and Et represented by reference symbol St1 and the audio features xt−1 and xt represented by reference symbol St2 shown in FIG. 5, for example].”)
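For illustration only, one way to combine an audio feature appearance probability pA[xt|Et] with an emotional state transition probability pB[Et|Et−1] into a per-frame emotional state probability is the standard recursive Bayesian (forward) filter sketched below. This is a generic state space recursion offered under stated assumptions; it is not asserted to be the exact computation of Irie, and the function name, the initial prior, and the numeric values are hypothetical.

```python
def filter_emotion_probabilities(appearance, transition, prior):
    """Recursively compute p(E_t | x_1..x_t) for every frame t.

    appearance[t][e] : p_A(x_t | E_t = e), one list per frame.
    transition[p][e] : p_B(E_t = e | E_{t-1} = p).
    prior[e]         : initial state probability (an assumed input).
    """
    num_states = len(prior)
    belief = list(prior)
    history = []
    for t, likelihood in enumerate(appearance):
        if t == 0:
            predicted = belief
        else:
            # Propagate the previous belief through the transition model.
            predicted = [
                sum(belief[p] * transition[p][e] for p in range(num_states))
                for e in range(num_states)
            ]
        # Weight by the appearance probability and renormalize.
        unnormalized = [likelihood[e] * predicted[e] for e in range(num_states)]
        total = sum(unnormalized)
        belief = [u / total for u in unnormalized]
        history.append(belief)
    return history


# Two hypothetical states (index 0 and index 1) observed over two frames.
prior = [0.5, 0.5]
transition = [[0.9, 0.1], [0.2, 0.8]]
appearance = [[0.8, 0.2], [0.3, 0.7]]
history = filter_emotion_probabilities(appearance, transition, prior)
```

Each row of `history` is a probability distribution over the emotional states for one frame, so stacking the rows gives a frames-by-states matrix of the kind the claims refer to as an emotion state probability matrix.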



Claims 2, 8, and 14 are rejected under 35 U.S.C. 103 as being unpatentable over Irie and Lee, and in further view of Wu et al. (US20200154170A1)(herein “Wu”) and Li et al. (US 20190043482 A1)(herein “Li”).

Regarding claims 2, 8, and 14, Irie teaches obtaining an emotion state probability matrix that corresponds to the to-be-detected speech data [[and that is output by the output layer]]. (Irie, Par. 0026:” determines the emotional state of a section including the analysis frame based on the emotional state probability;”, and Par. 0027:” outputs information about the determined emotional state.”)
Irie and Lee fail to explicitly disclose, but Wu teaches, that the emotion state is output by the output layer (Wu, Par. 0151:” The recurrent layer may perform recurrent operations on outputs of the convolutional layer. It should be appreciated that, although FIG. 12 shows unidirectional recurrent operations in the recurrent layer, bidirectional recurrent operations may also be applied in the recurrent layer. The recurrent layer may also be referred to as a RNN layer, which may adopt long-short term memory [LSTM] units.”, and Par. 0152:” The output layer may use RNN states from the recurrent layer as feature vectors, and output emotion classification results. For example, the output layer may be a full connection layer that can convert a 256-dimension vector from the recurrent layer to an output of 8-dimension vector which corresponds to 8 types of emotions. In an implementation, the 8 types of emotions include happy, surprise, anger, disgust, sad, contempt, fear and neutral.”)
Since Irie, Lee, and Wu are analogous art from the same field of endeavor, it would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to use the known technique of outputting the emotion state from the output layer of a neural network in order to yield the predictable result of emotion detection. One of ordinary skill in the art would have recognized that the results of the combination were predictable: when a DNN or RNN is used, outputting the emotion state from the output layer of the network, as taught by Wu, would allow for emotion determination, which would benefit the emotion determination of Irie and Lee. See KSR International Co. v. Teleflex Inc., 82 USPQ2d 1385 (U.S. 2007).
Irie, Lee, and Wu fail to explicitly disclose, but Li teaches, obtaining an input layer node sequence according to the speech feature matrix; (Li, Par. 0102:” The input layer comprises a plurality of input units. The input units are used to calculate an output value output to the bottommost hidden layer according to input speech feature vectors. After the speech feature vectors are input to the input unit, the input unit calculates the output value output to the bottommost hidden layer by using the speech feature vectors input to the input unit according to its own weighted value.”)
projecting the input layer node sequence to obtain a hidden layer node sequence for a first hidden layer; and (Li, Par. 0102:” The input layer comprises a plurality of input units. The input units are used to calculate an output value output to the bottommost hidden layer according to input speech feature vectors [matrix]. After the speech feature vectors are input to the input unit, the input unit calculates the output value output to the bottommost [first] hidden layer by using the speech feature vectors input to the input unit according to its own weighted value.”)
obtaining a hidden layer node sequence for a next hidden layer; (Li, Par. 0101:” The deep neural network comprises an input layer, a plurality of hidden layers, and an output layer. The input layer is used to calculate an output value input to a hidden layer unit of a bottommost [first] layer according to the speech feature vectors input to the deep neural network. The hidden layer is used to, according to a weighted value of the present layer, perform weighted summation for an input value coming from next layer of hidden layer, and calculate an output value output to a preceding layer of hidden layer. The output layer is used to, according to the weighted value of the present layer, perform weighted summation for an output value coming from a hidden layer unit of a topmost layer of hidden layer, and calculate an output probability according to a result of the weighted summation. The output probability is output by the output unit, and represents a probability that the input speech feature vectors are the speech identities corresponding to the output unit.”)
successively obtaining, until identifying an output layer, hidden layer node sequences for subsequent hidden layers, respectively; and (Li, Par. 0101:” The deep neural network comprises an input layer, a plurality of hidden layers, and an output layer. The input layer is used to calculate an output value input to a hidden layer unit of a bottommost [first] layer according to the speech feature vectors input to the deep neural network. The hidden layer is used to, according to a weighted value of the present layer, perform weighted summation for an input value coming from next layer of hidden layer, and calculate an output value output to a preceding layer of hidden layer. The output layer is used to, according to the weighted value of the present layer, perform weighted summation for an output value coming from a hidden layer unit of a topmost layer of hidden layer, and calculate an output probability according to a result of the weighted summation. The output probability is output by the output unit, and represents a probability that the input speech feature vectors are the speech identities corresponding to the output unit.”)
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Irie, Lee, and Wu in view of Li to obtain an input layer node sequence according to the speech feature matrix; project the input layer node sequence to obtain a hidden layer node sequence for a first hidden layer; obtain a hidden layer node sequence for a next hidden layer; and successively obtain, until identifying an output layer, hidden layer node sequences for subsequent hidden layers, respectively, in order to substantially improve the recognition performance, as evidenced by Li (See Par. 0110).
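For illustration only, the layer-by-layer propagation described in the Li passages quoted above (input features, successive weighted summations through the hidden layers, and an output probability at the output layer) can be sketched as a small feedforward pass. The sigmoid hidden activation, the softmax output, the layer sizes, and all weight values below are assumptions chosen for the example and are not taken from Li or from the application.

```python
import math


def dense(inputs, weights, biases):
    """Weighted summation: one output per row of `weights`."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def sigmoid(values):
    return [1.0 / (1.0 + math.exp(-v)) for v in values]


def softmax(values):
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def forward(features, hidden_layers, output_layer):
    """Propagate one feature vector through every hidden layer in order,
    then compute output probabilities at the output layer."""
    activations = features
    for weights, biases in hidden_layers:
        activations = sigmoid(dense(activations, weights, biases))
    weights, biases = output_layer
    return softmax(dense(activations, weights, biases))


# Hypothetical toy network: 3 inputs -> 2 hidden -> 2 hidden -> 3 outputs.
features = [0.5, -1.0, 2.0]
hidden_layers = [
    ([[0.1, -0.2, 0.3], [0.0, 0.4, -0.1]], [0.0, 0.1]),
    ([[0.5, -0.5], [0.2, 0.3]], [0.0, 0.0]),
]
output_layer = ([[1.0, -1.0], [0.0, 0.5], [-0.5, 0.2]], [0.0, 0.0, 0.0])
probabilities = forward(features, hidden_layers, output_layer)
```

Applying `forward` to every frame of a speech feature matrix would, under these assumptions, produce one probability vector per frame.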

Claims 5, 11, and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Irie and Lee, and in further view of Weiming et al. (CN102831184A)(herein “Weiming”) and Natarajan et al. ("Automated stopping Criteria for Neural Network Training", Proceedings of the American Control Conference, June 1997, New Mexico, Pages 2409 - 2413)(herein “Natarajan”).

Regarding claims 5, 11, and 17, Irie teaches extracting training speech features corresponding to the training speech frames to form a training speech feature matrix; (Irie, Par. 0201:” All the frames included in the labeled sections in the learning audio signal data are extracted, and the frames are labeled the same as the sections from which the frames are extracted.”, and Par. 0070:” … in steps S111 to S113 described above, an audio feature vector x is extracted for each frame of the entire learning audio signal data, …”).
obtaining standard emotion state labels corresponding to training speech frames; (Irie, Par. 0070:” …  and a label indicating the emotional state e of the frame is determined for each frame of the entire learning audio signal data based on actual listening by a person.”, and Par. 0200:” ... a section in the learning audio signal data that is determined to be in the ‘emotional state’ is labeled with ‘emotional state’, and a section in the remaining sections that is determined to be in the ‘non-emotional state’ is labeled with ‘non-emotional state’.”, and Par. 0201:” All the frames included in the labeled sections in the learning audio signal data are extracted, and the frames are labeled the same as the sections from which the frames are extracted.”) Notes: Per as-filed spec. Par. 0073: The standard emotion state label refers to performing standard emotion labeling on the training speech frame with a known emotion state.
training the emotion state probability detection model based on the training speech feature matrix being an input of the emotion state probability detection model and standard emotion state labels corresponding to the training speech features being a predetermined output of the emotion state probability detection model; (Irie, Par. 0025:” ... the audio feature vector for each analysis frame and calculates the emotional state probability on condition of the audio feature vector for sequences of one or more emotional states using one or more statistical models constructed based on previously input learning audio signal data”, Par. 0070:”… an audio feature vector x is extracted for each frame of the entire learning audio signal data, and a label indicating the emotional state e of the frame is determined for each frame of the entire learning audio signal data based on actual listening by a person.”).
Irie and Lee fail to explicitly disclose, but Weiming teaches, determining an error measurement satisfies a predetermined condition, the error measurement based on a probability for the emotion state and a predetermined probability for the standard emotion state labels; and (Weiming, Par. 0065:” This module links to each other with characteristic extracting module; Major function is probability model and the emotion sequence label loss function that makes up the emotion sequence label; Learn out social affection's forecast model, said probability model is mapped to the probability of probability space with said emotion sequence label, and said emotion sequence label loss function characterizes the difference of emotion sequence label with the true emotion sequence label of order models output”).
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Irie and Lee in view of Weiming to determine that an error measurement satisfies a predetermined condition, the error measurement based on a probability for the emotion state and a predetermined probability for the standard emotion state labels, in order to overcome the deficiency of directly utilizing a traditional feature selection approach towards text classification, as evidenced by Weiming (See Par. 0020).
Irie, Lee, and Weiming fail to explicitly disclose, but Natarajan teaches, completing training for the emotion state probability detection model in response to satisfaction of the predetermined condition. (Natarajan, Section 1, second Par., P 2497:"At each training epoch the 'learning' NN predicts values for the validation set. Training is stopped when the error on the validation set stops changing any more or starts rising further.")
Therefore, it would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to modify Irie, Lee, and Weiming in view of Natarajan to complete training for the emotion state probability detection model in response to satisfaction of the predetermined condition, in order to automatically stop feedforward NN training, as evidenced by Natarajan (Section 6, P 2411).
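For illustration only, the validation-based stopping rule quoted from Natarajan above (stop when the validation error stops changing or starts rising) can be sketched as follows. The tolerance value, the function name `stopping_epoch`, and the example error values are hypothetical and are not taken from Natarajan.

```python
def stopping_epoch(validation_errors, tolerance=1e-4):
    """Return the index of the epoch at which training would stop.

    Training stops at the first epoch whose validation error either rises
    above the previous epoch's error or changes by less than `tolerance`.
    If neither happens, the last epoch index is returned.
    """
    if not validation_errors:
        raise ValueError("at least one validation error is required")
    for epoch in range(1, len(validation_errors)):
        change = validation_errors[epoch] - validation_errors[epoch - 1]
        if change > 0 or abs(change) < tolerance:
            return epoch
    return len(validation_errors) - 1


# Hypothetical validation errors recorded after each training epoch.
errors = [0.90, 0.60, 0.45, 0.40, 0.42, 0.50]
stop_at = stopping_epoch(errors)
```

With these example values, the error first rises at epoch index 4, so training would be treated as complete at that point; a plateau in the error would trigger the same result through the tolerance check.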


Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant’s disclosure.  Petrushin (US-6353810B1) teaches Col. 1, lines 39 – 47:” A system, method and article of manufacture are provided for comparing user versus computer emotion detection of voice signals. First, a voice signal and an emotion associated therewith are provided. Then, the emotion associated with the voice signal is determined in an automated manner and subsequently stored. Next, a user determined emotion associated with the voice signal is determined by a user and received. The automatically determined emotion with the user determined emotion are then compared.”
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DARIOUSH AGAHI whose telephone number is (408)918-7689. The examiner can normally be reached Monday - Thursday and alternate Fridays, 7:30-4:30 PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta can be reached on 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/DARIOUSH AGAHI/ Examiner, Art Unit 2656
/Paras D Shah/ Primary Examiner, Art Unit 2659

12/07/2022