DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement
The information disclosure statement (IDS) was submitted on 07/27/2020, the filing date of the instant application. The submission is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 1-12 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention.
■ Claims 1 and 11-12 recite the limitations “… a plurality of local features from data indicating a speech …” and “… the plurality of local features …”. It is unclear whether the “local features” are the same as features of the speech, and what “local” refers to. The claims provide insufficient definition of these limitations, which renders the claimed invention unclear.
Also, even if the limitation “… local features from data indicating a speech …” is interpreted as “features of a speech”, the phrase “features of a speech” is vague and indefinite and is not in full, clear, concise, and exact terms; the term “features” in the claims covers a very broad range of attributes/aspects, which renders the claims indefinite. The term "local features" is not defined by the claims, the specification does not provide a standard for ascertaining the requisite degree, and one of ordinary skill in the art would not be reasonably apprised of the scope of the invention.
Appropriate correction is required.

■ Claims 1 and 11-12 recite the limitations “… characteristics of feature extraction …”, “… characteristics of encoding the series of chronological features …”, “… characteristics of weighting the features …”, and “… characteristics of classification …”. The term “characteristics” in the claims covers a very broad range of attributes/aspects, which renders the claims indefinite. The term "characteristics" is not defined by the claims, the specification does not provide a standard for ascertaining the requisite degree, and one of ordinary skill in the art would not be reasonably apprised of the scope of the invention, thereby rendering the claimed invention unclear.
Appropriate correction is required. 
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1 and 11-12 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Fan et al. (US-PGPUB 2019/0287142 A1 hereinafter “Fan”).

Note: In this OC (Official Correspondence), limitations such as “local features from data indicating a speech”, “characteristics of … features”, inter alia, will be construed as “feature(s) of speech”.
	
As for claims 1 and 11-12, Fan discloses an information processing apparatus (Fig.1, Computing Device 102, Fig.5, Device 500), a method (Figs.1-5 and related description), and a non-transitory computer-readable storage medium (Fig.5, ROM 502, RAM 503, ¶ [9], a computer readable storage medium is provided, storing a computer program thereon, the computer program being executed by a processor, and ¶ [101], [105]), comprising:
one or more processors (Fig.5, 500, Central Processing Unit (CPU) 501); and a memory (Fig.5, ROM 502, RAM 503) storing instructions which, when the instructions are executed by the one or more processors (Fig.5, ¶ [101], device 500 includes a central processing unit (CPU) 501 that performs actions and processing in accordance with program instructions stored in a read only memory (ROM) 502 or computer program instructions loaded into a random access memory (RAM) 503), cause the information processing apparatus to function as:
extracting (Fig.4, Feature Extraction Module 420) a plurality of local features from data indicating a speech (Figs.1-5, ¶ [21], extracts features from data indicating a speech such as part-of-speech, emotion, etc.), wherein characteristics of feature extraction are formed through learning (Figs.1-5, ¶ [21] and [47], a learning network is used when extracting the features, and characteristics of features such as determining the degree of importance, etc.);
encoding (Fig.3, Convolution Learning Network (CNN) 310) a series of chronological features of the data indicating the speech based on the plurality of local features, wherein characteristics of encoding the series of chronological features are formed through learning (Figs.1-2, Fig.3, Convolution Learning Network (CNN) 310, Figs.4-5, ¶ [50]-[51], encoding a series of chronological features such as vectorized representations of features, also referred to as vectorized encodings, ¶ [60]-[64], convolution filters are used for feature extraction and referred to as (Convolution Learning Network) CNN 330, and the semantic feature 332 Q represents the semantic features (or semantic diversity encodings), and see ¶ [36]-[37], vector encodings (codebook) obtained by training a specific learning);
generating information obtained by weighting features (Fig.3, 300, 332, 342, Combiner 350) at a specific point in time associated with emotion classification, of the series of chronological features encoded (Figs.1-5, ¶ [22], a specific point in time such as frequency of occurrence and term frequency-inverse document frequency (TFIDF) “series of chronological”, and emotional features using a sentiment dictionary to perform emotional classification, ¶ [68], learning network 300 includes a combiner 350 for weighting the feature 332 using the degree of importance 342), wherein characteristics of weighting the features at the specific point in time are formed through learning (Figs.1-5, ¶ [66]-[68]); and
classifying emotion corresponding to the data indicating the speech using the information obtained by weighting the features at the specific point in time, wherein characteristics of classification are formed through learning (Figs.1-5, ¶ [21]-[22], machine learning classification models using defined features, and emotional features using a sentiment dictionary to perform emotional classification, and ¶ [24], [66]-[68], and other learning classification models such as the degree of importance of features being used to determine the corresponding weight value).
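For orientation only, the four functional limitations mapped above (extracting, encoding, weighting, classifying) can be sketched as a toy pipeline. This is a minimal illustration under stated assumptions and is not taken from Fan or from the application: every function name and numeric constant below is hypothetical, and the fixed constants merely stand in for parameters that the claims say are "formed through learning".

```python
import math

def softmax(values):
    """Turn arbitrary scores into non-negative weights that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def extract_local_features(signal, width=2):
    """Hypothetical extractor: mean and range of each non-overlapping window."""
    feats = []
    for start in range(0, len(signal) - width + 1, width):
        window = signal[start:start + width]
        feats.append([sum(window) / width, max(window) - min(window)])
    return feats

def encode_series(feats, decay=0.5):
    """Hypothetical encoder: each state mixes the current feature with the previous state."""
    state = [0.0] * len(feats[0])
    encoded = []
    for feat in feats:
        state = [decay * s + (1 - decay) * x for s, x in zip(state, feat)]
        encoded.append(state)
    return encoded

def attention_pool(encoded, query):
    """Weight every time step by softmax(query . state) and sum the weighted states."""
    scores = [sum(q * h for q, h in zip(query, step)) for step in encoded]
    weights = softmax(scores)
    dim = len(encoded[0])
    pooled = [sum(w * step[d] for w, step in zip(weights, encoded)) for d in range(dim)]
    return pooled, weights

def classify(pooled, class_weights):
    """Linear scores per class followed by softmax."""
    logits = [sum(w * x for w, x in zip(row, pooled)) for row in class_weights]
    return softmax(logits)

# Toy input; in a real system `query` and `class_weights` would be learned.
signal = [0.0, 0.2, 0.9, 0.1, 0.3, 0.3]
feats = extract_local_features(signal)
encoded = encode_series(feats)
pooled, weights = attention_pool(encoded, query=[1.0, 2.0])
probs = classify(pooled, class_weights=[[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
```

In this sketch, `weights` holds one value per time step, which is one possible reading of "weighting features at a specific point in time"; other readings of the limitation are not excluded.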

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 2-10 are rejected under 35 U.S.C. 103 as being unpatentable over Fan in view of Zadeh et al. (US-PGPUB 2020/0184278 A1 hereinafter “Zadeh”).

As for claim 2, Fan discloses everything claimed as applied above (see claim 1 above). However, Fan does not explicitly disclose that the classification unit further classifies gender. In the same field of communication technology, Zadeh discloses a method for speech recognition that converts digitized speech into feature vectors, and the feature vectors can be used to classify speaker gender, such as male-female identity, etc. (Fig.62, Digitized speech input module …-> Weight assigning module …-> Feature vectors storage, Fig.63, ¶ [1677], and ¶ [2869]).
Fan and Zadeh are analogous art because they are from the same field of endeavor and both teach speech classification. Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to use the known technique of classifying gender corresponding to the data indicating the speech. One of ordinary skill in the art would have recognized that the results of the combination were predictable since the use of that known technique provides the rationale to arrive at a conclusion of obviousness. See KSR International Co. v. Teleflex Inc., 82 USPQ2d 1385 (U.S. 2007).
	
As for claims 3-6, Fan and Zadeh disclose the classification unit further outputs values indicating probabilities of the classified emotion and gender, wherein the values indicating the probabilities are expressed by probability distributions (Fan – Figs.1-5, ¶ [21]-[22], and [72], and see Zadeh – Figs.1-297, ¶ [1604], [1628], [1677], [2869], 
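As a hedged illustration of what "values indicating the probabilities ... expressed by probability distributions" can look like in practice, the sketch below applies a softmax to two separate score vectors, one for emotion and one for gender. The label sets, function names, and input scores are hypothetical and are not drawn from Fan, Zadeh, or the application.

```python
import math

EMOTIONS = ["neutral", "happy", "sad", "angry"]  # hypothetical label set
GENDERS = ["female", "male"]                     # hypothetical label set

def softmax(logits):
    """Map raw scores to a probability distribution (non-negative, sums to 1)."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify_emotion_and_gender(emotion_logits, gender_logits):
    """Return one probability distribution per task, keyed by label."""
    emotion_probs = dict(zip(EMOTIONS, softmax(emotion_logits)))
    gender_probs = dict(zip(GENDERS, softmax(gender_logits)))
    return emotion_probs, gender_probs

# Made-up scores standing in for the outputs of a trained classifier.
emotion_probs, gender_probs = classify_emotion_and_gender(
    [0.1, 2.0, -1.0, 0.3], [0.5, 0.5]
)
```

Each returned dictionary is a distribution over its own label set, so the emotion and gender outputs can be read independently of one another.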

As for claims 7-10, Fan and Zadeh disclose that the encoding unit is realized in a form of a neural network that uses a bidirectional Long Short Term Memory (LSTM), wherein the generation unit is realized in a form of a neural network that uses a self-attention mechanism (Fan - Fig.3, Convolution Learning Network (CNN) 310, ¶ [36]-[37], and [48], vector encodings (codebook) may be obtained by training a specific learning, the learning network also being referred to as a neural network, and ¶ [45], attention to the related feature, such a mechanism may also be referred to as an "attention" mechanism, and see Zadeh – Figs.1-297, ¶ [1770], an autoencoder, e.g., a deep autoencoder, ¶ [1877], Long-short-term-memory (LSTM), a recurrent type neural network used to model data in time series, and ¶ [2384], focusing attention on features), and wherein the data indicating the speech is a spectrogram of the speech, and the data indicating the speech is a plurality of spectrograms obtained by dividing the .
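To make the term "bidirectional" concrete, the sketch below runs the same recurrence over a sequence once forward and once backward and pairs the two states at each time step. It deliberately substitutes a plain tanh recurrence for an LSTM cell to stay short, so it shows only the bidirectional wiring, not LSTM gating; the function names and the weights `w_in` and `w_rec` are hypothetical stand-ins for learned parameters.

```python
import math

def run_direction(frames, w_in=0.8, w_rec=0.5):
    """Simple tanh recurrence (NOT an LSTM): h_t = tanh(w_in * x_t + w_rec * h_{t-1})."""
    hidden = 0.0
    states = []
    for x in frames:
        hidden = math.tanh(w_in * x + w_rec * hidden)
        states.append(hidden)
    return states

def bidirectional_encode(frames):
    """Pair a forward-in-time state with a backward-in-time state for every frame."""
    forward = run_direction(frames)
    # Process the reversed sequence, then flip the result back into time order.
    backward = run_direction(frames[::-1])[::-1]
    return list(zip(forward, backward))

encoded = bidirectional_encode([0.1, 0.4, -0.2, 0.7])
```

Each element of `encoded` therefore carries context from both earlier and later frames, which is the property usually cited for bidirectional recurrent encoders.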

Claim 13 is rejected under 35 U.S.C. 103 as being unpatentable over SEO et al. (US-PGPUB 2017/0125020 A1 hereinafter “SEO”, U.S. Patent 9,886,957) in view of Fan, and in view of ARIK et al. (US-PGPUB 2019/0251952 A1 hereinafter ARIK).

As for claim 13, SEO discloses a non-transitory computer-readable storage medium storing a recognition model performed by an information processing apparatus (Figs.1-12, ¶ [28], a computer program stored in a computer-readable recording medium for voice recognition with a neural network model), the recognition model comprising:
a first layer that performs convolution on an input value that is data indicating a speech, and makes output (Fig.1, Voice Recognition System 100, Voice Input Unit 102, First Voice Recognition Unit 104, ¶ [55], a first layer such as a convolutional layer of the model, performed by the first voice recognition unit 104, for extracting features of the voice input data 102 and outputting the extracted features, and ¶ [73], convolution layer: a layer for extracting a convolution feature);
a second layer that extracts time-series features of the data indicating the speech with an output value of the first layer as an input value, and makes output (Fig.1, 100, Second Voice Recognition Unit 106, ¶ [56], a second layer such as a forward/backward layer that extracts/learns a time-series pattern (or temporal pattern), performed by the second voice recognition unit 106, for extracting a time-series correlation of the extracted features input from the first voice recognition unit 104 and outputting the results, and ¶ [77]);
a third layer (Fig.1, 100, Learning Control Unit 110) that generates information obtained by weighting features (Fig.3, Tuned Voice Data – Frequency -> Time) at a specific point in time (Fig.2, Dividing Voice Data – Voice Data -> Windows, Spectrogram On T Windows) associated with classification in an output value of the second layer, with the output value of the second layer as an input value, and makes output (Figs.1-3, Figure 2 depicts voice data “features” being divided into windows at a specific point in time, spectrogram on T windows, and ¶ [31], Figure 3 depicts an example of tuned voice data “features” according to the weighted value, ¶ [32], [68], and ¶ [51], [57], [62]-[63], [66]-[68], a third layer that generates information obtained by weighting features such as a weighted value, performed by the learning control unit 110, and at a specific point in time such as the dividing T windows, using a connectionist temporal classification (CTC) method); and
an output layer (Fig.1, Text Output Unit 108) that outputs values indicating probabilities based on an output value of the third layer (Figs.1-12, ¶ [56], represent the result as a probability, and ¶ [60]-[61]).
	SEO also discloses differentiating/identifying speakers based on the sounds included in the voice data, and thus gender is inherent in this speaker identification process (SEO – Figs.1-12, ¶ [57], the sounds uttered by a speaker A and the sounds uttered by a speaker B, classified as the label referred to as the candidate label). However, SEO is silent about emotion classification and the use of self-attention. In the same field of communication technology, Fan discloses emotional classification (Fan - Figs.1-5, ¶ [21]-[22], machine learning classification models using defined features, and emotional features using a sentiment dictionary to perform emotional classification), and attention to the related feature, a mechanism that may also be referred to as an "attention" mechanism (Fan – Figs.1-5, ¶ [45]). ARIK discloses using self-attention (ARIK – Fig.12, Self Attention 1230, ¶ [106], self-attention mechanism 1230 is used to compute the weights), and gender (ARIK – Figs.1-31, ¶ [159], speaker embedding space learned by the trained speaker encoder, in particular for gender, gender label while training).
SEO, Fan and ARIK are analogous art because they are from the same field of endeavor and all teach speech/voice recognition. Thus, it would have been obvious to one of ordinary skill in the art before the effective filing date of the invention to use the known techniques of emotion classification, gender classification, and self-attention. One of ordinary skill in the art would have recognized that the results of the combination were predictable since the use of those known techniques provides the rationale to arrive at a conclusion of obviousness. See KSR International Co. v. Teleflex Inc., 82 USPQ2d 1385 (U.S. 2007).
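The division of voice data into T windows attributed to SEO's Fig. 2 in the mapping above can be pictured with a short sketch. It assumes fixed-length, possibly overlapping windows taken along the time axis; the function name, the `window_size` and `hop` parameters, and the sample values are hypothetical and are not taken from SEO.

```python
def split_into_windows(samples, window_size, hop):
    """Slice a 1-D sequence into fixed-length windows, advancing `hop` samples each time.

    Trailing samples that do not fill a whole window are dropped.
    """
    if window_size <= 0 or hop <= 0:
        raise ValueError("window_size and hop must be positive")
    last_start = len(samples) - window_size
    return [samples[start:start + window_size]
            for start in range(0, last_start + 1, hop)]

# Ten samples, windows of four samples, advancing two samples per window.
windows = split_into_windows(list(range(10)), window_size=4, hop=2)
```

Each window corresponds to one point in time in the sequence, so any per-window weight acts as a weight on the features at that point in time.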

Conclusion
The prior art made of record, listed below and in the attached PTO-892 form, and not relied upon is considered pertinent to applicant's disclosure.
LI et al. (US-PGPUB 2022/0005493 A1) disclose detecting audio data that includes classification probability data, voiceprint feature data, etc., by a pre-trained network model (see Fig.1).
Sainath et al. (US-PGPUB 2016/0322055 A1) teach processing of audio waveforms for speech recognition using a neural network (see Fig.1A).


Any inquiry concerning this communication or earlier communications from the examiner should be directed to KHAI N NGUYEN whose telephone number is (571)270-3141. The examiner can normally be reached IFP.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, AHMAD MATAR can be reached on (571)272-7488. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197.

KHAI N. NGUYEN
Primary Examiner
Art Unit 2652



/Khai N. Nguyen/Primary Examiner, Art Unit 2652
01/26/2022