Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Information Disclosure Statement
The information disclosure statement (IDS) submitted on 09/24/2020 is being considered by the examiner.
Drawings
The drawing submitted on 06/03/2019 is being considered by the examiner.

Response to Amendment
Claims 1-20 are currently pending in the application. Claims 1, 5, and 13 are independent claims, and claims 1-3 and 5-20 are amended.
Response to Arguments
Applicant's arguments filed 05/06/2021 have been fully considered, but they are moot in view of the new ground(s) of rejection.
Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 5 and 13 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Kanai (US 2020/0342896 A1).
Regarding claims 5 and 13, Kanai teaches: A computer-implemented method comprising: receiving input audio data representing a first input; determining first feature vector (acoustic feature amount or vector) representing acoustic features of at least a portion of the input audio data; processing the first feature data using a model (a specific acoustic feature amount corresponding to a specific emotion, i.e., an acoustic feature amount defined in advance for each emotion) to determine a sentiment (emotion) corresponding to the first input ([0066] In the voice recognition, as illustrated in FIG. 3, after a voice input part 111 receives an input of a voice waveform, a feature amount extraction part 112 extracts a feature amount of the input voice waveform. The feature amount is an acoustic feature amount defined in advance for each emotion, and includes, for example, a pitch (fundamental frequency), loudness (sound pressure level (power)), duration, formant frequency, and spectrum of the voice. The extracted feature amount is passed to a recognition decoder 113. The recognition decoder 113 converts the feature amount into a text using an acoustic model 114 and a language model 115. The recognition decoder 113 uses the acoustic model 114 and the language model 115 corresponding to the recognized emotion. A recognition result output part 116 outputs the text data converted by the recognition decoder 113 as a recognition result.), wherein the model is trained using acoustic feature data (acoustic model) and lexical feature data (language model) (an acoustic feature corresponding to a predefined emotion, which corresponds to acoustic model and language model data as illustrated in Table 3); and using the first model output data to determine a sentiment corresponding to the first utterance ([0061] For this reason, in this embodiment, both the acoustic model and the language model are created for each emotion by machine learning or deep learning using the neural network. In the learning for creating the acoustic model and the language model, for example, data in which the voices of various emotions of various people are associated with correct texts is used as the teacher data. [0064] The acoustic model and the language model are used corresponding to the emotions in S14 and S15 described above. Specifically, for example, when an emotion of anger is recognized, the acoustic model 1 and the language model 1 are used. [0070] In a first modification of the embodiment (hereinafter, a first modification), emotions are recognized from voices. [0071] …when voice data for one second is collected after the utterance of the speaker, switching to emotion recognition from the voice data is performed. Thus, the voice data of the speaker is collected, and the emotion of the speaker is recognized only from the voice data.).
Note: It is inherent that the predefined feature amounts corresponding to an emotion, as illustrated in Table 3, which are trained by a statistical classifier or neural network, include the acoustic model and the language model. The accuracy of recognizing a user's emotional voice and/or converting it to accurate text depends on processing through an accurate acoustic model and language model corresponding to a specific emotion (see [0009]), and an accurate acoustic model and language model depend on accurate emotion determination, which in turn depends on the predefined acoustic feature amounts trained by the statistical classifier or neural network. Therefore, each predefined acoustic feature amount inherently corresponds to the respective predefined acoustic model and language model that are used to train the statistical classifier or the neural network for determining the predefined emotion illustrated in Fig. 3.
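Purely as an illustration of the per-emotion model selection quoted from Kanai [0064] above, the following minimal Python sketch looks up an acoustic model and language model pair from a recognized emotion. Only the pairing of anger with acoustic model 1 and language model 1 comes from the quoted text; the names `ModelPair`, `MODELS_BY_EMOTION`, `DEFAULT_MODELS`, and `select_models`, the "joy" row, and the fallback pair are hypothetical and are not taken from Kanai or from the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPair:
    """An acoustic model and a language model that are used together."""
    acoustic_model: str
    language_model: str


# Only the "anger" row is stated in the quoted paragraph [0064].
# The "joy" row is a hypothetical placeholder added for illustration.
MODELS_BY_EMOTION = {
    "anger": ModelPair("acoustic model 1", "language model 1"),
    "joy": ModelPair("acoustic model 2", "language model 2"),  # hypothetical
}

# Hypothetical fallback for an emotion that has no dedicated pair.
DEFAULT_MODELS = ModelPair("default acoustic model", "default language model")


def select_models(recognized_emotion: str) -> ModelPair:
    """Return the model pair associated with the recognized emotion."""
    return MODELS_BY_EMOTION.get(recognized_emotion, DEFAULT_MODELS)
```

For example, `select_models("anger")` returns the pair named acoustic model 1 and language model 1, matching the example given in [0064].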

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claims 7 and 15 are rejected under 35 U.S.C. 103 as being unpatentable over Kanai in view of Rose et al. (US 2010/0169159 A1).
Regarding claims 7 and 15, Kanai teaches: The computer-implemented method of claim 5, further comprising: determining text data corresponding to the at least a portion of the input audio data; determining time data indicating when the at least a portion is received by a device; and sending the output data to the device ([0046] In order to perform a text conversion in real time, it is preferable that all the voice recognition models for respective emotions are read from the HDD 14 to the RAM 12 when the conference support program is started. However, if the HDD 14 or other non-volatile memory that stores the voice recognition model can be read at a high speed enough to support real-time subtitle display, the voice recognition model corresponding to the recognized emotion may be read from the HDD 14 or other nonvolatile memories in step S14. [0048] Subsequently, the CPU 11 displays the text of the text data on the display 101 of the first computer 10 as subtitles, and transmits the text data from the communication interface 15 to the second computer 20 (S16). The communication interface 15 serves as an output part when the text data is transmitted to the second computer 20. The second computer 20 displays the text of the received text data on its own display 101 as subtitles. [0074] In this emotion recognition method, as illustrated in FIG. 4, a low-level descriptors (LLD) is calculated from an input voice. The LLD is a pitch (fundamental frequency), loudness (power), or the like of the voice. Since the LDD is obtained as a time series, various statistics are calculated from the LLD. The statistics are, specifically, an average value, a variance, a slope, a maximum value, a minimum value, and the like. The input voice becomes a feature amount vector by calculating the statistics. The feature amount vector is recognized as an emotion by a statistical classifier or a neural network (estimated emotion illustrated in the drawing).
[0087] In this second modification configured in this manner, a conference in which three bases X, Y, and Z are connected is made possible, and the subtitles properly voice-recognized according to the emotion of the speaker are displayed in each of the user terminals 30X, 30Y and 30Z.).
Kanai does not teach: generating output data including the text data, the time data, and an indicator of the sentiment.
However, “generating output data including the text data, the time data, and an indicator of the sentiment” is well known in the art.
For example, Rose et al. teach: generating output data including the text data (communication), the time data, and an indicator of the sentiment; and displaying the output data using the user device ([0046] Once the processor has assigned a sentiment indicator to the communication, the process 200 continues when the processor displays in a graphical user interface the communication, the sentiment indicator associated with the communication, the date and time when the communication was made along with a user identity of the person in the social media (step 230)).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Kanai to include the teaching of Rose et al. above in order to display emotion-indicative data within a conversation.
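For readers unfamiliar with the feature amount vector described in Kanai [0074] quoted above, the following minimal Python sketch computes the listed statistics (average value, variance, slope, maximum value, minimum value) over two low-level descriptor time series and concatenates them into one vector. The function names, the choice of population variance, the least-squares slope against the frame index, and the ordering of the vector are assumptions made for illustration; Kanai does not specify them.

```python
from statistics import fmean, pvariance


def slope(series):
    """Least-squares slope of the series against its frame index (assumed definition)."""
    n = len(series)
    t_mean = (n - 1) / 2
    y_mean = fmean(series)
    num = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
    den = sum((t - t_mean) ** 2 for t in range(n))
    # A single-frame series has no defined slope; return 0.0 in that case.
    return num / den if den else 0.0


def lld_statistics(series):
    """Average, variance, slope, maximum, and minimum of one LLD time series."""
    return [fmean(series), pvariance(series), slope(series), max(series), min(series)]


def feature_vector(pitch_hz, loudness_db):
    """Concatenate the statistics of the pitch and loudness series into one vector."""
    return lld_statistics(pitch_hz) + lld_statistics(loudness_db)
```

Under these assumptions, a pitch series of 100, 110, 120, 130 Hz with a constant loudness yields a ten-element vector whose pitch slope is 10 per frame and whose loudness variance and slope are both zero. Such a vector would then be passed to a statistical classifier or neural network, which is outside the scope of this sketch.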

Claims 9 and 17 are rejected under 35 U.S.C. 103 as being unpatentable over Kanai in view of Hori et al. (US 2019/0189115 A1).
Regarding claims 9 and 17, Kanai does not specifically teach: training a bi-directional LSTM using training data to determine the model.
Hori et al. teach: attention-based end-to-end speech recognition with a pre-trained BLSTM ([0035] According to embodiments of the present disclosure, it becomes possible to improve the recognition accuracy of end-to-end ASR by introducing the LM module 210. The LM module 210 may include a character-level recurrent neural network (RNN) and a word-level RNN. In some cases, the LM module 210 may be referred to as a hybrid network or a hybrid network module. In this case, the LM module 210 computes the LM probabilities using the character-level LM defined by character LM parameters 211 and the word-level LM defined by word LM parameters 212. The LM module also makes it possible to perform open-vocabulary speech recognition, i.e., even if OOV words are spoken, they are recognized by using the both character-level and word-level LMs. In the decoding process of the present invention, character sequence hypotheses are first scored with the character-level LM probabilities until a word boundary is encountered. Known words are then re-scored using the word-level LM probabilities, while the character-level LM provides LM probability scores for OOV words. [0038] In end-to-end speech recognition, p(Y|X) is computed by a pre-trained neural network without pronunciation lexicon and language model. In the attention-based end-to-end speech recognition of a related art, the neural network consists of an encoder network and a decoder network. [0039] An encoder module 102 includes an encoder network used to convert acoustic feature sequence X = x_1, ..., x_T into hidden vector sequence H = h_1, ..., h_T as H = Encoder(X) (2), where function Encoder(X) may consist of one or more recurrent neural networks (RNNs), which are stacked. An RNN may be implemented as a Long Short-Term Memory (LSTM), which has an input gate, a forget gate, an output gate and a memory cell in each hidden unit. Another RNN may be a bidirectional RNN (BRNNs) or a bidirectional LSTM (BLSTM). A BLSTM is a pair of LSTM RNNs, one is a forward LSTM and the other is a backward LSTM. A Hidden vector of the BLSTM is obtained as a concatenation of hidden vectors of the forward and backward LSTMs.).
Therefore, it would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Kanai to include the teaching of Hori et al. above in order to improve recognition accuracy.
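The BLSTM structure quoted from Hori et al. [0039] above (a forward LSTM and a backward LSTM whose hidden vectors are concatenated) can be illustrated with the following minimal Python sketch. It is an untrained, pure-Python toy and not Hori's implementation: the class name `LSTMCell`, the function `blstm_encode`, the random weight initialization, and the omission of bias terms are all assumptions made to keep the example short.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


class LSTMCell:
    """Toy LSTM with input, forget, and output gates and a memory cell (no biases)."""

    def __init__(self, input_size, hidden_size, seed):
        rng = random.Random(seed)
        n = input_size + hidden_size
        # One weight row per hidden unit for each of: input gate (i),
        # forget gate (f), output gate (o), and candidate value (g).
        self.w = {
            name: [[rng.uniform(-0.5, 0.5) for _ in range(n)]
                   for _ in range(hidden_size)]
            for name in "ifog"
        }
        self.hidden_size = hidden_size

    def run(self, xs):
        """Return the hidden vector at every time step of the sequence xs."""
        h = [0.0] * self.hidden_size
        c = [0.0] * self.hidden_size
        outputs = []
        for x in xs:
            z = list(x) + h  # current input joined with previous hidden vector
            new_h, new_c = [], []
            for k in range(self.hidden_size):
                pre = {name: sum(w * v for w, v in zip(self.w[name][k], z))
                       for name in "ifog"}
                i, f, o = sigmoid(pre["i"]), sigmoid(pre["f"]), sigmoid(pre["o"])
                ck = f * c[k] + i * math.tanh(pre["g"])  # memory cell update
                new_c.append(ck)
                new_h.append(o * math.tanh(ck))
            h, c = new_h, new_c
            outputs.append(h)
        return outputs


def blstm_encode(xs, forward, backward):
    """Concatenate forward and backward hidden vectors at each time step."""
    fwd = forward.run(xs)
    bwd = backward.run(xs[::-1])[::-1]  # run on reversed input, then realign
    return [f + b for f, b in zip(fwd, bwd)]
```

With a hidden size of 2 per direction, each encoded time step is a 4-element vector: the first two elements summarize the input up to that step, and the last two summarize the input from that step to the end of the sequence.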
Allowable Subject Matter
Claims 1-4 are allowed.
The prior art of record, alone or in combination, fails to teach the limitations of claim 1, specifically the training of the emotion model and the recognition of the emotion category of the second utterance based on that model: “training a first machine learning model, using the acoustic feature vector and the indication, to determine an acoustic-based model configured to detect a sentiment category associated with audio data; training a second machine learning model, using the word embedding feature vector and the indication, to determine a lexical-based model configured to detect a sentiment category associated with text data; determining, using the first acoustic-based model and first loss value associated with the lexical-based model, a combined sentiment detection model; and storing the combined sentiment detection model; and during a second time period after the first time period: receiving input audio data representing a second utterance; determining a first feature vector representing acoustic features of at least a portion of the input audio data; and processing the first feature vector using the combined sentiment detection model to determine a likelihood the second utterance corresponds to a first sentiment category.”
Claims 6, 8, 10-12, 14, 16 and 18-20 are objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Conclusion
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. The prior art of record Cunico et al. (US 2016/0042226 A1) teach: [0019] In the exemplary embodiment, sentiment analysis program 160 provides a real-time or near real-time sentiment analysis of the attendee in the video conference to a meeting moderator for display on moderator display 125. The attendee's sentiment which may be a graphical representation of the attendee's sentiment, for example, a color coded sentiment bar with a sliding indicator to show an attendee's real-time sentiment as depicted in FIG. 3 may be displayed on moderator display 125. In addition, in an embodiment, sentiment analysis program 160 determines sentiment analysis for each of the attendees in a video conference and compiles an aggregate real-time sentiment of the video conference. The real-time aggregate sentiment representing an average sentiment for all of the video conference attendees may be displayed on moderator display 125 as a graphical representation of the aggregate real-time sentiment.
The prior art of record Goel et al. (US 2018/0308487 A1) teach: [0032] In an aspect of present invention a system for providing real-time transcripts of spoken text is disclosed. The system comprising: a speech to text engine for converting an input speech of an end user into a text input, the text input comprises one or more sequence of recognized word strings or a word lattice in text form; a semantic engine to receive the text input for producing one or more transcripts using a language model and extracting semantic meanings for said one or more transcript; wherein the semantic engine utilizes a grammar model and the language model to extract meaning for said one or more transcripts.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to MOHAMMAD K ISLAM whose telephone number is (571)270-5878. The examiner can normally be reached on Monday-Friday, EST (IFP).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Bhavesh Mehta, can be reached at 571-272-7453. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.






/MOHAMMAD K ISLAM/Primary Examiner, Art Unit 2656