Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Status of Claims
The present application is being examined under the claims filed on 02/21/2020.
Claims 1-20 are rejected.
Claims 1-20 are pending.
Information Disclosure Statement
The information disclosure statement (IDS) was submitted on 0.  The submission is in compliance with the provisions of 37 CFR 1.97.  Accordingly, the information disclosure statement is being considered by the examiner.
Drawings
The drawings were submitted on 02/21/2020 in compliance with all requirements.  Accordingly, they are being considered by the examiner in their entirety.
Specification
The specification was submitted on 02/21/2020 in compliance with all requirements.  Accordingly, it is being considered by the examiner in its entirety.
Claim Rejections - 35 USC § 112(a)
The following is a quotation of the first paragraph of 35 U.S.C. 112(a):
(a) IN GENERAL.—The specification shall contain a written description of the invention, and of the manner and process of making and using it, in such full, clear, concise, and exact terms as to enable any person skilled in the art to which it pertains, or with which it is most nearly connected, to make and use the same, and shall set forth the best mode contemplated by the inventor or joint inventor of carrying out the invention.

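Claims 2-6 and 16-20 are rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA), first paragraph, as failing to comply with the written description requirement. The claim(s) contains subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA 35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention.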

Regarding Claim 2:
Claim 2 is rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA), first paragraph, as failing to comply with the written description requirement. The claim contains subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA 35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention.  Specifically, “a second video processing part of the machine learning system”, “a second acoustic processing part of the machine learning system”, and “a second textual processing part of the machine learning system” each recite a separate part, section, or algorithm of the machine learning system that is to be applied to the second utterance representation data.  However, the examiner is unable to ascertain where this second part of the machine learning system is described in the specification.  It appears that the second processing parts of the machine learning system are a second processing iteration of the machine learning system, and they are interpreted as such in the 103 rejections below.  However, as written, the claim lacks sufficient written description if these limitations are meant to recite a separate processing part rather than a separate processing iteration.

Regarding Claims 3-6:
Claims 3-6 are also rejected under 112(a) as they inherit the limitations from claim 2 that lack sufficient written description.

Regarding Claim 16:
Claim 16 is rejected under 35 U.S.C. 112(a) or 35 U.S.C. 112 (pre-AIA), first paragraph, as failing to comply with the written description requirement. The claim contains subject matter which was not described in the specification in such a way as to reasonably convey to one skilled in the relevant art that the inventor or a joint inventor, or for applications subject to pre-AIA 35 U.S.C. 112, the inventor(s), at the time the application was filed, had possession of the claimed invention.  Specifically, “a second video processing part of the machine learning system” and “a second acoustic processing part of the machine learning system” each recite a separate part, section, or algorithm of the machine learning system that is to be applied to the second utterance representation data.  However, the examiner is unable to ascertain where this second part of the machine learning system is described in the specification.  It appears that the second processing parts of the machine learning system are a second processing iteration of the machine learning system, and they are interpreted as such in the 103 rejections below.  However, as written, the claim lacks sufficient written description if these limitations are meant to recite a separate processing part rather than a separate processing iteration.

Regarding Claims 17-20:
Claims 17-20 are also rejected under 112(a) as they inherit the limitations from claim 16 that lack sufficient written description.


Claim Rejections - 35 USC § 112(b)
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


Claims 2-6, 8-14, and 16-20 are rejected under 35 U.S.C. 112(b) as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor regards as the invention. 

Regarding Claim 2:
There is insufficient antecedent basis for the following limitations: “the video data of the second utterance representation”, “the acoustic data of the second utterance representation”, and “the text data of the second utterance representation”.

Regarding Claims 3-6:
These claims are also rejected under 112(b) as they inherit the indefiniteness issues from their respective parent claims.

Regarding Claim 8:
There is insufficient antecedent basis for the following limitation: “the first utterance output”.  

Regarding Claim 9:
There is insufficient antecedent basis for the following limitations: “the video data of the second utterance representation”, “the acoustic data of the second utterance representation”, and “the text data of the second utterance representation”.

Regarding Claims 10-14:
These claims are also rejected under 112(b) as they inherit the indefiniteness issues from their respective parent claims.

Regarding Claim 16:
There is insufficient antecedent basis for the following limitations: “the video data of the second utterance representation”, and “the acoustic data of the second utterance representation”.  

Regarding Claims 17-20:
These claims are also rejected under 112(b) as they inherit the indefiniteness issues from their respective parent claims.

Claim Rejections - 35 USC § 101
35 U.S.C. 101 reads as follows:
Whoever invents or discovers any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof, may obtain a patent therefor, subject to the conditions and requirements of this title.

Claims 15-20 are rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter.

Regarding Claim 15:
Claim 15 is rejected under 35 U.S.C. 101 because the claimed invention is directed to non-statutory subject matter.  


Regarding Claims 16-20:
Dependent claims 16-20 are also rejected under 35 U.S.C. 101 for being directed to data signals per se, for the same reasons as claim 15.

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:


(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

Claims 1, 8, and 15 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by Cheng et al. (US 10679063 B2), hereinafter Cheng.  
Regarding Claim 1:
Cheng teaches a method to generate a conversation analysis, the method comprising: 
receiving multiple utterance representations,  (Cheng discloses “multimedia input 102, including audio and text, in addition to the more typical visual features” (Cheng, Fig. 2 and related text, 11:61-62).  Cheng further discloses that “the audio feature detection module 214 extracts speech features, such as prosodic features, from the audio signal” (Cheng, 12:14-16), that “the extracted audio features may individually or collectively be used to detect audio concepts and/or to identify individual speakers”, that “techniques involve extracting frame-wise low-level descriptors”, and to “aggregate these features over the utterance level” (Cheng, 12:21-24, 39-42).)
wherein each utterance representation represents a portion of a conversation performed by at least two users, and wherein each utterance representation is associated with video data, acoustic data, and text data; and  (Cheng’s learning-based multimodal analysis system and method discloses “multimedia input 102, including audio and text, in addition to the more typical visual features” (Cheng, Fig. 2 and related text, 11:61-62).  Further, this video, acoustic, and text data are associated with utterances which include portions of a conversation performed by at least two users, since Cheng discloses that “the extracted audio features may individually or collectively be used to detect audio concepts and/or to identify individual speakers”, that “techniques involve extracting frame-wise low-level descriptors”, and to “aggregate these features over the utterance level” (Cheng, 12:21-24, 39-42).  Cheng also discloses that these three media forms are associated as such “Any video of the input 
generating a first utterance output by applying a first utterance representation, that is associated with a first user and that is of the multiple utterance representations, to a machine learning system, wherein generating the first utterance output includes:  (Cheng’s multimedia data, segmented by association with utterances of a conversation, is processed through a multimedia content understanding module 104, in which a machine learning system processes a first utterance representation, and which includes a “salient event learning module 158, and a template learning module 160.  The learning modules 152 execute machine learning algorithms” (Cheng, Fig. 1 and 2 and related text, 7:44-49).  After processing through the multimedia content understanding module, these utterance representations are sent to the output generator module 114.  Cheng is thereby generating utterance outputs, including a first utterance output, by applying utterance representations (including a first utterance representation), associated with two or more users (including a first user), to a machine learning system.)
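applying video data of the first utterance representation to a first video processing part of the machine learning system to generate first video-based output;  (“The visual feature detection module 212 analyses each segment 204 using the visual feature models 236, and outputs a set of visual features 220 that have been detected in the segment 204. To do this, the visual feature detection module 212 employs a number of automated feature recognition algorithms” (Cheng, Figure 2 and related text, 10:52-57).  Further, “The learning modules 152 execute machine learning algorithms on samples of multimedia content (images and/or video) of an image/video collection” (Cheng, Fig. 2 and related text, 7:48-50).  Since the learning modules are located in the multimedia content understanding module, Cheng is applying video data of the first segment of video features (first utterance representation) to a first video processing part of the machine learning system to generate first video-based output.)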
applying acoustic data of the first utterance representation to a first acoustic processing part of the machine learning system to generate first acoustic-based output;  (“[For] example, the audio feature detection module 214 may detect an acoustic characteristic of the audio track of a certain segment of an input video 102, and, with the audio feature model (which may be trained using machine learning techniques)” (Cheng, Fig. 2 and related text, 12:2-6).  Cheng is thereby applying acoustic data of the first multimedia segment (utterance representation) to a first acoustic processing part of the machine learning system to generate first acoustic-based output.)
applying text data of the first utterance representation to a first textual processing part of the machine learning system to generate first text-based output; and  (Cheng’s system “can detect the presence of a variety of different types of multimedia features in the multimedia input 102, including audio and text […] Any or all of these features may be detected using detectors that are trained via machine-learning techniques” (Cheng, 11:60-65).  Cheng discloses “The models 134, 136 correlate semantic descriptions of audio, visual, text, etc. features and concepts with instances or combinations of output of the algorithms 130 that evidence those features and concepts. For example, the feature models 134 may define relationships between sets of low-level features detected by the algorithms 130 
generating the first utterance output by combining data that is based on the first video-based output, the first acoustic-based output, and the first text-based output.  (“The output generator module 114 and its submodules, a visual presentation generator module 116 and a natural language generator module 118, are each embodied as software, firmware, hardware, or a combination thereof.” (Cheng, Fig. 2 and related text, 6:35-39).  Further, Cheng discloses “presentation templates 142 provide the specifications that the output generator module 114 uses”, including “the order in which to arrange the segments”, “the accompanying audio or text, and/or other aspects of the visual presentation 120 […] e.g., chronological” (Cheng, Fig. 2 and related text, 7:16-27).  Cheng is thereby generating the first utterance output in a chronological sequence, by combining data that is based on the first video-based output, the first acoustic-based output, and the first text-based output.)
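For illustration only, the structure recited in the limitations of claim 1 above (one processing part per modality, followed by a combining step) can be sketched as below. This is a minimal sketch under stated assumptions and is not a characterization of Cheng's modules or of the applicant's disclosed implementation: every name is hypothetical, and the per-modality functions are trivial stand-ins for trained models.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class UtteranceRepresentation:
    """Hypothetical container: one speaker's portion of a conversation."""
    user_id: str
    video: List[float]      # stand-in for per-frame visual features
    acoustic: List[float]   # stand-in for per-frame audio features
    text: str               # stand-in for a transcript of the utterance


def video_part(frames: List[float]) -> List[float]:
    # Stand-in for a video processing part: mean of the frame features.
    return [sum(frames) / len(frames)] if frames else [0.0]


def acoustic_part(samples: List[float]) -> List[float]:
    # Stand-in for an acoustic processing part: range of the audio features.
    return [max(samples) - min(samples)] if samples else [0.0]


def text_part(text: str) -> List[float]:
    # Stand-in for a textual processing part: whitespace-separated word count.
    return [float(len(text.split()))]


def generate_utterance_output(u: UtteranceRepresentation) -> List[float]:
    # Combine the three modality outputs by simple concatenation.
    return video_part(u.video) + acoustic_part(u.acoustic) + text_part(u.text)
```

Concatenation is only one possible way of "combining data"; the claim language as written does not appear to limit the combining step to any particular operation.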


Regarding Claim 8:
Cheng teaches a computing system for generating conversation analysis indicators for a conversation performed by at least two users, the computing system comprising: 
one or more processors;  (“Embodiments may also be implemented as instructions stored using one or more machine-readable media, which may be read and executed by one or more processors” (Cheng, 25:44-47).)
and one or more memories storing computer-executable instructions that, when executed by the one or more processors, cause the computing system to perform operations comprising:  (“Embodiments may also be implemented as instructions stored using one or more machine-readable media, which may be read and executed by one or more processors” (Cheng, 25:44-47).)
applying video data of a first utterance representation from the conversation to a first video processing part of the machine learning system to generate first video-based output;  (“The visual feature detection module 212 analyses each segment 204 using the visual feature models 236, and outputs a set of virtual features 220 that have been detected in the segment 204. To do this, the visual feature detection module 212 employs a number of automated feature recognition algorithms” (Cheng, Figure 2 and related text, 10:52-57).   Further, “The learning modules 152 execute machine learning algorithms on samples of multimedia content (images and/or video) of an image/video collection” (Cheng, Fig. 2 and related text, 7:48-50).  Since the learning modules are located in the multimedia content understanding module, Cheng is applying video data of the first segment of video features (first utterance representation) to a first video processing part of the machine learning system to generate first video-based output. )
applying acoustic data of the first utterance representation to a first acoustic processing part of the machine learning system to generate first acoustic-based output;  (“[For] example, the audio feature detection module 214 may detect an acoustic characteristic of the audio track of a certain segment of an input video 102, and, with the audio feature model (which may be trained using machine learning techniques)” (Cheng, Fig. 2 and related text, 12:2-6).  Cheng is thereby applying acoustic data of the first multimedia segment (utterance representation) to a first acoustic processing part of the machine learning system to generate first acoustic-based output.)
applying text data of the first utterance representation to a first textual processing part of the machine learning system to generate first text-based output; and  (Cheng’s system “can detect the presence of a variety of different types of multimedia features in the multimedia input 102, including audio and text […] Any or all of these features may be detected using detectors that are trained via machine-learning techniques” (Cheng, 11:60-65).  Cheng discloses “The models 134, 136 correlate semantic descriptions of audio, visual, text, etc. features and concepts with instances or combinations of output of the algorithms 130 that evidence those features and concepts. For example, the feature models 134 may define relationships between sets of low-level features detected by the algorithms 130 with semantic descriptions of those sets of audio, visual, text, etc. features (e.g., visual concept descriptions such as "object," "person," "face," "ball," "vehicle," and audio concept descriptions such as "happy", "annoyed," "excited," "calm," etc.)” (Cheng, Fig. 1, 2, 5 and related text, 5:11-20).  Cheng’s learning-based multimodal analysis system discloses three media modes, and to “aggregate these features over the utterance level”, further recommending that “the extracted audio features may individually or collectively be used to detect audio concepts and/or to identify individual speakers” (Cheng, 12:23-24, 40-42).  Cheng is thereby associating the input text media’s features with other media, which help provide semantic descriptions of the emotional content of the conversation for those sets of data.  Cheng is thereby using text data to represent utterances which may include portions of a conversation performed by at least two users, and applying it to machine learning feature detection processing to generate an output of semantic descriptions.)
generating the first utterance output by combining data that is based on the first video-based output, the first acoustic-based output, and the first text-based output.  (“The output generator module 114 and its submodules, a visual presentation generator module 116 and a natural language generator module 118, are each embodied as software, firmware, hardware, or a combination thereof.” (Cheng, Fig. 2 and related text, 6:35-39).  Further, Cheng discloses “presentation templates 142 provide the specifications that the output generator module 114 uses”, including “the order in which to arrange the segments”, “the accompanying audio or text, and/or other aspects of the visual presentation 120 […] e.g., chronological” (Cheng, Fig. 2 and related text, 7:25-27).  Cheng is thereby generating the first utterance output in a chronological sequence, by combining data that is based on the first video-based output, the first acoustic-based output, and the first text-based output.)

Regarding Claim 15:
Cheng teaches a computer-readable storage medium storing instructions that, when executed by one or more processors of a computing system, cause the computing system to perform actions comprising:  (“Embodiments may also be implemented as instructions stored using one or more machine-readable media, which may be read and executed by one or more processors” (Cheng, 25:44-47).  In this case, Cheng is storing computer processor instructions on (computer-readable) storage media, which causes the computer system to perform actions.)
receiving a first utterance representation from a conversation performed by at least two users;  (Cheng’s video, acoustic, and text data are associated with utterances which include portions of a conversation performed by at least two users, since Cheng discloses that “the extracted audio features may individually or collectively be used to detect audio concepts and/or to identify individual speakers”, that “techniques involve extracting frame-wise low-level descriptors”, and to “aggregate these features over the utterance level” (Cheng, 12:21-24, 39-42).  Cheng also discloses that these three 
generating a first utterance output by applying the first utterance representation to a machine learning system, wherein the generating the first utterance output includes:  (Cheng’s multimedia data, segmented by association with utterances of a conversation, is processed through a multimedia content understanding module 104, in which a machine learning system processes a first utterance representation, and which includes a “salient event learning module 158, and a template learning module 160.  The learning modules 152 execute machine learning algorithms” (Cheng, Fig. 1 and 2 and related text, 7:47-49).  After processing through the multimedia content understanding module, these utterance representations are sent to the output generator module 114.  Cheng discloses “presentation templates 142 provide the specifications that the output generator module 114 uses”, including “the order in which to arrange the segments”, “the accompanying audio or text, and/or other aspects of the visual presentation 120 […] e.g., chronological” (Cheng, Fig. 2 and related text, 7:25-27).  Cheng is thereby generating utterance outputs, including a first utterance output, by applying utterance representations (including a first utterance representation) to a machine learning system.)
applying video data of the first utterance representation to a first video processing part of the machine learning system to generate first video-based output;  (“The visual feature detection module 212 analyses each segment 204 using the visual feature models 236, and outputs a set of visual features 220 that have been detected in the segment 204. To do this, the visual feature detection module 212 employs a number of automated feature recognition algorithms” (Cheng, Figure 2 and related text, 10:52-57).  Further, “The learning modules 152 execute machine learning algorithms on samples of multimedia content (images and/or video) of an image/video collection” (Cheng, Fig. 2 and related text, 7:48-50).  Since the learning modules are located in the multimedia content understanding module, Cheng is applying video data of the first segment of video features (first utterance representation) to a first video processing part of the machine learning system to generate first video-based output.)
applying acoustic data of the first utterance representation to a first acoustic processing part of the machine learning system to generate first acoustic-based output;  (“[For] example, the audio feature detection module 214 may detect an acoustic characteristic of the audio track of a certain segment of an input video 102, and, with the audio feature model (which may be trained using machine learning techniques)” (Cheng, Fig. 2 and related text, 12:2-6).  Cheng is thereby applying acoustic data of the first multimedia segment (utterance representation) to a first acoustic processing part of the machine learning system to generate first acoustic-based output.)
generating the first utterance output by combining data that is based on the first video-based output and the first acoustic-based output.  (“The output generator module 114 and its submodules, a visual presentation generator module 116 and a natural language generator module 118, are each embodied as software, firmware, hardware, or a combination thereof.” (Cheng, Fig. 2 and related text, 6:35-39).  Further, Cheng discloses “presentation templates 142 provide the specifications that the output generator module 114 uses”, including “the order in which to arrange the segments”, “the accompanying audio or text, and/or other aspects of the visual presentation 120 […] e.g., chronological” 

Claim Rejections - 35 USC § 103
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains.  Patentability shall not be negated by the manner in which the invention was made.

Claims 2-7, 9-14, and 16-20 are rejected under 35 U.S.C. 103 as being unpatentable over Cheng et al. (US 10679063 B2), hereinafter Cheng, in view of Wooters et al. (US 2021/0158812 A1), hereinafter Wooters.

Regarding Claim 2:
Cheng teaches:
The method of claim 1 further comprising: generating a second utterance output by applying a second utterance representation, of the multiple utterance representations, to the machine learning system, wherein the second utterance representation is associated with a second user and corresponds to a first time window that also corresponds to the first utterance representation, and wherein generating the second utterance output includes:  (Cheng’s multimedia data, segmented by association with utterances of a conversation, is processed through a multimedia content understanding module 104, in which a machine learning system processes multiple utterance representations, and which includes a “salient event learning module 158, and a template learning module 160.  The learning modules 152 
applying the video data of the second utterance representation to a second video processing part of the machine learning system to generate second video-based output,  (“The visual feature detection module 212 analyses each segment 204 using the visual feature models 236, and outputs a set of visual features 220 that have been detected in the segment 204.  To do this, the visual feature detection module 212 employs a number of automated feature recognition algorithms” (Cheng, Figure 2 and related text, 10:52-57).  Since Cheng directs to “aggregate these features over the utterance level” (Cheng, 12:24) and is associating the time windows of the input text media’s features with other media to provide semantic descriptions of the conversation segments, as noted in claim 1, it follows that the utterance level is initially multimodal and not yet defined as being an individual speaker.  Further, “The learning modules 152 execute machine learning algorithms on samples of multimedia content (images and/or video) of an image/video collection” (Cheng, Fig. 2 and related text, 7:48-50).  Since the learning modules are located in the multimedia content understanding module before the media data is sent to the output generator, Cheng is applying video data of the second segment of video features (second 
applying the acoustic data of the second utterance representation to a second acoustic processing part of the machine learning system to generate second acoustic-based output,  (“[For] example, the audio feature detection module 214 may detect an acoustic characteristic of the audio track of a certain segment of an input video 102, and, with the audio feature model (which may be trained using machine learning techniques)” (Cheng, Fig. 2 and related text, 12:2-6).  Since Cheng directs to “aggregate these features over the utterance level” (Cheng, 12:24) and is associating the time windows of the input text media’s features with other media to provide semantic descriptions of the conversation segments, as noted in claim 1, it follows that the utterance level is initially multimodal and not yet defined as being an individual speaker.  Cheng is thereby applying a second utterance representation, that is associated with a second user, to a second part of the machine learning system, to generate a second utterance output.)
applying the text data of the second utterance representation to a second textual processing part of the machine learning system to generate second text-based output, and  (Cheng’s system “can detect the presence of a variety of different types of multimedia features in the multimedia input 102, including audio and text, in addition to the more typical visual features […] Any or all of these features may be detected using detectors that are trained via machine-learning techniques” (Cheng, 11:60-65).  “The text feature detection module 216 interfaces with an automated speech recognition (ASR) system and/or a video optical character recognition (OCR) system” (Cheng, Fig. 2 and related text, 13:5-7).  Since Cheng directs to “aggregate these features over the utterance level” (Cheng, 12:24) and is associating the time windows of multiple media’s features to provide semantic descriptions of the conversation segments, as noted in claim 1, it follows that the utterance level defined here is initially multimodal and not yet defined as being an individual speaker.  Cheng later further discloses that “the 
generating the second utterance output by combining data that is based on the second video-based output, the second acoustic-based output, and the second text-based output; and  (“The output generator module 114 and its submodules, a visual presentation generator module 116 and a natural language generator module 118, are each embodied as software, firmware, hardware, or a combination thereof.” (Cheng, Fig. 2 and related text, 6:35-39).  Further, Cheng discloses “presentation templates 142 provide the specifications that the output generator module 114 uses”, further specifying “the order in which to arrange the segments”, “the accompanying audio or text, and/or other aspects of the visual presentation 120” (Cheng, Fig. 2 and related text, 7:25-27).  Cheng is thereby generating the second utterance output in a chronological sequence, by combining data that is based on the second video-based output, the second acoustic-based output, and the second text-based output.)
[…] combining the first utterance output and the second utterance output.  (Since Cheng explicitly states that “the extracted audio features may individually or collectively be used to detect audio concepts and/or to identify individual speakers”, the feature data of these identified individual speakers (the first utterance output and the second utterance output) are, together, combined for the duration in which these two utterances occurred (the first time window).)
Cheng does not explicitly teach: generating first combined speaker features for the first time window […]
However, in an analogous art of recognizing events with conversational audio analytics, Wooters teaches: generating first combined speaker features for the first time window […]  (Wooters discloses “[…] may include utterances by the primary speaker that are not addressed at the conversational computing interface, for example, as with user utterance 108C3 responding to the barista's question posed in other speaker utterance 108C2.  Nevertheless, a conversational computing interface according to the present disclosure is configured to delineate turns in multi-speaker situations, of which the conversation depicted in FIG. 3B is one non-limiting example.  For example, a previously-trained model of the conversational computing interface may be trained on conversation histories in which a user's turn is interrupted, such as conversation history 106" depicted in FIG. 3B, thereby configuring the previously-trained model to recognize similar situations” (Wooters, Fig. 3B and related text, ¶0070-0071).  By training conversational computing models on turns (time windows) containing features of multiple speakers, Wooters is generating first combined speaker features for the first user turn (first time window) in segmented conversations.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the multimodal analysis system, of Cheng, with the combined speaker utterance conversation segmentation of Wooters, in order to improve Cheng’s ability to “detect the presence of and recognize various elements of the visual scenes that are depicted” when it comes to multiple speakers conversing over video (Cheng, 1:38-39).
 
Regarding Claim 3:
The method of claim 2 further comprising: applying, to a sequential part of the machine learning system, the first […] speaker features, and an output, from the sequential part of the machine learning system, that was generated by the sequential part of the machine learning system in response to the sequential part of the machine learning system receiving at least second […] speaker features for a second time window prior to the first time window; and  (Cheng disclosed “learning modules 152, including feature learning modules 154, a concept learning module 156, a salient event learning module 
	generating, by the sequential part of the machine learning system, one or more conversation analysis indicators in response to the receiving the first […] speaker features and the output for the second time window.  (Cheng discloses a saliency indicator 238, which is a resulting output after the feature detection modules and event detection modules consider multiple sequential machine learning models in the multimedia content understanding module 104 (Cheng, Figs. 1 and 2 and related text, 17:48-50).  Further, Cheng discloses using saliency indicators (conversation analysis indicators) to prioritize salient event segments for presentation processing at the output generator module stage, which is after the sequential machine learning system processing of first [combined] speaker features (Cheng, Figs. 1, 2, 3 and related text, 17:64-18:3).  So, Cheng is using audio, visual, and text media from a first time window, and from a second time window before it, to generate, individually for each window, combined speaker features and, after that, a conversation analysis indicator.)
Cheng does not explicitly teach: […] combined speaker features, […] second combined speaker features […] 
[…] first combined speaker features
applying […] combined speaker features, […] second combined speaker features […]  (Wooters discloses that “a user turn 108C may include utterances […] responding to the barista's question posed in other speaker utterance 108C2.  Nevertheless, a conversational computing interface according to the present disclosure is configured to delineate turns in multi-speaker situations, of which the conversation depicted in FIG. 3B is one non-limiting example.  For example, a previously-trained model of the conversational computing interface may be trained on conversation histories in which a user's turn is interrupted, such as conversation history 106" depicted in FIG. 3B, thereby configuring the previously-trained model to recognize similar situations” (Wooters, Fig. 3B and related text, ¶0070-0071).   Wooters is segmenting conversations into user turns and processing utterances from multiple users per user turn (time window), thereby generating combined speaker features for each consecutive user turn (second combined speaker features).
[…] first combined speaker features  (Wooters discloses that “a user turn 108C may include utterances […] responding to the barista's question posed in other speaker utterance 108C2.  Nevertheless, a conversational computing interface according to the present disclosure is configured to delineate turns in multi-speaker situations, of which the conversation depicted in FIG. 3B is one non-limiting example.  For example, a previously-trained model of the conversational computing interface may be trained on conversation histories in which a user's turn is interrupted, such as conversation history 106" depicted in FIG. 3B, thereby configuring the previously-trained model to recognize similar situations” (Wooters, Fig. 3B and related text, ¶0070-0071).   Wooters is segmenting conversations into user turns and processing utterances from multiple users per user turn (time window), thereby generating combined speaker features for each consecutive user turn (including first combined speaker features).


Regarding Claim 4:
	The method of claim 3, wherein the one or more conversation analysis indicators include a set of emotional scores.   (Cheng discloses “The models 134, 136 correlate semantic descriptions of audio, visual, text, etc. features and concepts with instances or combinations of output of the algorithms 130 that evidence those features and concepts. For example, the feature models 134 may define relationships between sets of low-level features detected by the algorithms 130 with semantic descriptions of those sets of audio, visual, text, etc. features (e.g., visual concept descriptions such as "object," "person," "face," "ball," "vehicle," and audio concept descriptions such as "happy", "annoyed," "excited," "calm," etc.)” (Cheng, Figs. 1, 2, 5 and related text, 5:11-20).  Cheng is thereby associating the input text media’s features with other media that help provide semantic descriptions of the emotional content of the conversation in those sets of data.)

Regarding Claim 5:
	The method of claim 3, wherein the one or more conversation analysis indicators include at least one confidence score for at least one emotional label.  (Cheng disclosed semantic “audio concept descriptions such as “happy,” “annoyed,” “excited,” “calm,” etc.” (Cheng, Fig, 5:11-20).  Further, Cheng states that “Each or any of the models 134, 136 and/or the mapping 140 can maintain (e.g., probabilistic or statistical) indicators of the determined evidentiary significance of the relationships between 

Regarding Claim 6:
	The method of claim 3 further comprising: storing a series of sets of conversation analysis indicators, each set of conversation analysis indicators corresponding to a segment of the conversation;  (Cheng discloses “saliency indicators 238 (FIG. 2), which indicate, for particular salient activities, a variable degree of saliency associated with the activity as it relates to a particular event” (Cheng, Fig. 2 and related text, 5:54-57).  Further, Cheng discloses “at block 332, the system 100 identifies the salient event segments 112 in the multimedia input file(s)” (Cheng, Fig. 3 and related text, 17:64-65).  These sets of saliency (conversation analysis) indicators correspond to segments and are stored either on “the server computing device 650 may operate a “back end” portion 658 of the multimedia content assistant 
	wherein the sets of conversation analysis indicators correspond to segments of the conversation that represent the entire conversation.  (Cheng discloses that the salient event segments are parts of complex events which are made up of activities: “examples of complex events include human interactions with other people (e.g., conversations, meetings, presentations, etc.) […] activities that make up a complex event are not limited to visual features.  Rather, “activities” as used herein may refer to, among other things, visual, audio, and/or text features, which may be detected by the computing system 100 in an automated fashion using a number of different algorithms and feature detection techniques as described in more detail” (Cheng, 3:41-51).  Since Cheng automatically segments conversations into all constituent audio, visual, and text features, the sets of conversation analysis indicators correspond to segments of the conversation that represent the entire conversation.)

Regarding Claim 7:
The method of claim 1, wherein the first video processing part of the machine learning system is a recurrent neural network, the first acoustic processing part is a convolutional neural network, and the first textual processing part is a convolutional neural network.  (Wooters discloses “Machines may be implemented using any suitable combination of state-of-the-art and/or future ML, AI, statistical, and/or NLP techniques,” and provides non-limiting examples of “multi-layer neural networks, convolutional neural networks (e.g., including spatial convolutional neural networks for processing images and/or videos, temporal convolutional neural networks for processing audio signals and/or natural language sentences, and/or any other suitable convolutional neural networks configured to convolve and pool features across one or more temporal and/or spatial dimensions), recurrent neural networks (e.g.,”  
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the multimodal analysis system, of Cheng, with the combined speaker utterance conversation segmentation of Wooters, in order to improve Cheng’s ability to “detect the presence of and recognize various elements of the visual scenes that are depicted” when it comes to multiple speakers conversing over video (Cheng, 1:38-39).

Regarding Claim 9:
Cheng teaches the computing system of claim 8 wherein the operations further comprise:
generating a second utterance output by applying a second utterance representation from the conversation to the machine learning system, wherein the second utterance representation is associated with a second user and corresponds to a first time window that also corresponds to the first utterance representation, by:  (Cheng’s multimedia data, segmented by association with utterances of a conversation, is processed through a multimedia content understanding module 104, in which a machine learning system processes multiple utterance representations, and includes a “salient event learning module 158, and a template learning module 160.  The learning modules 152 execute machine learning algorithms” (Cheng, Figs. 1 and 2 and related text, 7:47-49).  After processing through the multimedia content understanding module, these utterance representations are sent to the output generator module 114.  Since Cheng discloses “the 
applying the video data of the second utterance representation to a second video processing part of the machine learning system to generate second video-based output,  (“The visual feature detection module 212 analyses each segment 204 using the visual feature models 236, and outputs a set of virtual features 220 that have been detected in the segment 204.  To do this, the visual feature detection module 212 employs a number of automated feature recognition algorithms” (Cheng, Figure 2 and related text, 10:52-57).  Since Cheng directs to “aggregate these features over the utterance level” (Cheng, 12:24) and is associating the time windows of input text media’s features with other media to provide semantic descriptions of the conversation segments, as noted in claim 1, we know the utterance level is initially multimodal and not yet defined as being an individual speaker.  Further, “The learning modules 152 execute machine learning algorithms on samples of multimedia content (images and/or video) of an image/video collection” (Cheng, Fig. 2 and related text, 7:48-50).  Since the learning modules are located in the multimedia content understanding module before the media data is sent to the output generator, Cheng is applying video data of the second segment of video features (second utterance representation) to a second video processing part of the machine learning system to generate second video-based output.)
applying the acoustic data of the second utterance representation to a second acoustic processing part of the machine learning system to generate second acoustic-based output,  (“example, the audio feature detection module 214 may detect an acoustic characteristic of the audio track of a certain segment of an input video 102, and, with the audio feature model (which may be trained using machine learning techniques)”, (Cheng, Fig. 2 and related text, 12:2-6).  Since Cheng directs to “aggregate these features over the utterance level” (Cheng, 12:24) and is associating the time windows of input text media’s features with other media to provide semantic descriptions of the conversation segments, as noted in claim 1, we know the utterance level is initially multimodal and not yet defined as being an individual speaker.  Cheng is thereby applying a second utterance representation, that is associated with a second user, to a second part of the machine learning system, to generate a second utterance output.)
applying the text data of the second utterance representation to a second textual processing part of the machine learning system to generate second text-based output, and (Cheng’s system “can detect the presence of a variety of different types of multimedia features in the multimedia input 102, including audio and text, in addition to the more typical visual features […] Any or all of these features may be detected using detectors that are trained via machine-learning techniques“ (Cheng, 11:60-65).  “The text feature detection module 216 interfaces with an automated speech recognition (ASR) system and/or a video optical character recognition (OCR) system” (Cheng, Fig. 2 and related text, 13:5-7).  Since Cheng directs to “aggregate these features over the utterance level” (Cheng, 12:24) and is associating the time windows of multiple media’s features to provide semantic descriptions of the conversations segments, as noted in claim 1, we know the utterance level defined here is initially multimodal and not yet defined as being an individual speaker.  Cheng later further discloses that “the extracted audio features may individually or collectively be used to detect audio concepts and/or to identify individual speakers” (Cheng, 12:40-42).  It is after this disclosure that we see Cheng anticipate 
generating the second utterance output by combining data that is based on the second video-based output, the second acoustic-based output, and the second text-based output; and   (“The output generator module 114 and its submodules, a visual presentation generator module 116 and a natural language generator module 118, are each embodied as software, firmware, hardware, or a combination thereof.” (Cheng, Fig. 2 and related text, 6:35-39).  Further, Cheng discloses “presentation templates 142 provide the specifications that the output generator module 114 uses”, further specifying “the order in which to arrange the segments”, “the accompanying audio or text, and/or other aspects of the visual presentation 120”. (Cheng, Fig. 2 and related text, 7:25-27).  Cheng is thereby generating the second utterance output in a chronological sequence, by combining data that is based on the second video-based output, the second acoustic-based output, and the second text-based output.)
[…] combining the first utterance output and the second utterance output.  (Since Cheng explicitly states that “the extracted audio features may individually or collectively be used to detect audio concepts and/or to identify individual speakers”, these identified individual speakers’ feature data (first utterance output and the second utterance output), together, have generated first combined speaker features for the duration in which these two utterances occurred (first time window).)

Cheng does not explicitly teach: generating first combined speaker features for the first time window […] 
However, in an analogous art of recognizing events with conversational audio analytics, Wooters teaches: first combined speaker features for the first time window […] (Wooters discloses that “a user turn 108C may include utterances by the primary speaker that are not addressed at the conversational computing interface, for example, as with user utterance 108C3 responding to the barista's question posed in other speaker utterance 108C2.  Nevertheless, a conversational computing interface according to the present disclosure is configured to delineate turns in multi-speaker situations, of which the conversation depicted in FIG. 3B is one non-limiting example.  For example, a previously-trained model of the conversational computing interface may be trained on conversation histories in which a user's turn is interrupted, such as conversation history 106" depicted in FIG. 3B, thereby configuring the previously-trained model to recognize similar situations” (Wooters, Fig. 3B and related text, ¶0070-0071).  By training conversational computing models on turns (time windows) containing features of multiple speakers, Wooters is generating first combined speaker features for the first user turn (first time window) in segmented conversations.)
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the multimodal analysis system, of Cheng, with the combined speaker utterance conversation segmentation of Wooters, in order to improve Cheng’s ability to “detect the presence of and recognize various elements of the visual scenes that are depicted” when it comes to multiple speakers conversing over video (Cheng, 1:38-39).

Regarding Claim 10:
The computing system of claim 9 further comprising: applying, to a sequential part of the machine learning system, the first […] speaker features, and an output, from the sequential part of the machine learning system, that was generated by the sequential part of the machine learning system in response to the sequential part of the machine learning system receiving at least second […] speaker features for a second time window prior to the first time window; and  (Cheng disclosed “learning modules 152, including feature learning modules 154, a concept learning module 156, a salient event learning module 158, and a template learning module 160.  The learning modules 152 execute machine 
	generating, using the sequential part of the machine learning system, one or more conversation analysis indicators in response to the receiving the first […] speaker features and the output for the second time window.  (Cheng discloses a saliency indicator 238, which is a resulting output after the feature detection modules and event detection modules consider multiple sequential machine learning models in the multimedia content understanding module 104 (Cheng, Figs. 1 and 2 and related text, 17:48-50).  Further, Cheng discloses using saliency indicators (conversation analysis indicators) to prioritize salient event segments for presentation processing at the output generator module stage, which is after the sequential machine learning system processing of first [combined] speaker features (Cheng, Figs. 1, 2, 3 and related text, 17:64-18:3).  So, Cheng is using audio, visual, and text media from a first time window, and from a second time window before it, to generate, individually for each window, combined speaker features and, after that, a conversation analysis indicator.)
Cheng does not explicitly teach: […] combined speaker features, […] second combined speaker features […] 
[…] first combined speaker features
applying […] combined speaker features, […] second combined speaker features […]  (Wooters discloses that “a user turn 108C may include utterances […] responding to the barista's question posed in other speaker utterance 108C2.  Nevertheless, a conversational computing interface according to the present disclosure is configured to delineate turns in multi-speaker situations, of which the conversation depicted in FIG. 3B is one non-limiting example.  For example, a previously-trained model of the conversational computing interface may be trained on conversation histories in which a user's turn is interrupted, such as conversation history 106" depicted in FIG. 3B, thereby configuring the previously-trained model to recognize similar situations” (Wooters, Fig. 3B and related text, ¶0070-0071).   Wooters is segmenting conversations into user turns and processing utterances from multiple users per user turn (time window), thereby generating combined speaker features for each consecutive user turn (second combined speaker features).
[…] first combined speaker features  (Wooters discloses that “a user turn 108C may include utterances […] responding to the barista's question posed in other speaker utterance 108C2.  Nevertheless, a conversational computing interface according to the present disclosure is configured to delineate turns in multi-speaker situations, of which the conversation depicted in FIG. 3B is one non-limiting example.  For example, a previously-trained model of the conversational computing interface may be trained on conversation histories in which a user's turn is interrupted, such as conversation history 106" depicted in FIG. 3B, thereby configuring the previously-trained model to recognize similar situations” (Wooters, Fig. 3B and related text, ¶0070-0071).  Wooters is segmenting conversations into user turns and processing utterances from multiple users per user turn (time window), thereby generating combined speaker features for each consecutive user turn (including first combined speaker features).)


Regarding Claim 11:
	The computing system of claim 10 wherein the one or more conversation analysis indicators include at least one of an emotional score, an engagement score, a genuineness score, an intensity score, or any combination thereof.   (Cheng discloses “The models 134, 136 correlate semantic descriptions of audio, visual, text, etc. features and concepts with instances or combinations of output of the algorithms 130 that evidence those features and concepts. For example, the feature models 134 may define relationships between sets of low-level features detected by the algorithms 130 with semantic descriptions of those sets of audio, visual, text, etc. features (e.g., visual concept descriptions such as "object," "person," "face," "ball," "vehicle," and audio concept descriptions such as "happy", "annoyed," "excited," "calm," etc.)” (Cheng, Figs. 1, 2, 5 and related text, 5:11-20).  Further, Cheng links these emotional labels to confidence scores by describing that “Each or any of the models 134, 136 and/or the mapping 140 can maintain (e.g., probabilistic or statistical) indicators of the determined evidentiary significance of the relationships between features, concepts, events, and salient activities.  In some embodiments, indicators of evidentiary significance are determined using machine learning techniques” (Cheng, Fig. 1 and related text, 14:65-15:3).  Cheng is thereby assigning probabilistic/statistical indicators (confidence scores) to relationships between features and concepts such as “happy” (emotional labels).  Cheng is thereby associating the input media’s features with semantic description scores for emotional content such as excited, which may plausibly also serve as an engagement or intensity score.   

Regarding Claim 12:
	The computing system of claim 10, wherein the one or more conversation analysis indicators include at a set of emotional labels, and each emotional label further includes a confidence score and an intensity score. (Cheng disclosed semantic “audio concept descriptions such as “happy,” “annoyed,” “excited,” “calm,” etc.” (Cheng, Fig, 5:11-20).  Further, Cheng links these emotional labels to confidence scores by describing that “Each or any of the models 134, 136 and/or the mapping 140 can maintain (e.g., probabilistic or statistical) indicators of the determined evidentiary significance of the relationships between features, concepts, events, and salient activities.  In some embodiments, indicators of evidentiary significance are determined using machine learning techniques” (Cheng, Fig. 1 and related text, 14:65-15:3).  Cheng further states that “Prosodic features can be used to analyze the emotional or affective content of a speech signal” (Cheng, 12:16-19).  Further, Cheng states “In some embodiments, the extracted audio features may individually or collectively be used to detect audio concepts and/or to identify individual speakers or speaker-specific characteristics, including different emotions, affect, or different states of arousal or agitation, or changes in emotion/ affect/arousal state over time.  To do this, a bag-of-words representation may be used.  For instance, hierarchical K-means clustering can be used to build vocabularies from the extracted audio features” (Cheng, 12:40-48).   Further, “Using these vocabularies, the extracted features can be quantized to obtain a histogram representation corresponding to each feature type.”  Cheng is thereby quantizing probabilities (scores) that prosodic features are indicative of emotional labels such as arousal (intensity), as part of the process which generates saliency indicators (conversation analysis indicators).  



Regarding Claim 13:
	The computing system of claim 10, wherein the operations further comprise: storing a series of sets of conversation analysis indicators, each set of conversation analysis indicators corresponding to a segment of the conversation;  (Cheng discloses “saliency indicators 238 (FIG. 2), which indicate, for particular salient activities, a variable degree of saliency associated with the activity as it relates to a particular event” (Cheng, Fig. 2 and related text, 5:54-57).  Further, Cheng discloses “at block 332, the system 100 identifies the salient event segments 112 in the multimedia input file(s)” (Cheng, Fig. 3 and related text, 17:64-65).  These sets of saliency (conversation analysis) indicators correspond to segments and are stored either on “the server computing device 650 may operate a “back end” portion 658 of the multimedia content assistant computing system 100” or on the user’s computing device 610 where “the storage media 620 may include one or more hard drives or other suitable data storage devices (e.g., flash memory, memory cards, memory sticks, and/or others)” (Cheng, Figs. 1, 6 and related text, 22:46-48).)
	wherein the sets of conversation analysis indicators correspond to segments of the conversation that represent the entire conversation.  (Cheng discloses that the salient event segments are parts of complex events which are made up of activities: “examples of complex events include human interactions with other people (e.g., conversations, meetings, presentations, etc.) […] activities that make up a complex event are not limited to visual features.  Rather, “activities” as used herein may refer to, among other things, visual, audio, and/or text features, which may be detected by the computing system 100 in an automated fashion using a number of different algorithms and feature detection techniques as described in more detail” (Cheng, 3:41-51).  Since Cheng automatically segments conversations into all constituent audio, visual, and text features, the sets of conversation analysis indicators correspond to segments of the conversation that represent the entire conversation.)

Regarding Claim 14:
The computing system of claim 8, wherein each of: the first video processing part of the machine learning system, the first acoustic processing part of the machine learning system, and the first textual processing part of the machine learning system is one of a convolutional neural network or a recurrent neural network.  (Wooters discloses “Machines may be implemented using any suitable combination of state-of-the-art and/or future ML, AI, statistical, and/or NLP techniques,” and provides non-limiting examples of “multi-layer neural networks, convolutional neural networks (e.g., including spatial convolutional neural networks for processing images and/or videos, temporal convolutional neural networks for processing audio signals and/or natural language sentences, and/or any other suitable convolutional neural networks configured to convolve and pool features across one or more temporal and/or spatial dimensions), recurrent neural networks (e.g.,”  (Wooters, ¶0081).  In the same paragraph, Wooters goes on to provide RNN examples.  Wooters thereby teaches convolutional neural networks (or recurrent neural networks) to handle the video, audio, and natural language sentence (textual) processing of the machine learning system. 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the multimodal analysis system, of Cheng, with the combined speaker utterance conversation segmentation of Wooters, in order to improve Cheng’s ability to “detect the presence of and recognize various elements of the visual scenes that are depicted” when it comes to multiple speakers conversing over video (Cheng, 1:38-39).

Regarding Claim 16:
Cheng teaches the computer-readable storage medium of claim 15 wherein the actions further comprise:
generating a second utterance output by applying a second utterance representation from the conversation to the machine learning system, wherein the second utterance representation is associated with a second user and corresponds to a first time window that also corresponds to the first utterance representation, and wherein generating the second utterance representation includes:  (Cheng’s multimedia data, segmented by association with utterances of a conversation, is processed through a multimedia content understanding module 104, in which a machine learning system processes multiple utterance representations, and includes a “salient event learning module 158, and a template learning module 160.  The learning modules 152 execute machine learning algorithms” (Cheng, Figs. 1 and 2 and related text, 7:47-49).  After processing through the multimedia content understanding module, these utterance representations are sent to the output generator module 114.  Since Cheng discloses “the extracted audio features may individually or collectively be used to detect audio concepts and/or to identify individual speakers”, we know that in the case of features being aggregated over the utterance level, they are not yet necessarily also aggregated over the individual speaker level (Cheng, 12:23-24).  Cheng is thereby generating utterance outputs, including a second utterance output, by applying utterance representations (including a second utterance representation), associated with two or more users (including a second user) and corresponding to the duration in which these two utterances occurred (first time window), to a machine learning system in order to generate utterance outputs (including a second utterance output).)
applying the video data of the second utterance representation to a second video processing part of the machine learning system to generate second video-based output,  (“The visual feature detection module 212 analyses each segment 204 using the visual feature models 236, and outputs a set of virtual features 220 that have been detected in the segment 204.  To do this, the visual feature detection module 212 employs a number of automated feature recognition algorithms” (Cheng, Figure 2 and related text, 10:52-57).  Since Cheng directs to “aggregate these features over the utterance level” (Cheng, 12:24) and is associating the time windows of input text media’s features with other media to provide semantic descriptions of the conversation segments, as noted in claim 1, we know the utterance level is initially multimodal and not yet defined as being an individual speaker.  Further, “The learning modules 152 execute machine learning algorithms on samples of multimedia content (images and/or video) of an image/video collection” (Cheng, Fig. 2 and related text, 7:48-50).  Since the learning modules are located in the multimedia content understanding module before the media data is sent to the output generator, Cheng is applying video data of the second segment of video features (second utterance representation) to a second video processing part of the machine learning system to generate second video-based output.)
applying the acoustic data of the second utterance representation to a second acoustic processing part of the machine learning system to generate second acoustic-based output,  (“example, the audio feature detection module 214 may detect an acoustic characteristic of the audio track of a certain segment of an input video 102, and, with the audio feature model (which may be trained using machine learning techniques)”, (Cheng, Fig. 2 and related text, 12:2-6).  Since Cheng directs to “aggregate these features over the utterance level” (Cheng, 12:24) and is associating the time windows of input text media’s features with other media to provide semantic descriptions of the conversation segments, as noted in claim 1, we know the utterance level is initially multimodal and not yet defined as being an individual speaker.  Cheng is thereby applying a second utterance representation, that is associated with a second user, to a second part of the machine learning system, to generate a second utterance output.)
generating the second utterance output by combining data that is based on the second video-based output and the second acoustic-based output; and   (“The output generator module 114 and its submodules, a visual presentation generator module 116 and a natural language generator module 118, are each embodied as software, firmware, hardware, or a combination thereof.” (Cheng, Fig. 2 and related text, 6:35-39).  Further, Cheng discloses “presentation templates 142 provide the specifications 
[…] combining the first utterance output and the second utterance output.  (Since Cheng explicitly states that “the extracted audio features may individually or collectively be used to detect audio concepts and/or to identify individual speakers”, these identified individual speakers’ feature data (first utterance output and the second utterance output), together, have generated first combined speaker features for the duration in which these two utterances occurred (first time window).)
Cheng does not explicitly teach: generating first combined speaker features for the first time window […]
However, in an analogous art of recognizing events with conversational audio analytics, Wooters teaches: first combined speaker features for the first time window […] (Wooters discloses that “a user turn 108C may include utterances by the primary speaker that are not addressed at the conversational computing interface, for example, as with user utterance 108C3 responding to the barista's question posed in other speaker utterance 108C2.  Nevertheless, a conversational computing interface according to the present disclosure is configured to delineate turns in multi-speaker situations, of which the conversation depicted in FIG. 3B is one non-limiting example.  For example, a previously-trained model of the conversational computing interface may be trained on conversation histories in which a user's turn is interrupted, such as conversation history 106" depicted in FIG. 3B, thereby configuring the previously-trained model to recognize similar situations” (Wooters, Fig. 3B and related text, ¶0070-0071).   By training conversational computing models on turns (time windows) containing features of 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the multimodal analysis system of Cheng with the combined speaker utterance conversation segmentation of Wooters, in order to improve Cheng’s ability to “detect the presence of and recognize various elements of the visual scenes that are depicted” when it comes to multiple speakers conversing over video (Cheng, 1:38-39).

Regarding Claim 17:
The computer-readable storage medium of claim 16 wherein the actions further comprise: applying, to a sequential part of the machine learning system, the first […] speaker features, and an output, from the sequential part of the machine learning system, that was generated by the sequential part of the machine learning system in response to the sequential part of the machine learning system receiving at least second […] speaker features for a second time window prior to the first time window; and  (Cheng discloses “learning modules 152, including feature learning modules 154, a concept learning module 156, a salient event learning module 158, and a template learning module 160.  The learning modules 152 execute machine learning algorithms on samples […] and/or update portions of the knowledge base 132 and/or the presentation templates”  (Cheng, Fig. 1, 7:45-52).  These learning modules use machine learning to create portions of multiple models in the multimedia content knowledge base 132, the output generator module 114’s presentation templates, and the multimedia content understanding module 104.  Since machine learning models that input and/or output sequences of data are sequential machine learning models, these streams of multimedia machine learning processing are all sequential machine learning.  Cheng is applying consecutive segments of machine learning system outputs from one sequential part of the machine learning system to another, in 
	generating, using the sequential part of the machine learning system, one or more conversation analysis indicators in response to the receiving the first […] speaker features and the output for the second time window.  (Cheng discloses a saliency indicator 238 which is a resulting output after the feature detection modules and event detection modules consider multiple sequential machine learning models in the multimedia content understanding module 104 (Cheng, Figs. 1 and 2 and related text, 17:48-50).  Further, Cheng discloses using saliency indicators (conversation analysis indicators) to prioritize salient event segments for presentation processing at the output generator module stage, which is after the sequential machine learning system processing of first [combined] speaker features (Cheng, Figs. 1, 2, 3 and related text, 17:64-18:3).  So, Cheng is using audio, visual, and text media from a first time window and a second time window before it to generate combined speaker features individually for each window and, after that, a conversation analysis indicator.)

Cheng does not explicitly teach: […] combined speaker features, […] second combined speaker features […] 
[…] first combined speaker features

However, in an analogous art of recognizing events with conversational audio analytics, Wooters teaches: applying […] combined speaker features, […] second combined speaker features […]  (Wooters discloses that “a user turn 108C may include utterances […] responding to the barista's question posed in other speaker utterance 108C2.  Nevertheless, a conversational computing interface according to the present disclosure is configured to delineate turns in multi-speaker situations, of which the conversation depicted in FIG. 3B is one non-limiting example.  For example, a previously-trained model of the conversational computing interface may be trained on conversation histories in which a user's turn is interrupted, such as conversation history 106" depicted in FIG. 3B, thereby configuring the previously-trained model to recognize similar situations” (Wooters, Fig. 3B and related text, ¶0070-0071).   Wooters is segmenting conversations into user turns and processing utterances from multiple users per user turn (time window), thereby generating combined speaker features for each consecutive user turn (second combined speaker features).)
[…] first combined speaker features  (Wooters discloses that “a user turn 108C may include utterances […] responding to the barista's question posed in other speaker utterance 108C2.  Nevertheless, a conversational computing interface according to the present disclosure is configured to delineate turns in multi-speaker situations, of which the conversation depicted in FIG. 3B is one non-limiting example.  For example, a previously-trained model of the conversational computing interface may be trained on conversation histories in which a user's turn is interrupted, such as conversation history 106" depicted in FIG. 3B, thereby configuring the previously-trained model to recognize similar situations” (Wooters, Fig. 3B and related text, ¶0070-0071).   Wooters is segmenting conversations into user turns and processing utterances from multiple users per user turn (time window), thereby generating combined speaker features for each consecutive user turn, including the first time window (first combined speaker features).)

It would have been obvious to one of ordinary skill in the art before the effective filing date of the present application to combine the multimodal analysis system of Cheng with the combined speaker utterance conversation segmentation of Wooters, in order to improve Cheng’s ability to “detect the presence of and recognize various elements of the visual scenes that are depicted” when it comes to multiple speakers conversing over video (Cheng, 1:38-39).

Regarding Claim 18:
	The computing system of claim 17 wherein the one or more conversation analysis indicators include at least one of a user confidence score, conversation quality score, an enthusiasm score, an attention score, a goal discussion score, an emotional suppression score, an uncertainty reduction score, a nonconscious mimicry score, or any combination thereof.  (Cheng discloses “The models 134, 136 correlate semantic descriptions of audio, visual, text, etc. features and concepts with instances or combinations of output of the algorithms 130 that evidence those features and concepts. For example, the feature models 134 may define relationships between sets of low-level features detected by the algorithms 130 with semantic descriptions of those sets of audio, visual, text, etc. features (e.g., visual concept descriptions such as "object," "person," "face," "ball," "vehicle," and audio concept descriptions such as "happy", "annoyed," "excited," "calm," etc.)” (Cheng, Figs. 1, 2, 5 and related text, 5:11-20).  Further, Cheng links these emotional labels to confidence scores by describing that “Each or any of the models 134, 136 and/or the mapping 140 can maintain (e.g., probabilistic or statistical) indicators of the determined evidentiary significance of the relationships between features, concepts, events, and salient activities.  In some embodiments, indicators of evidentiary significance are determined using machine learning techniques” (Cheng, Fig. 1 and related text, 14:65-15:3).  Cheng is thereby assigning probabilistic/statistical indicators (scores) to relationships between features and concepts such as “excited” (enthusiasm).  
Further, Cheng states “In some embodiments, the extracted audio features may individually or collectively be used to detect audio concepts and/or to identify individual speakers or speaker-specific characteristics, including different emotions, affect, or different states of arousal or agitation, or changes in emotion/affect/arousal state over time.” (Cheng, 12:40-48).  Cheng is thereby associating the input media’s features with semantic description scores for emotional content such as excited, which is an enthusiasm score, and arousal, which may be either an attention score or an 

Regarding Claim 19:
	The computer-readable storage medium of claim 17 wherein the one or more conversation analysis indicators include at a set of emotional labels, and each emotional label further includes a confidence score and an intensity score. (Cheng discloses semantic “audio concept descriptions such as "happy," "annoyed," "excited," "calm," etc.” (Cheng, 5:11-20).  Further, Cheng links these emotional labels to confidence scores by describing that “Each or any of the models 134, 136 and/or the mapping 140 can maintain (e.g., probabilistic or statistical) indicators of the determined evidentiary significance of the relationships between features, concepts, events, and salient activities.  In some embodiments, indicators of evidentiary significance are determined using machine learning techniques” (Cheng, Fig. 1 and related text, 14:65-15:3).  Cheng further states that “Prosodic features can be used to analyze the emotional or affective content of a speech signal” (Cheng, 12:16-19).  Further, Cheng states “In some embodiments, the extracted audio features may individually or collectively be used to detect audio concepts and/or to identify individual speakers or speaker-specific characteristics, including different emotions, affect, or different states of arousal or agitation, or changes in emotion/affect/arousal state over time.  To do this, a bag-of-words representation may be used.  For instance, hierarchical K-means clustering can be used to build vocabularies from the extracted audio features” (Cheng, 12:40-48).   Further, “Using these vocabularies, the extracted features can be quantized to obtain a histogram representation corresponding to each feature type.”  Cheng is thereby quantizing probabilities (scores) that prosodic features are indicative of emotional labels such as arousal (intensity), as part of the process which generates saliency indicators (conversation analysis indicators).)

Regarding Claim 20:
	The computer-readable storage medium of claim 17, wherein the actions further comprise: storing a series of sets of conversation analysis indicators, each set of conversation analysis indicators corresponding to a segment of the conversation; and (Cheng discloses “saliency indicators 238 (FIG. 2), which indicate, for particular salient activities, a variable degree of saliency associated with the activity as it relates to a particular event” (Cheng, Fig. 2 and related text, 5:54-57).  Further, Cheng discloses “at block 332, the system 100 identifies the salient event segments 112 in the multimedia input file(s)” (Cheng, Fig. 3 and related text, 17:64-65).  These sets of saliency (conversation analysis) indicators correspond to segments and are stored either on the server, where “the server computing device 650 may operate a “back end” portion 658 of the multimedia content assistant computing system 100”, or on the user’s computing device 610, where “the storage media 620 may include one or more hard drives or other suitable data storage devices (e.g., flash memory, memory cards, memory sticks, and/or others)” (Cheng, Figs. 1, 6 and related text, 22:46-48).)
	wherein the sets of conversation analysis indicators correspond to segments of the conversation that represent the entire conversation.  (Cheng discloses that the salient event segments are parts of complex events, which are made up of activities: “examples of complex events include human interactions with other people (e.g., conversations, meetings, presentations, etc.) […] activities that make up a complex event are not limited to visual features.  Rather, “activities” as used herein may refer to, among other things, visual, audio, and/or text features, which may be detected by the computing system 100 in an automated fashion using a number of different algorithms and feature detection techniques as described in more detail” (Cheng, 3:41-51).  Since Cheng automatically segments conversations into all constituent audio, visual, and text features, the resulting sets of conversation analysis indicators correspond to segments of the conversation that represent the entire conversation.)

Conclusion
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the mailing date of this final action.  
Any inquiry concerning this communication or earlier communications from the examiner should be directed to PIERCE ANDREW MOONEY whose telephone number is (571)272-0971. The examiner can normally be reached Monday-Friday 8:30am-5:30pm EST.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel Washburn can be reached on 571-272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like 




/PIERCE ANDREW MOONEY/Examiner, Art Unit 2657

/DANIEL C WASHBURN/Supervisory Patent Examiner, Art Unit 2657