Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
Continued Examination Under 37 CFR 1.114 
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection. Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114. Applicant's submission filed on 12/28/2021 has been entered.
Status of the Claims
Claims 1, 5-8, 15-18, 22-33, and 36-38 are pending. 
Response to Applicant’s Argument
In response to “Applicant's amended claim 1 recites "...wherein each voice segment corresponds to a portion of the recording that includes a voice of one of the multiple speakers, a first voice segment including the voice of one of the multiple speakers in a conversation with a first customer and a second voice segment including the voice of one of the multiple speakers in a conversation with a second customer..." Beaumont is entirely silent on the segments of the part(ies) to whom the target speaker is speaking. One of skill in the art could not presume from this silence of Beaumont the party to who the voice segments relate” and “Likewise, Sidi and Ramaswamy do not teach anything related to voice segments of different speakers or customers, as claimed”:
Beaumont teaches using speaker recognition to identify portions of audio data pertaining to web conferencing containing potential human speakers, although it may not be known which is a human speaker and which is machine generated human speech from a virtual assistant (¶37 and ¶42).
Sidi teaches a parallel process of blind diarization at 110 resulting in audio data of separated speakers at 112, in which homogeneous speaker segments in the audio data are tagged as being associated with a first speaker or a second speaker (¶35), although the identities of the speakers (e.g., agent, customer) are not known at the blind diarization stage (¶30).


¶36). In the customer service context, this includes the identification of which speaker is the customer service agent (¶36). In particular, the audio data may contain more than two speakers, such as one customer service agent and two customers (¶43); i.e., in this audio data, there is at least a first homogeneous speaker segment tagged as the agent speaking to the first of the two customers and a second homogeneous speaker segment tagged as the agent speaking to the second of the two customers (¶37; see also ¶39, at 118, homogeneous speaker segments matching voiceprint model 116 are tagged in the audio file as being the agent, and other homogeneous speaker segments matching speech models of customers per ¶38 are tagged as being the other speaker identified as the customer).
Beaumont requires speaker recognition identifying portions of the audio data containing potential human speakers while not knowing which is a human speaker (¶37) and thereafter requires highlighting identified primary speaker’s name in web conferencing application (¶42). 
Sidi provides a process of blind diarization separating speaker segments in audio data of one customer service agent and two customers (Sidi, ¶35 and ¶43) similar to the speaker recognition portion of Beaumont and thereafter performing speaker diarization separating audio data into homogeneous speech segments comprising at least a first speech segment of the agent speaking to a first customer and a second speech segment of the agent speaking to the second customer (Sidi, ¶38-39) that fulfills Beaumont’s need to identify a speaker’s name (Beaumont, ¶42). 
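Therefore, it would’ve been obvious to one ordinarily skilled in the art before the effective filing date of the invention to implement the speaker diarization process of Sidi to identify a first voice segment of the voice of the agent in a conversation with a first customer and a second voice segment of the voice of the agent in another conversation with the second customer in order to identify at least one speaker in the audio data (Sidi, ¶43; compare Beaumont, ¶42, identify the primary speaker’s name).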
Claim Rejections - 35 USC § 103
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 103 that form the basis for the rejections under this section made in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claims 1, 5, and 22 are rejected under 35 USC 103(a) as being unpatentable over Beaumont et al. (US 2015/0088515 A1) in view of Sidi et al. (US 2015/0025887 A1) and Ramaswamy et al. (US 6490560 B1).

Regarding Claim 1, Beaumont discloses a method, comprising: 
receiving, by a computer system, a recording of a conversation between multiple speakers (¶28, capturing audio and video data; ¶1, ¶42, web conferencing where users communicate voice data and video feeds), wherein the recording includes voice data, video data, and metadata that provides information regarding at least one of the multiple speakers (¶29 and ¶36, audio/video data of a crowded audio environment where there are at least two human speakers; ¶32, video data and audio data contain time stamps; ¶33, audio data contains therein directionality information related to a speaker that may be leveraged in the analysis of video data); 
generating, by the computer system, multiple sets / groups of voice segments from the recording, 
wherein each set of voice segments includes one or more voice segments corresponding to one of the multiple speakers, wherein each voice segment corresponds to a portion of the recording that includes a voice of one of the multiple speakers (¶37 and ¶39, identifying a first portion of audio + video with aligned timestamps corresponding to a first primary speaker and identifying a second portion of audio + video with aligned timestamps corresponding to a second primary speaker; ¶42, isolating the primary speaker’s audio data input from other speakers / noise); 
identifying, by the computer system, a given set of voice segments associated with a given speaker of the multiple speakers (¶42, isolating the primary speaker’s audio data input from other speakers / noise);
determining, by the computer system, a speaker identification parameter by analyzing the voice data associated with the given set of voice segments (¶31-33 and ¶37-39, directionality information in audio data and timestamp information in audio + video data are used to match audio data of one speaker with video data of the same speaker and to perform speaker identification), wherein the multiple speaker identification parameters are representative of audio features or video features that are used to identify the given speaker (¶37 and ¶39, in addition to speaker recognition to identify a portion of audio data containing potential human speaker, video data containing visual features associated with speech are additionally used to identify the first primary speaker as distinct from another primary speaker and as non-machine generated audio features); 
determining, by the computer system, an identity of the speaker (¶39, identifying a primary speaker for each portion of matched audio data and video data; ¶42, highlighting the identified primary speaker’s name in a web conferencing application).
Beaumont does not disclose determining the identity of the given speaker by comparing the speaker identification parameter to a series of fingerprints to discover a matching fingerprint, wherein the matching fingerprint is representative of a vocal characteristic of the given speaker derived from a past recording of a past conversation involving the given speaker and at least one other speaker, and retrieving identification information related to the given speaker from a network accessible source to assign the identification information to the given set of voice segments associated with the given speaker.
Sidi discloses a server computer (¶20 and ¶59, Fig. 3, computer system 300 / centralized server) generating multiple sets of voice segments from a recording that includes voice data and metadata (¶27-28, audio data 102 comprises real time streaming audio data and metadata 108 / identification data) by:
applying a statistical function to the recording to mitigate background noise (¶31 and ¶33, perform a blind diarization process to separate audio data into speech frames and non-speech (other speakers and background noise per ¶38) frames using a plurality of probabilities based on audio energy envelopes where blind diarization process then filters out non-speech frames), 
Fig. 1, ¶31, filtering out non-speech frames may be performed by removing a frame for blind diarization 110), and 
clustering the voice segments into multiple sets such that voice segments with similar features are assigned to the same cluster (¶32, after the audio file has been segmented, the identified segments are clustered into speakers (e.g., speaker 1, speaker 2, speaker N));
wherein each voice segment corresponds to a portion of the recording that includes a voice of one of the multiple speakers, a first voice segment including the voice of one of the multiple speakers in a conversation with a first customer and a second voice segment including the voice of one of the multiple speakers in a conversation with a second customer (¶36 and ¶¶38-39, cluster speech segments into groups of speech segments having the same speaker, compare the clustered groups to speech models of known agents or customers, and identify homogeneous speaker segments as customer service agent or customers; ¶43, identify and separate speakers in audio data that contain one customer service agent and two customers; i.e., in the clustered group of speech segments tagged as customer service agent, there is at least a first speech segment of the agent speaking to a first one of the two customers and at least a second speech segment of the agent speaking to a second one of the two customers). 
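For illustration only, the clustering step discussed above (voice segments with similar features assigned to the same set) can be sketched as follows. This is a minimal sketch under stated assumptions and is not a characterization of how Sidi or the claimed invention implements clustering: the feature vectors, the cosine-similarity measure, the greedy assignment strategy, the `threshold` value, and the names `cosine` and `cluster_segments` are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_segments(segments, threshold=0.9):
    """Greedily group (segment_id, feature_vector) pairs into speaker sets.

    Assumption: a segment joins the most similar existing cluster when the
    similarity to that cluster's centroid is at least `threshold`;
    otherwise it starts a new cluster (a new putative speaker).
    """
    clusters = []  # each cluster: {"centroid": [...], "members": [ids]}
    for seg_id, feat in segments:
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(feat, cluster["centroid"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"centroid": list(feat), "members": [seg_id]})
        else:
            # Update the running mean of the cluster centroid.
            n = len(best["members"])
            best["centroid"] = [
                (c * n + x) / (n + 1) for c, x in zip(best["centroid"], feat)
            ]
            best["members"].append(seg_id)
    return [cluster["members"] for cluster in clusters]
```

In this sketch each returned list plays the role of one "set of voice segments"; the sets remain unlabeled (speaker 1, speaker 2, ...) until a separate identification step assigns an identity.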
Further, the server computer determines a first speaker identification parameter by establishing, based on voice data associated with a given set of voice segments, a vocal characteristic of a given speaker (¶36 and see Fig. 1, speaker diarization 114 to identify tagged speakers using acoustic voiceprints) and determining an identity of a given speaker by comparing the first speaker identification parameter to a series of fingerprints to discover a matching fingerprint (¶39, compare voiceprint model 116 to homogeneous speaker audio data segments for multiple speakers to determine which separated speaker audio data segments have a greater likelihood of matching the acoustic voiceprint model 116 in order to tag the segments in the audio file in the metadata), wherein the matching fingerprint is representative of a vocal characteristic of the given speaker derived from a past recording of a past conversation involving the given speaker and at least one other speaker (¶45 and ¶48, creating voiceprint models from audio files in repository 207 stored in association with an agent identification number, where each audio file has speaker segment clusters that belong to the customer service agent and speaker segment clusters that belong to customers), establishing the identity based on the matching fingerprint (¶39, speaker diarization 114 tags each homogeneous speaker audio data segment as the speaker identified in the metadata), and retrieving identification information related to the given speaker from a network accessible source to assign the identification information to the given set of voice segments associated with the given speaker (¶51 and ¶59, SST server performs diarization using speaker specific voiceprint models stored in a centralized storage database).
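For illustration only, the fingerprint comparison discussed above can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of Sidi or of the claimed invention: a voiceprint is assumed to be a fixed-length vector, the comparison is assumed to be cosine similarity with a hypothetical `min_score` cutoff, and the `voiceprints` store, the `directory` of identification information, and the function names are hypothetical stand-ins for a network accessible source.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def identify_cluster(cluster_vec, voiceprints, directory, min_score=0.8):
    """Compare one speaker cluster against a series of stored voiceprints.

    cluster_vec: vocal-characteristic vector for a set of voice segments.
    voiceprints: hypothetical mapping speaker_id -> stored voiceprint vector.
    directory:   hypothetical mapping speaker_id -> identification info.
    Returns the identification record of the best match, or None when no
    stored voiceprint scores at least `min_score`.
    """
    best_id, best_score = None, min_score
    for speaker_id, stored in voiceprints.items():
        score = cosine(cluster_vec, stored)
        if score >= best_score:
            best_id, best_score = speaker_id, score
    if best_id is None:
        return None
    # Attach the retrieved identification information to the match.
    return {"speaker_id": best_id, "score": best_score, **directory.get(best_id, {})}
```

The returned record could then be assigned to every segment in the matched set, which corresponds to tagging the segments with the identified speaker.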
It would’ve been obvious to one ordinarily skilled in the art before the effective filing date of the invention to modify Beaumont to compare audio features in voice segments to fingerprints in a data storage system as taught by Sidi in order to perform speaker identification (Sidi, ¶15, enhancing speech transcription by diarization wherein a speaker identity is identified and Beaumont, ¶29, speaker identification).
Beaumont-Sidi combination does not disclose determining a second speaker identification parameter by generating, based on the voice data associated with the given set 
Ramaswamy teaches a conversational system (Col 4, Rows 45-50) comprising an acoustic and biometric verifier for identifying and verifying an identity of a user (Col 4, Rows 51-55) by determining a first speaker identification parameter by establishing a vocal characteristic of a given speaker based on voice data associated with the speaker (Col 8, Rows 35-40, acoustic scoring; in view of Col 4, Rows 55-59, matching acoustic signature to known acoustic signature of a user) and a second speaker identification parameter by generating a language model that specifies terms and/or phrases used by the given speaker in the conversation based on the voice data associated with the given speaker (Col 5, Rows 6-20 and Rows 22-38, use language model to generate LM score characterizing the user with respect to the choice of words and phrases often used and incorporate the LM score as a feature in the feature vector), and determining an identity of the given speaker by comparing the first speaker identification parameter to a series of fingerprints to discover a matching fingerprint (Col 4, Rows 55-59, matching an acoustic signature of the person to a known acoustic signature of the user in an acoustic verification process) and establishing the identity based on the matching fingerprint and the language model (Col 7, Rows 48-67, using the language model features to construct a behavior model; Col 8, Rows 13-29, using the behavior model to calculate behavior score; Col 8, Rows 30-57, mix the behavior score with an acoustic / biometric score to calculate overall score Ptotal and compare Ptotal(t) to threshold Pth to verify the identity of the user).
It would’ve been obvious to one ordinarily skilled in the art before the effective filing date of the invention to modify Beaumont-Sidi to, in combination with fingerprint / voiceprint matching, identify an identity of a speaker by comparing fingerprint speaker identification parameters to fingerprints of known users and establishing (i.e., verifying) the identity of the speaker provided by the matching fingerprint according to a language model identification parameter specifying terms / phrases used by the given speaker in order to distinguish a user from imposters (Ramaswamy, Col 8, Rows 50-54; Sidi, ¶42, using both an acoustic voiceprint model and a linguistic model to identify errors in speaker separation phases to highlight portions of audio data within which the two models disagree and providing for more detailed analysis on those areas to arrive at the correct speaker labeling).
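For illustration only, the mixing of a behavior (language model based) score with an acoustic score and the comparison of the overall score Ptotal against the threshold Pth, as discussed above for Ramaswamy, can be sketched as follows. This is a minimal sketch under stated assumptions: the linear weighting, the weight `alpha`, the default threshold, and the function names are hypothetical and are not taken from the reference, which is cited above only for mixing the scores and comparing the result to a threshold.

```python
def total_score(acoustic_score, behavior_score, alpha=0.7):
    """Mix an acoustic score with a behavior score into one overall score.

    Assumption: both scores lie in [0, 1] and are combined as a weighted
    average, with `alpha` the weight given to the acoustic score.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return alpha * acoustic_score + (1.0 - alpha) * behavior_score


def is_verified(acoustic_score, behavior_score, p_th=0.6, alpha=0.7):
    """Return True when the overall score meets the threshold p_th."""
    return total_score(acoustic_score, behavior_score, alpha) >= p_th
```

Under these assumptions, a speaker whose voice matches a stored fingerprint but whose choice of words and phrases departs from the stored language model can still fall below the threshold, which is the sense in which the language model verifies the identity provided by the matching fingerprint.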
Regarding Claim 5, Beaumont discloses wherein the recording further includes video data and generating, using the video data, a facial image of the speaker as another speaker-identification parameter (¶17, using facial recognition technology to aid in the detection and identification of primary speaker; see also Woodward, Col 3, Rows 34-37, using video patterns (facial) and pre-recorded audio patterns together to identify speakers).
Regarding Claim 22, Ramaswamy discloses determining identity of speaker based on matching language model by comparing the language model of the given speaker with multiple language models of a group of speakers stored at a storage system to identify a matching language model (Col 5, Rows 22-29 in view of Col 3, Rows 55-57 and Col 4, Rows 18-20, speech recognition engine implemented by processor and memory uses a set of language models to perform recognition by generating language model scores, where some of the models may be personalized to a given user (built using words and phrases spoken frequently by a given user)), and determining the identity of the given speaker based on identity associated with the matching language model (Col 5, Rows 28-44, the LM scores carry information characterizing the user’s choice of words / phrases often used and are therefore used as features to generate behavior scores and periodically checked against a threshold to determine if the user is an imposter or not verified; see Col 7, Rows 48-53 and Col 8, Rows 1-12 and Rows 47-50).
Claims 6-8 are rejected under 35 USC 103(a) as being unpatentable over Beaumont et al. (US 2015/0088515 A1) in view of Sidi et al. (US 2015/0025887 A1) and Ramaswamy et al. (US 6490560 B1) as applied to claim 1, in further view of Woodward et al. (US 8983836 B2).
Regarding Claim 6, Beaumont does not disclose comparing facial image of given speaker with facial images of multiple speakers to find a matching image and determine identity of the given speaker based on identification information associated with the matching image.
Woodward discloses a system configured to identify a given set of voice segments associated with a given speaker of multiple speakers in a recording of voice data, video data, and metadata regarding speakers (Col 3, Rows 1-5, automatic speech recognition system processing audio / multimedia content in real-time; Col 3, Rows 10-25, a pre-processor analyzes multimedia audio and video to identify segments / homogenous regions with the same speaker), determine multiple speaker identification parameters by analyzing the voice data, the video data, and/or the metadata associated with the given set of voice segments where the multiple speaker identification parameters are representative of audio features or video features that are used to identify the given speaker (Col 3, Rows 39-54, for example facial recognition analysis of video data for the segment may be used to identify facial features of speakers in the video frame; audio pattern matching analysis on segments to identify sources of sounds and noises, and metadata analysis looks at metadata associated with multimedia content like time / date information), determining an identity of the given speaker based on a presence of at least one of the multiple speaker identification parameters (Col 3, Rows 58-62, both image data and audio pattern information for performing identifications of individuals in a segment may be retrieved from private and public social network service sources), and retrieving identification information related to the given speaker from a network accessible source (Col 4, Rows 1-20, through audio, video, and metadata based analysis perform identification of the speakers in the segment to retrieve user profile information corresponding to the identified speakers from the social network service). 
In particular, Woodward discloses retrieving, using the metadata, one or more facial images of the multiple speakers from one or more sources (Col 3, Rows 39-41, facial recognition analysis of the video data for the segment may be used to identify the facial features of speakers in the video frame), comparing the facial image of the given speaker with one or more facial images of the multiple speakers to find a matching image and determining identity of the given speaker based on identification information associated with the matching image (Col 3, Rows 40-45, identified facial features of the speakers are compared to pictures obtained from social network service sources to identify particular individuals within the video frame / video segment).
Beaumont requires when a primary speaker has been identified, one or more actions on the basis of this identification includes highlighting the identified primary speaker’s name in a web conferencing application (¶42).
It would’ve been obvious to one ordinarily skilled in the art before the effective filing date of the invention to implement Beaumont’s speaker identification analysis (Beaumont, ¶29, more robust analyses, e.g., speaker identification) by analyzing the voice data, the video data, and the metadata to determine multiple speaker identification parameters to retrieve identification information related to given speaker from a network accessible source to identify segments with the same speaker speaking (Woodward, Col 3, Rows 15-20) and to identify the speaker through facial recognition, audio pattern matching, and metadata analysis (Woodward, Col 3, Rows 30-35) to gather user profile information like the identified primary speaker’s name in a web conferencing application to implement the assigning of identification information of the given speaker to the given set of voice segments associated with the given speaker (Woodward, Col 4, Rows 18-20; Beaumont, ¶42, determine and highlighting identified primary speaker’s name). 
Regarding Claim 7, Woodward discloses wherein retrieving the one or more facial images includes: retrieving, using an email identification (ID) or a name of the given speaker from the metadata, the one or more facial images of the multiple speakers from a social networking service (Col 3, Rows 47-54 and Col 4, Rows 15-20, use metadata analysis to obtain identifiers of names of persons in the multimedia content and to obtain user profile information corresponding to the identified speakers).  
Regarding Claim 8, Woodward discloses wherein retrieving the one or more facial images includes: retrieving, using an email identification (ID) or a name of the speaker Col 3, Rows 47-54, metadata analysis to extract identifiers of names of persons in the multimedia content; Col 4, Rows 13-17, based on the identification of the speakers in the segment through metadata based analysis to retrieve user profile information of the identified speaker from social network service, which may be private (such as organization) per Col 3, Rows 60-61).  
Claims 23-25, 28, 30, and 32 are rejected under 35 USC 103(a) as being unpatentable over Beaumont et al. (US 2015/0088515 A1) in view of Sidi et al. (US 2015/0025887 A1) and Woodward et al. (US 8983836 B2).
Regarding Claim 23, Beaumont discloses a non-transitory computer readable storage medium storing instructions that are executable by a speaker identification system (¶29, using speaker recognition techniques to analyze audio data to disambiguate human speech from background noises and perform more robust analyses such as speaker identification) comprising: 
receiving real time web conferencing data of a conversation between multiple speakers (¶28, capturing audio and video data; ¶33, performing primary speaker identification in real time; ¶42, web conferencing application where users communicate voice data and video feeds), wherein the real time web conferencing data includes voice data, video data, and metadata that provides information regarding at least one of the multiple speakers (¶29 and ¶36, audio/video data of a crowded audio environment where there are at least two human speakers; ¶32, video data and audio data contain time stamps; ¶33, audio data contains therein directionality information related to a speaker that may be leveraged in the analysis of video data); 
¶37 and ¶39, identifying a first portion of audio + video with aligned timestamps corresponding to a first primary speaker and identifying a second portion of audio + video with aligned timestamps corresponding to a second primary speaker; ¶42, isolating the primary speaker’s audio data input from other speakers / noise); 
identifying a given set / group of voice segments associated with a given speaker of the multiple speakers (¶42, isolating the primary speaker’s audio data input from other speakers / noise);
determining a speaker identification parameter by analyzing the voice data and video data associated with the given set of voice segments (¶31-33 and ¶37-39, directionality information in audio data and timestamp information in audio + video data are used to match audio data of one speaker with video data of the same speaker and to perform speaker identification), wherein a first speaker-identification parameter of the multiple speaker-identification parameters is representative of an audio feature to be used to identify the given speaker, and wherein a second speaker identification parameter of the multiple speaker-identification parameters is representative of a video feature to be used to identify the given speaker (¶37 and ¶39, in addition to speaker recognition to identify a portion of audio data containing potential human speaker, video data containing visual features associated with speech are additionally used to identify the first primary speaker as distinct from another primary speaker and as non-machine generated audio features); 
¶39, identifying a primary speaker for each portion of matched audio data and video data; ¶42, highlighting the identified primary speaker’s name in a web conferencing application).
Beaumont does not disclose wherein each voice segment corresponds to a portion of the recording that includes a voice of one of the multiple speakers, a first voice segment including the voice of one of the multiple speakers in a conversation with a first customer and a second voice segment including the voice of one of the multiple speakers in a conversation with a second customer.
Sidi discloses a server computer (¶20 and ¶59, Fig. 3, computer system 300 / centralized server) generating multiple sets of voice segments from a recording that includes voice data and metadata (¶27-28, audio data 102 comprises real time streaming audio data and metadata 108 / identification data) wherein each voice segment corresponds to a portion of the recording that includes a voice of one of the multiple speakers, a first voice segment including the voice of one of the multiple speakers in a conversation with a first customer and a second voice segment including the voice of one of the multiple speakers in a conversation with a second customer (¶36 and ¶¶38-39, cluster speech segments into groups of speech segments having the same speaker, compare the clustered groups to speech models of known agents or customers, and identify homogeneous speaker segments as customer service agent or customers; ¶43, identify and separate speakers in audio data that contain one customer service agent and two customers; i.e., in the clustered group of speech segments tagged as customer service agent, there is at least a first speech segment of the agent speaking to a first one of the two customers and at least a second speech segment of the agent speaking to a second one of the two customers). 
Beaumont requires speaker recognition identifying portions of the audio data containing potential human speakers while not knowing which is a human speaker (¶37) and thereafter requires highlighting identified primary speaker’s name in web conferencing application (¶42). 
Sidi provides a process of blind diarization separating speaker segments in audio data of one customer service agent and two customers (Sidi, ¶35 and ¶43) similar to the speaker recognition portion of Beaumont and thereafter performing speaker diarization separating audio data into homogeneous speech segments comprising at least a first speech segment of the agent speaking to a first customer and a second speech segment of the agent speaking to the second customer (Sidi, ¶38-39) that fulfills Beaumont’s need to identify a speaker’s name (Beaumont, ¶42). 
Therefore, it would’ve been obvious to one ordinarily skilled in the art before the effective filing date of the invention to implement the speaker diarization process of Sidi to identify first voice segment of the voice of the agent in a conversation with a first customer and a second voice segment of the voice of the agent in another conversation with the second customer in order to identify at least one speaker in the audio data (Sidi, ¶43; compare Beaumont, ¶42, identify the primary speaker’s name).
Beaumont does not disclose determining an identity of the speaker by comparing the first speaker-identification parameter to data from a first information system that internal to 
Woodward discloses a system for captioning audio and multimedia content using acoustic profiles derived from social network sources in real time (Col 2, Row 63 – Col 3, Row 3) where the system is configured to identify a given set of voice segments associated with a given speaker of multiple speakers in a recording of voice data, video data, and metadata regarding speakers (Col 3, Rows 1-5, automatic speech recognition system processing audio / multimedia content in real-time; Col 3, Rows 10-25, a pre-processor analyzes multimedia audio and video to identify segments / homogenous regions with the same speaker), 
determining an identity of the speaker by comparing a first speaker-identification parameter representative of an audio feature to be used to identify the given speaker to data from a first information system that is internal to an organization where the speaker identification system is deployed (Col 3, Rows 13-17 and Rows 23-26, analyze audio track to identify segments or homogeneous regions with the same speaker speaking with a same background noise along a time line of the audio track; Col 3, Rows 27-33, perform audio pattern matching analysis on the segments to perform identification analysis; Col 3, Rows 58-62, the audio pattern information for performing the identification of individuals in segments may be retrieved from private (such as organization) social network service sources), 
Col 3, Rows 39-44, 58-61, and Col 4, Rows 2-12, perform facial recognition analysis of the video data for the segment to identify facial features of speakers in the video frame and compare to pictures or image data obtained from social network service sources to identify the particular individuals, the social network service source may be web based for user to interact over the internet (e.g., Facebook, Twitter, etc.)), and 
assigning identification information of the given speaker to the given group of voice segments associated with the given speaker (Col 3, Rows 27-30, perform identification analysis on each of the segments to identify the speaker in the particular segment; Col 10, Rows 15-20, correlate the identified speakers with user profiles in a social network service system).
Beaumont requires when a primary speaker has been identified, one or more actions on the basis of this identification includes highlighting the identified primary speaker’s name in a web conferencing application (¶42).
It would’ve been obvious to one ordinarily skilled in the art before the effective filing date of the invention to implement Beaumont’s speaker identification analysis (Beaumont, ¶29, more robust analyses, e.g., speaker identification) according to the specific implementation of Woodward by analyzing the voice data, the video data, and the metadata to determine multiple speaker identification parameters to retrieve identification information related to the given speaker from multiple network accessible sources to identify segments with the same speaker speaking (Woodward, Col 3, Rows 15-20) and to identify the speaker through facial recognition, audio pattern matching, and metadata analysis (Woodward, Col 3, Rows 30-35). For example, user profile information like the identified primary speaker’s name may be gathered for a web conferencing application by assigning identification information of the given speaker to the given set of voice segments associated with the given speaker (Woodward, Col 4, Rows 18-20; Beaumont, ¶42, determining and highlighting the identified primary speaker’s name). 
Regarding Claim 24, Sidi discloses wherein the speakers include one or more representatives of an organization and one or more customers of the organization with whom the one or more representatives are conversing (¶43, audio data containing one customer service agent and two customers).
Regarding Claim 25, Sidi discloses wherein the instructions for determining the multiple speaker-identification parameters include: instructions for using the multiple groups of voice segments as one of the multiple speaker-identification parameters to identify a representative of the one or more representatives (¶39, identify customer service agent by comparing homogeneous speaker audio segments to voiceprint models). 
Regarding Claim 28, Beaumont discloses wherein the instructions for determining the multiple speaker-identification parameters include: instructions for generating, using the audio data and the video data, any of a transcript of the conversation, a facial image of the given speaker, an image of a setting at which the given speaker is located, or text data extracted from the video data as the speaker-identification parameters (¶17, using video data to perform facial recognition and correlate detected human faces with moving lips consistent with speaking with a voice stream during audio analysis; see also Woodward, Col 3, Rows 30-36, perform facial recognition, audio pattern matching, and metadata analysis in a combined audio visual speaker identification in which video patterns and pre-recorded audio patterns are used together to identify speakers).  
Regarding Claim 30, Beaumont does not disclose wherein the instructions for determining the identity include: instructions for comparing a facial image of the given speaker obtained from the video data with one or more facial images of the multiple speakers obtained from an image source to find a matching image, and instructions for determining the identity of the given speaker based on identification information associated with the matching image.  
Woodward discloses retrieving, using the metadata, one or more facial images of the multiple speakers from one or more sources (Col 3, Rows 39-41, facial recognition analysis of the video data for the segment may be used to identify the facial features of speakers in the video frame), comparing the facial image of the given speaker obtained from the video data with one or more facial images of the multiple speakers from an image source to find a matching image, and determining identity of the given speaker based on identification information associated with the matching image (Col 3, Rows 40-45, identified facial features of the speakers are compared to pictures obtained from social network service sources to identify particular individuals within the video frame / video segment).
It would’ve been obvious to one ordinarily skilled in the art before the effective filing date of the invention to modify Beaumont to compare the facial image of the speaker with the one or more facial images of the speakers to find a matching image in order to allow a user to obtain identification information of meeting participants in a conference (Woodward, Col 4, Rows 15-20).
Regarding Claim 32, Beaumont discloses wherein the real-time call data includes an online-based video conference meeting between the speakers (¶33 and ¶42, real time primary speaker identification / name highlighting in a web conferencing application).  
Claim 31 is rejected under 35 USC 103(a) as being unpatentable over Beaumont et al. (US 2015/0088515 A1) in view of Sidi et al. (US 2015/0025887 A1) and Woodward et al. (US 8983836 B2) as applied to Claim 23, in further view of Redfern (US 2014/0330566 A1) and Jing et al. (US 8131118 B1).
Regarding Claim 31, Beaumont does not disclose wherein determining the identity includes: retrieving images of multiple settings from one or more sources, each of the images of the settings associated with information that can be used to identify one or more of the multiple speakers, comparing the image of the setting with the images of the settings to find a matching image, and determining the identity of the given speaker based on information associated with the matching image.  
Redfern discloses determining identification information based on settings associated with information that can be used to identify one or more speakers in a conversation (Abstract and ¶18, using location of an individual as context information to access a social graph to identify the identity of a speaker).
It would’ve been obvious to one ordinarily skilled in the art before the effective filing date of the invention to modify Beaumont-Woodward to determine the identification information by retrieving information of multiple settings from one or more sources, each of the settings associated with information that can be used to identify one or more speakers in order to use the settings as context information to determine the identity of an individual (Redfern, ¶18; see Woodward, Col 12, Rows 13-25 in view of Col 3, Rows 50-55, metadata of video and audio content includes geographical locations, time / date information to indicate organization /event associated with the multimedia content to pre-filter user profiles having some affiliation with the organization or event).
Further, Jing discloses a system for determining the location / setting where an image was captured (Abstract) by retrieving images of multiple settings from one or more sources and comparing a captured image of a setting with the retrieved images of the settings to find a matching image (Col 1, Rows 35-45, capture one or more images from a location and compare the submitted images to images in an image library to identify matches).
 It would’ve been obvious to one ordinarily skilled in the art before the effective filing date of the invention to modify Beaumont to capture images of a setting and compare the captured images of the setting with retrieved images of multiple settings in a library to provide location information of the captured images (Jing, Col 1, Rows 51-53). In this manner, Beaumont-Woodward modified according to Redfern may use the location information (Woodward, Col 3, Rows 50-51) associated with the matching image as context information (i.e., each of the images of the settings associated with information that can be used to identify one or more of the speakers) to determine identification information of a speaker (Redfern, ¶18).
Claims 15-18 are rejected under 35 USC 103(a) as being unpatentable over Beaumont et al. (US 2015/0088515 A1) in view of Sidi et al. (US 2015/0025887 A1) and Ramaswamy et al. (US 6490560 B1) as applied to Claim 1, in further view of Gorthi et al. (US 9462102 B1).
Claim 29 is rejected under 35 USC 103(a) as being unpatentable over Beaumont et al. (US 2015/0088515 A1) in view of Sidi et al. (US 2015/0025887 A1) and Woodward et al. (US 8983836 B2) as applied to Claim 23, in further view of Gorthi et al. (US 9462102 B1).
Regarding Claim 15, Beaumont does not disclose extracting, from the video data associated with the given speaker, text data that serves as another of the multiple speaker-identification parameters.
Gorthi discloses a system recording voice and video calls / conversation between participants (Abstract) to determine multiple speaker identification parameters by (Claim 15) extracting text data / transcript as one of the speaker-identification parameters from video data associated with a given speaker and analyzing the transcript of the conversation to determine the identity of the given speaker (Col 3, Rows 60-61, obtain media sample of video call; Col 4, Rows 18-20, utilize speech recognition to transcribe audio segment of the call sample; Col 5, Rows 10-15, compare words within the transcript with names of existing contacts to identify a caller as an existing contact);
(Claims 16 and 29) wherein determining the identity of the given speaker from the conversation including (a) the identity of the speaker provided by one of the given speakers and (b) information that can be used to derive the identity (Col 4, Rows 44-50, call context program 114 stores the recorded and transcribed content of the media sample as call metadata) includes: comparing the text data with information of multiple speakers from metadata to find a matching speaker, and determining the identity of the given speaker based on information associated with the matching speaker (Col 5, Rows 10-15, call context program 114 compares words within the transcript with names of existing contacts of the user to determine whether a speaker is an existing contact);  
(Claim 17) wherein comparing the text data includes: comparing the text data with contact information of the multiple speakers available from the metadata to find the matching speaker (Col 5, Rows 10-15, call context program 114 compares words within the transcript with names of existing contacts of the user to determine whether a speaker is an existing contact);
(Claim 18) wherein comparing the text data includes: comparing the text data with contact information of the multiple speakers available from an employee directory of an organization, or from a source external to the organization to find the matching speaker (Col 5, Rows 10-15, a user’s existing contacts).
It would’ve been obvious to one ordinarily skilled in the art before the effective filing date of the invention to modify Beaumont to extract, from the video data associated with the speaker, text data as one of the speaker-identification parameters in accordance with Gorthi in order to create or edit contact information associated with participants of a video call based on transcribed content (Gorthi, Col 1, Rows 35-38).
Claims 26-27 are rejected under 35 USC 103(a) as being unpatentable over Beaumont et al. (US 2015/0088515 A1) in view of Sidi et al. (US 2015/0025887 A1) and Woodward et al. (US 8983836 B2) as applied to Claim 25, in further view of Tritschler et al. (US 6424946 B1).
Beaumont does not disclose wherein the instructions for determining the identity include: instructions for comparing each group of voice segments with multiple fingerprints of multiple representatives of an organization stored in a data storage system to identify a matching fingerprint, and instructions for determining a speaker associated with a group of voice segments that matched with the matching fingerprint as a representative associated with the matching fingerprint.  
Tritschler discloses a system for identifying speakers participating in an audio-video source (Abstract and Col 4, Rows 1-11) comprising generating a fingerprint with a user (Col 6, Rows 26-40, compute feature vectors from audio of a speaker to be identified; Col 6, Row 64 – Col 7, Row 30, for unenrolled / unknown speaker, apply the feature vectors in a clustering process to merge the feature vectors with previously identified clusters or to create a new cluster) and storing the fingerprint in a storage system (Col 7, Rows 25-30, assign a cluster identifier and record the cluster in a speaker turn database 300).
(Claim 26) When identifying speakers from real time call data (Col 3, Rows 36-39 and Col 4, Rows 27-30), the system compares a group of voice segments associated with a speaker with multiple fingerprints of a group of speakers stored in a data storage system (Col 6, Row 65 – Col 7, Row 9 and Col 8, Rows 16-26, segmentation process for identifying segment boundary between non-homogeneous speech portions to extract homogenous segment corresponding to a single speaker; Col 11, Rows 30-40, speaker identification process 700 receives turns identified by the segmentation process 600 together with the feature vectors to compare the segment utterances to speaker database 420 and finds the closest speaker), identifying one of the fingerprints that matches with one or more segments of the group of voice segments to determine a speaker associated with the group of voice segments that matched with the matching fingerprint as a speaker associated with the matching fingerprint (Col 11, Rows 30-40, find the “closest” speaker from the speaker database 420), (Claim 27) identifying the group of voice segments in the real-time call data at which the speaker spoke during the conversation using the matching fingerprint (Col 11, Rows 30-40 in view of Col 7, Rows 63-65, identifying speaker name of the “closest speaker”), and assigning identification information associated with the matched speaker to the group of voice segments (Col 11, Rows 30-40, assign the “closest” speaker from the speaker database 420 as label to each segment).
It would’ve been obvious to one ordinarily skilled in the art before the effective filing date of the invention to modify Beaumont to compare audio features in voice segments to fingerprint in a data storage system as taught by Tritschler in order to implement a speaker identification system (Tritschler, Col 7, Rows 30-39 and Beaumont, ¶29, speaker recognition engine). As a result, it is possible to match a speaker of the group of voice segments with a fingerprint of a representative of an organization (Young, Abstract, call center agent) stored in a data storage system.
Claim 33 is rejected under 35 USC 103(a) as being unpatentable over Beaumont et al. (US 2015/0088515 A1) in view of Sidi et al. (US 2015/0025887 A1) and Woodward et al. (US 8983836 B2) as applied to Claim 23, in further view of Lamb et al. (US 2014/0006026 A1).
Beaumont does not disclose wherein real-time call data includes any of a virtual reality-based or augmented reality-based conversation between the multiple speakers.
Lamb discloses an augmented-virtual reality system for generating enhanced audio signals for a head mounted display device in an augmented reality environment where an end user views a real-world environment along with projected images of virtual objects (Abstract and ¶35). The system differentiates between different sound sources based on context (Abstract and ¶84).
It would’ve been obvious to one ordinarily skilled in the art before the effective filing date of the invention to modify Beaumont to perform speaker identification in a virtual reality-based or augmented reality-based conversation between speakers to provide an augmented real-world environment where perception is augmented with computer generated (Lamb, ¶1) such as generating enhanced audio signals such that one or more sound levels corresponding with sounds received from one or more sources of sound within an environment may be dynamically adjusted based on contextual information (Lamb, Abstract).
Claims 36-38 are rejected under 35 USC 103(a) as being unpatentable over Beaumont et al. (US 2015/0088515 A1) in view of Sidi et al. (US 2015/0025887 A1), Kashtan et al. (US 2016/0275952 A1), Farrell (US 9860282 B2) and Patel et al. (US 2008/0088698 A1).
Regarding Claim 36, Beaumont discloses a computer-implemented method comprising: 
receiving a recording (¶28, capturing audio and video data) of a conversation between multiple speakers, wherein the recording includes voice data, video data (¶1, ¶42, web conferencing where users communicate voice data and video feeds) and metadata that provides information regarding at least one speaker of the multiple speakers (¶29 and ¶36, audio/video data of a crowded audio environment where there are at least two human speakers; ¶32, video data and audio data contain time stamps; ¶33, audio data contains therein directionality information related to a speaker that may be leveraged in the analysis of video data); 
generating multiple voice segments by analyzing the voice data, wherein each voice segment of the multiple voice segments represents a portion of the voice data that corresponds to a single speaker (¶37 and ¶39, identifying a first portion of audio + video with aligned timestamps corresponding to a first primary speaker and identifying a second portion of audio + video with aligned timestamps corresponding to a second primary speaker; ¶42, isolating the primary speaker’s audio data input from other speakers / noise); 
forming multiple sets of voice segments by grouping the multiple voice segments by speaker, wherein each set of voice segments is associated with a different speaker of the multiple speakers (¶39, when two or more human speakers take turns talking, for a first match, identify a first primary speaker by matching audio data with video data containing visual features associated with speech followed by identifying another primary speaker by matching subsequent portion of audio data and video data; ¶42, isolating the identified primary speaker’s audio data input from other speakers / noise; i.e., isolating the first primary speaker’s audio data from the second primary speaker’s audio data when two or more human speakers take turns talking means isolating audio data inputs from the first primary speaker into a first group and isolating audio data inputs from the second primary speaker into a second group); 
generating, based on the video data, a facial image of a speaker for the purpose of establishing an identity of the speaker (¶38-39, capturing and matching video data with visual features such as moving mouth and lip with audio data to identify a first primary speaker followed by identifying a second primary speaker by matching subsequent portion of audio data and video data).
Beaumont does not disclose wherein each voice segment corresponds to a portion of the recording that includes a voice of one of the multiple speakers, a first voice segment including the voice of one of the multiple speakers in a conversation with a first customer and a second voice segment including the voice of one of the multiple speakers in a conversation with a second customer.
Sidi discloses a server computer (¶20 and ¶59, Fig. 3, computer system 300 / centralized server) generating multiple sets of voice segments from a recording that includes voice data and metadata (¶27-28, audio data 102 comprises real time streaming audio data and metadata 108 / identification data) wherein each voice segment corresponds to a portion of the recording that includes a voice of one of the multiple speakers, a first voice segment including the voice of one of the multiple speakers in a conversation with a first customer and a second voice segment including the voice of one of the multiple speakers in a conversation with a second customer (¶36 and ¶¶38-39, cluster speech segments into groups of speech segments having the same speaker, compare the clustered groups to speech models of known agents or customers, and identify homogeneous speaker segments as customer service agent or customers; ¶43, identify and separate speakers in audio data that contain one customer service agent and two customers; i.e., in the clustered group of speech segments tagged as customer service agent, there is at least a first speech segment of the agent speaking to a first one of the two customers and at least a second speech segment of the agent speaking to a second one of the two customers). 
Beaumont requires speaker recognition identifying portions of the audio data containing potential human speakers while not knowing which is a human speaker (¶37) and thereafter requires highlighting identified primary speaker’s name in web conferencing application (¶42). 
Sidi provides a process of blind diarization separating speaker segments in audio data of one customer service agent and two customers (Sidi, ¶35 and ¶43) similar to the speaker recognition of Beaumont and thereafter performing speaker diarization separating audio data into homogeneous speech segments comprising at least a first speech segment of the agent speaking to a first customer and a second speech segment of the agent speaking to the second customer (Sidi, ¶38-39) that fulfills Beaumont’s need to identify a speaker’s name (Beaumont, ¶42). 
Therefore, it would’ve been obvious to one ordinarily skilled in the art before the effective filing date of the invention to implement the speaker diarization process of Sidi to identify a first voice segment of the voice of the agent in a conversation with a first customer and a second voice segment of the voice of the agent in another conversation with the second customer in order to identify at least one speaker in the audio data (Sidi, ¶43; compare Beaumont, ¶42, identify the primary speaker’s name).
Beaumont does not disclose obtaining contact information for each speaker of multiple speakers to retrieve at least one facial image of each speaker of the multiple speakers from a social networking service.
Kashtan discloses determining identity of a current speaker by obtaining metadata comprising contact information and image features for the current speaker and comparing metadata with metadata associated with stored audio fingerprints to facilitate speaker recognition (¶107-108; in view of ¶44, metadata may include contact information for the current speaker).
Farrell teaches a cloud service provider configured to use image data of a detected face of a person of interest appearing in video to verify an identity of the person of interest (Col 4, Rows 6-18) by sending a request to the cloud service provider comprising an image feature (Col 3, Rows 31-56); the cloud service provider retrieves at least one facial image of the person of interest from a social networking service (Col 6, Rows 6-18, cloud service provider performs face detection on the image feature attached to the request and identifies the person of interest from a large online database of profile pictures of registered users; in view of Col 2, Rows 55-60 and Col 4, Rows 65-67, cloud service provider operates on behalf of one or more of social network provider 150 and content provider 160) wherein each person of interest has corresponding contact information associated with a profile from which the at least one facial image is retrieved (Col 4, Rows 19-25, profile information associated with that person comprises a communications address of the person or alternative contact addresses). 
The cloud service provider compares the facial image of the speaker to the facial images retrieved from the social networking service to find a matching image and establishes the identity of the speaker based on identification information associated with the matching image (Col 4, Rows 6-18, and Col 10, Rows 30-41, cloud service provider uses image data of the detected face of the person of interest appearing in the video / image to verify an identity of the person of interest by implementing a face recognition system based on convolutional neural networks to identify the person of interest from a large online database of profile pictures of registered users).
Kashtan suggests obtaining a request for identifying a current user may comprise obtaining metadata such as contact information and image features of the current user (¶107-108, computer system 310 receives metadata generated by client devices 301-305; in view of ¶44, metadata comprises contact information for current speaker and image features of current speaker) and comparing the metadata to metadata associated with stored audio fingerprints (¶108). Farrell suggests such contact information / metadata associated with stored audio fingerprints are associated with profile pictures of the current user (Col 4, Rows 15-25). 
It would’ve been obvious to one ordinarily skilled in the art before the effective filing date of the invention to implement Beaumont’s speaker identification analysis (Beaumont, ¶29, more robust analyses, e.g., speaker identification) to generate a request by identifying a person of interest by detecting human face image features (Beaumont, ¶17; Farrell, Col 5, Rows 56-67, face detection module detects a face and generates a bounding box) and contact information as speaker identification parameters / descriptive metadata (Kashtan, ¶44, ¶107) and determine an identity of the speaker by comparing the descriptive metadata / contact information to metadata / contact information associated with social network profiles as taught by Farrell to extract face image features of the person of interest stored in the social network profiles for comparison with the face image feature in the descriptive metadata in order to assign identification information indicative of the identity of the speaker to a set of voice segments associated with the speaker (Beaumont, ¶42, determine identified primary speaker’s name in order to be highlighted; Kashtan, ¶108, compare received descriptive metadata with metadata associated with stored audio fingerprints to facilitate speaker recognition; Farrell, Col 4, Rows 6-18, obtain social network profile pictures for comparison with detected face of person of interest to verify an identity of the person of interest). 
The combination does not disclose obtaining contact information for each speaker of the multiple speakers that is extracted or derived from corresponding invitations to the conversation.
Patel teaches a video conferencing system obtaining contact information for each speaker of multiple speakers for a video conference that is extracted or derived from corresponding invitations to the video conference (¶32, video conferencing system where a video conference participant dials into a conference session; ¶34, inviting guests, vendors, or contractors to participate and requests / queries each such participant to provide contact information (e.g., name, street address, email address, phone number) and to capture each participant’s facial image for entry into a corporate database).
It would’ve been obvious to one ordinarily skilled in the art before the effective filing date of the invention to derive or extract contact information for each speaker from the corresponding invitation to the conversation in order to allow the system to add new entries corresponding to new participants (Patel, ¶34).
  Regarding Claim 37, Beaumont discloses wherein the identification information is retrieved from the social networking service (¶19, database may store facial eigenvectors along with the profile of each person; ¶32, once the system matches or recognizes a person’s facial image in a meeting, the identity information of that person is attached to an object associated with that image).  
Regarding Claim 38, Kashtan discloses wherein the contact information is extracted or derived from the metadata, wherein the contact information includes an email identification or a name for each speaker of the multiple speakers (¶43-44, metadata identifying the current speaker includes name / identity of an individual and contact information for the current speaker; ¶146, when implemented in an e-mail application, the contact information would be an email address).

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to examiner Richard Z. Zhu whose telephone number is 571-270-1587 or examiner’s supervisor King Poon whose telephone number is 571-272-7440. Examiner Richard Zhu can normally be reached on M-Th, 0730-1700.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/RICHARD Z ZHU/
Primary Examiner, Art Unit 2675
01/13/2022