Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 1/25/2022 has been entered.
 
Response to Arguments
Applicant's arguments filed 1/25/2022 have been fully considered but they are not persuasive. 
The applicant contends:
Applicant respectfully submits that Diamant at least fails to anticipate Applicant's  claimed "defining one or more speaker representations based upon, at least in part, the acoustic metadata associated with the audio encounter information and the information associated with the acoustic environment, wherein the one or more speaker representations include spatial information and spectral information clustered together for each speaker" and/or 
"labeling one or more portions of the audio encounter information with the one or more speaker representations, including the spatial information and the spectral information clustered together for each speaker, and labeling the one or more portions of the audio encounter information with a speaker location within the acoustic environment for each speaker, wherein labeling the one or more portions of the audio encounter information with the one or more speaker representations and the speaker location within the acoustic environment includes determining where the speaker location within the acoustic environment is compared to a base of the first microphone system."
For example, while Diamant may or may not generally recite speaker representations and speaker location, Diamant is not understood to teach that the speaker representations include both spatial information and spectral information clustered together for each speaker. Moreover, even assuming arguendo that Diamant teaches "labeling" the audio encounter information with the broader interpretation of Applicant's claimed "speaker representations", Diamant is not onment, wherein the one or more speaker representations include spatial information and spectral information clustered together for each speaker" and 
"labeling one or more portions of the audio encounter information with the one or more speaker representations, including the spatial information and the spectral information clustered together for each speaker, and labeling the one or more portions of the audio encounter information with a speaker location within the acoustic environment for each speaker, 
wherein labeling the one or more portions of the audio encounter information with the one or more speaker representations and the speaker location within the acoustic environment includes determining where the speaker location within the acoustic environment is compared to a base of the first microphone system." 
Because of the foregoing discussion, inter alia, Applicant respectfully contends that Diamant at least fails to anticipate the claimed features under 35 U.S.C. § 102 because of the absence from Diamant of Applicant's claimed "defining one or more speaker representations based upon, at least in part, the acoustic metadata associated with the audio encounter information and the information associated with the acoustic environment, 
wherein the one or more speaker representations include spatial information and spectral information clustered together for each speaker" and/or 
"labeling one or more portions of the audio encounter information with the one or more speaker representations, including the spatial information and the spectral information clustered together for each speaker, and labeling the one or more portions of the audio encounter information with a speaker location within the acoustic environment for each speaker, wherein labeling the one or more portions of the audio encounter information with the one or more speaker representations and the speaker location within the acoustic environment includes determining where the speaker location within the acoustic environment is compared to a base of the first microphone system" similarly recited in independent claims 1, 8, and 15. Accordingly, Applicant respectfully requests that the rejection be withdrawn.

	The examiner disagrees. The interview summary and the Office action clearly explain the correlation between the recited limitations and the prior art reference. The applicant’s arguments merely repeat the recited limitations and provide no further explanation as to why the applicant believes the prior art reference fails to disclose them. Furthermore, the applicant’s remarks make only a general comment on the limitation “speaker representations” and fail to explain the distinction between the speaker representations as recited and the disclosure of the prior art reference as indicated in the Office action (both the previous action and the rejection below).

Claim Rejections - 35 USC § 102
The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale, or otherwise available to the public before the effective filing date of the claimed invention.

Claims 1-20 are rejected under 35 U.S.C. 102(a)(1) as being anticipated by Diamant et al (US Publication No.: 20190341050).
Claim 1, Diamant et al discloses
receiving information associated with an acoustic environment (Fig. 1a, label 100 shows an acoustic environment such as a conversation occurring around microphone system, label 108,106. Fig. 6, label face location machine outputs the locations of candidate faces of the individuals that are in the acoustic environment such as shown in Fig. 1a, label 100.);
receiving acoustic metadata associated with audio encounter information received by a first microphone system (Fig. 6, label 166,168 outputs metadata such as an image or video of the participant’s face (paragraph 31 discloses a camera that records a video including the participant’s face.), wherein such metadata is associated with the conversation or audio encounter information occurring as shown in Fig. 1a. Fig. 4, label 114 shows the metadata provided for the participants who are part of the acoustic 
defining one or more speaker representations (Fig. 4, label 114, 102a-c shows the video of the speakers. Fig. 4, label 114,166a-c shows the speaker representations, which include the image of the participant’s face as well as a frame or border around the participant’s face. Fig. 5 also shows that one or more speaker representations are generated by defining the names of the speakers, wherein each name is generated based on the image of the participant’s face with the frame or border around it. Fig. 7, label 606 shows the audio that accompanies the video or image of the person speaking. The frames, 604, of the audio signal indicate a representation of each speaker’s audio or speech.) based upon, at least in part, the acoustic metadata associated with the audio encounter information (Fig. 4, label 144 shows the one or more speaker representations are defined by label Face 1-3, wherein the image or video of the participant’s face is considered the metadata. Fig. 7, 604 indicates the audio of the person speaking that correlates with the image or video of the participant’s face. Such is considered metadata used to determine WHO is speaking.) and the information associated with the acoustic environment (Fig. 4, label 116a-c includes the location of the speaker in the acoustic environment shown in Fig. 1a. Fig. 9 shows how the degree or location of the speaker is determined. Fig. 7, label where and angle indicates the information associated with the acoustic environment such as the location of the speaker.),

labeling one or more portions of the audio encounter information with the one or more speaker representations (Fig. 7, label 608a-c shows the labels of one or more portions of the audio encounter information (label 606) or conversation occurring as shown in Fig. 1a. Label 608a-c includes the name of the speaker/participant, which are associated with the one or more speaker representations as shown in Fig. 5, label 116a-c, Fig. 4, label 114.) and a speaker location within the acoustic environment (label 608a-c includes a degree, which indicates the location of the speaker within the acoustic environment as shown in Fig. 1a, label 100. Fig. 9, label 106 shows the microphone system, label 102a-c shows the speakers or participants with corresponding degree to the microphone, indicating the location of the speaker.),

labeling the one or more portions of the audio encounter information with a speaker location within the acoustic environment for each speaker (Fig. 4, label 166, and Fig. 7, label 608, show the labeling of one or more portions of audio encounter information (such as the participant’s face and the participant’s speech or audio) with a speaker location (label angle, WHERE), within the acoustic environment (conversation occurring between participants shown in Fig. 1a, 9.)), 
wherein labeling the one or more portions of the audio encounter information (Fig. 7, label who,where,when indicates the labeling of the audio encounter information, labels 606) with the one or more speaker representations (Fig. 7, label Bob,Charlie,Alice indicates the name of the speaker or participants which is associated with corresponding image of the participant shown in Fig. 4,5.) and the speaker location within the acoustic environment (Fig. 7, label degree, indicates the location of the speaker. Fig. 9 shows the location as the degree from which the speaker is compared to the conference assistant, 106 that includes the microphones, 108.), the speaker location includes determining where the speaker location within the acoustic environment compared to a base of the first microphone system (Fig. 9 shows the location as the degree from which the speaker is compared to the conference assistant, 106 that includes the microphones, 108, where the conference system or diarization machine 132 is the base of the microphone system. Paragraph 45 discloses the location of the candidate face (such as degrees) from the diarization machine.).

Claim 3, Diamant et al discloses receiving visual metadata associated with one or more encounter participants within the acoustic environment (Fig. 1a shows the acoustic environment or area with the microphone where the conversation is occurring. Fig. 4, label 114 shows the visual metadata, 166a-c along with text based metadata of the participants/speakers within the acoustic environment shown in Fig. 1a.).
Claim 4, Diamant et al discloses defining the one or more speaker representations includes defining the one or more speaker representations based upon, at least in part, the visual metadata associated with the one or more encounter participants within the acoustic environment. (Fig. 5, label 116a-c shows the one or more speaker representations are defined by visual metadata or the image or clip from the video as shown in Fig. 4. Fig. 4, label 102a-c shows the one or more encounter participants, label 166a-c are the visual metadata associated with the one or more encounter participants within the acoustic environment, Fig. 1a, label 100.).
Claim 5, Diamant et al discloses receiving weighting metadata (Paragraph 32 discloses confidence levels are determined for the metadata, such as the image or 
Claim 6, Diamant et al discloses defining the one or more speaker representations (Fig. 3, label 166a-c,114, Fig. 5, label 166a-c) includes defining the one or more speaker representations based upon, at least in part, the weighting metadata associated with the audio encounter information received by the second microphone system. (Paragraph 32 discloses using the weighting metadata or confidence level to determine the identity or name of the one or more speaker representations. This is shown in Fig. 9, label 102a-c, Fig. 5, label 168a-c.)
Claim 7, Diamant et al discloses defining the one or more speaker representations includes defining one or more of: at least one known speaker representation and at least one unknown speaker representation (Fig. 5 label 166a-c shows the one or more speaker representations as undefined speakers. Paragraphs 32-34 disclose multiple implementations of label 124/126. This indicates defining one or more speaker representations includes defining at least one known speaker representation due to the use of memories or training data (paragraphs 33-34) and at least one unknown speaker representation by using a confidence level to define the unknown speaker representation (paragraph 32).).
Claim 8, Diamant et al discloses

receiving acoustic metadata associated with audio encounter information received by a first microphone system (Fig. 6, label 166,168 outputs metadata such as an image or video of the participant’s face (paragraph 31 discloses a camera that records a video including the participant’s face.), wherein such metadata is associated with the conversation or audio encounter information occurring as shown in Fig. 1a. Fig. 4, label 114 shows the metadata provided for the participants who are part of the acoustic environment as shown in Fig. 1a. Fig. 5 label 168a-c shows additional metadata including the name of the speaker or participant. The metadata includes voice identification, label 128, in association with the beamforming zone and beamformed signal. Fig. 1a shows the microphone system, label 106, wherein one microphone can be considered a first microphone system.);
defining one or more speaker representations (Fig. 4, label 114, 102a-c shows the video or the speakers. Fig. 4, label 114,166a-c shows the speaker representations that includes the image of the participant’s face as well as frame or border around the participant’s face.) based upon, at least in part, the acoustic metadata associated with the audio encounter information (Fig. 4, label 144 shows the one or more speaker representations is defined by label Face 1-3, wherein the image or video of the participant’s face is considered the metadata.) and the information associated with the acoustic environment (Fig. 4, label 116a-c includes the location of the speaker in the 
wherein the one or more speaker representations include spatial information (Fig. 4, label angle indicates the location of the participant’s face as the person appears in the video, label 114. Paragraph 31 discloses face location may include coordinates of a bounding box around a located face image, a portion of the digital image where the face was located, other location information, etc. Fig. 7, label 608, WHERE indicates the location or spatial information of the candidate and speaker. The angle indicates the location or spatial information.) and spectral information (Fig. 7, label 608, WHEN indicates the spectral information such as the time when the participant and speaker is speaking. Fig. 4, label 166 shows the face of the participant in the digital video, 114. Paragraph 31 discloses the spectral information such as the facial image of the candidate in the video.) clustered together for each speaker (Fig. 3, label 132, Paragraphs 46-50 disclose clustering WHO, WHERE, WHEN a participant has spoken, such as max P(who, angle| audio, video) (paragraph 50).); and
labeling one or more portions of the audio encounter information with the one or more speaker representations (Fig. 7, label 608a-c shows the labels of one or more portions of the audio encounter information (label 606) or conversation occurring as shown in Fig. 1a. Label 608a-c includes the name of the speaker/participant, which are associated with the one or more speaker representations as shown in Fig. 5, label 116a-c, Fig. 4, label 114.) and a speaker location within the acoustic environment (label 608a-c includes a degree, which indicates the location of the speaker within the acoustic environment as shown in Fig. 1a, label 100. Fig. 9, label 106 shows the microphone 
including the spatial information and the spectral information clustered together for each speaker (Fig. 7, label 608, Fig. 4, label 166, and paragraphs 46-50 disclose clustering of spatial and spectral information for each speaker.), and 
labeling the one or more portions of the audio encounter information with a speaker location within the acoustic environment for each speaker (Fig. 4, label 166, and Fig. 7, label 608, show the labeling of one or more portions of audio encounter information (such as the participant’s face and the participant’s speech or audio) with a speaker location (label angle, WHERE), within the acoustic environment (conversation occurring between participants shown in Fig. 1a, 9.)), 
wherein labeling the one or more portions of the audio encounter information (Fig. 7, label who,where,when indicates the labeling of the audio encounter information, labels 606) with the one or more speaker representations (Fig. 7, label Bob,Charlie,Alice indicates the name of the speaker or participants which is associated with corresponding image of the participant shown in Fig. 4,5.) and the speaker location within the acoustic environment (Fig. 7, label degree, indicates the location of the speaker. Fig. 9 shows the location as the degree from which the speaker is compared to the conference assistant, 106 that includes the microphones, 108.), the speaker location includes determining where the speaker location within the acoustic environment compared to a base of the first microphone system (Fig. 9 shows the location as the degree from which the speaker is compared to the conference assistant, 106 that includes the microphones, 108, where the conference system or diarization machine 132 is the base of the microphone system. 
Claim 9, Diamant et al discloses the acoustic metadata associated with the audio encounter information includes voice activity information (Fig. 3, label 150 shows the voice activity information by including the audio or waveform or signal of the speaker/participant speaking.) and signal location associated with the audio encounter information (Fig. 3, label 150 shows the audio encounter information in the form waveform or signal, wherein such waveform or signal is associated with a specific speaker as indicated by label 128,170. Such metadata is passed to label 132 from label 128. Such audio encounter information is also shown in Fig. 7 with associated speaker label.).
Claim 10, Diamant et al discloses receiving visual metadata associated with one or more encounter participants within the acoustic environment (Fig. 1a shows the acoustic environment or area with the microphone where the conversation is occurring. Fig. 4, label 114 shows the visual metadata, 166a-c along with text based metadata of the participants/speakers within the acoustic environment shown in Fig. 1a.).
Claim 11, Diamant et al discloses defining the one or more speaker representations includes defining the one or more speaker representations based upon, at least in part, the visual metadata associated with the one or more encounter participants within the acoustic environment (Fig. 5, label 116a-c shows the one or more speaker representations are defined by visual metadata or the image or clip from the video as shown in Fig. 4. Fig. 4, label 102a-c shows the one or more encounter participants, label 166a-c are the visual metadata associated with the one or more encounter participants within the acoustic environment, Fig. 1a, label 100.).

Claim 13, Diamant et al discloses defining the one or more speaker representations (Fig. 3, label 166a-c,114, Fig. 5, label 166a-c) includes defining the one or more speaker representations based upon, at least in part, the weighting metadata associated with the audio encounter information received by the second microphone system. (Paragraph 32 discloses using the weighting metadata or confidence level to determine the identity or name of the one or more speaker representations. This is shown in Fig. 9, label 102a-c, Fig. 5, label 168a-c.)
Claim 14, Diamant et al discloses defining the one or more speaker representations includes defining one or more of: at least one known speaker representation and at least one unknown speaker representation (Fig. 5 label 166a-c shows the one or more speaker representations as undefined speakers. Paragraphs 32-34 disclose multiple implementations of label 124/126. This indicates defining one or more speaker representations includes defining at least one known speaker representation due to the use of memories or training data (paragraphs 33-34) and at least one unknown 
Claim 15, Diamant et al discloses
a memory (paragraph 92); and
a processor (paragraph 90,93) configured to 
receive information associated with an acoustic environment (Fig. 1a, label 100 shows an acoustic environment such as a conversation occurring around microphone system, label 108,106. Fig. 6, label face location machine outputs the locations of candidate faces of the individuals that are in the acoustic environment such as shown in Fig. 1a, label 100.),
wherein the processor is further configured to receive acoustic metadata associated with audio encounter information received by a first microphone system (Fig. 6, label 166,168 outputs metadata such as a video image of the participant’s face (paragraph 31 discloses a camera that records a video including the participant’s face.), wherein such metadata is associated with the conversation or audio encounter information occurring as shown in Fig. 1a. Fig. 4, label 114 shows the metadata provided for the participants who are part of the acoustic environment as shown in Fig. 1a. Fig. 5 label 168a-c shows additional metadata including the name of the speaker or participant. The metadata includes voice identification, label 128, in association with the beamforming zone and beamformed signal. Fig. 1a shows the microphone system, label 106, wherein one microphone can be considered a first microphone system.),
wherein the processor is further configured to define one or more speaker representations (Fig. 4, label 114, 102a-c shows the video or the speakers. Fig. 4, label 114,166a-c shows the speaker representations that includes the image of the 
wherein the one or more speaker representations include spatial information (Fig. 4, label angle indicates the location of the participant’s face as the person appears in the video, label 114. Paragraph 31 discloses face location may include coordinates of a bounding box around a located face image, a portion of the digital image where the face was located, other location information, etc. Fig. 7, label 608, WHERE indicates the location or spatial information of the candidate and speaker. The angle indicates the location or spatial information.) and spectral information (Fig. 7, label 608, WHEN indicates the spectral information such as the time when the participant and speaker is speaking. Fig. 4, label 166 shows the face of the participant in the digital video, 114. Paragraph 31 discloses the spectral information such as the facial image of the candidate in the video.) clustered together for each speaker (Fig. 3, label 132, Paragraphs 46-50 disclose clustering WHO, WHERE, WHEN a participant has spoken, such as max P(who, angle| audio, video) (paragraph 50).); and
wherein the processor is further configured to label one or more portions of the audio encounter information with the one or more speaker representations (Fig. 7, label 608a-c shows the labels of one or more portions of the audio encounter information (label 606) or conversation occurring as shown in Fig. 1a. Label 608a-c includes the 
including the spatial information and the spectral information clustered together for each speaker (Fig. 7, label 608, Fig. 4, label 166, and paragraphs 46-50 disclose clustering of spatial and spectral information for each speaker.), and 
label the one or more portions of the audio encounter information with a speaker location within the acoustic environment for each speaker (Fig. 4, label 166, and Fig. 7, label 608, show the labeling of one or more portions of audio encounter information (such as the participant’s face and the participant’s speech or audio) with a speaker location (label angle, WHERE), within the acoustic environment (conversation occurring between participants shown in Fig. 1a, 9.)), 
wherein labeling the one or more portions of the audio encounter information (Fig. 7, label who,where,when indicates the labeling of the audio encounter information, labels 606) with the one or more speaker representations (Fig. 7, label Bob,Charlie,Alice indicates the name of the speaker or participants which is associated with corresponding image of the participant shown in Fig. 4,5.) and the speaker location within the acoustic environment (Fig. 7, label degree, indicates the location of the speaker. Fig. 9 shows the location as the degree from which the speaker is compared to the conference assistant, 106 that includes the microphones, 108.), the speaker location includes determining 
Claim 16, Diamant et al discloses the acoustic metadata associated with the audio encounter information includes voice activity information (Fig. 3, label 150 shows the voice activity information by including the audio or waveform or signal of the speaker/participant speaking.) and signal location associated with the audio encounter information (Fig. 3, label 150 shows the audio encounter information in the form waveform or signal, wherein such waveform or signal is associated with a specific speaker as indicated by label 128,170. Such metadata is passed to label 132 from label 128. Such audio encounter information is also shown in Fig. 7 with associated speaker label.).
Claim 17, Diamant et al discloses receiving visual metadata associated with one or more encounter participants within the acoustic environment (Fig. 1a shows the acoustic environment or area with the microphone where the conversation is occurring. Fig. 4, label 114 shows the visual metadata, 166a-c along with text based metadata of the participants/speakers within the acoustic environment shown in Fig. 1a.).
Claim 18, Diamant et al discloses defining the one or more speaker representations includes defining the one or more speaker representations based upon, at least in part, the visual metadata associated with the one or more encounter participants within the acoustic environment. (Fig. 5, label 116a-c shows the one or more 
Claim 19, Diamant et al discloses receiving weighting metadata (Paragraph 32 discloses confidence levels are determined for the metadata, such as the image or speaker representation’s identity. Such confidence level is a weighted metadata assessing the one or more speaker representations and the identity of the speaker. Fig. 1a,9 shows the speakers/participants associated with the audio encounter information.) associated with the audio encounter information received by at least a second microphone system (Fig. 1a, label 106, Fig. 9, label 106 indicates one or more microphones or microphone systems that receive audio encounter information or audio from the conversation occurring between the participants/speakers.).
Claim 20, Diamant et al discloses defining the one or more speaker representations (Fig. 3, label 166a-c,114, Fig. 5, label 166a-c) includes defining the one or more speaker representations based upon, at least in part, the weighting metadata associated with the audio encounter information received by the second microphone system (Paragraph 32 discloses using the weighting metadata or confidence level to determine the identity or name of the one or more speaker representations. This is shown in Fig. 9, label 102a-c, Fig. 5, label 168a-c.).




Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to LINDA WONG whose telephone number is (571)272-6044. The examiner can normally be reached 9-5.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Andrew Flanders, can be reached at (571) 272-7516. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/LINDA WONG/Primary Examiner, Art Unit 2655