DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement
The information disclosure statement (IDS) submitted on 06/03/2022 is in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statement is being considered by the examiner.

Response to Amendment
The amendments filed on May 18, 2022 have been entered.
Claims 1-4, 8-13, and 15-20 have been amended.
Claims 5-7 have been canceled. 
Claims 21-23 have been added. 

Response to Arguments
Applicant’s arguments filed on May 18, 2022 have been considered but are not persuasive. 

Applicant’s argument 1:
Sinkov, whether considered singly or in combination with the other cited references, fails to describe, teach, or suggest each limitation recited by independent claims 1, 11, and 17. For example, Sinkov, whether considered singly or in combination with the other cited references, fails to describe, teach, or suggest "determining that an action item comprising a task to be completed is associated with a first user of [a] first client device by determining, based on comparing [a] first plurality of volumes and [a] second plurality of volumes, that a segment of [] audio content that includes a description of the action item is contributed by the first user," as recited by currently amended independent claim 1 and as similarly recited by currently amended independent claims 11 and 17.
Examiners’ response to argument 1:
The examiners respectfully disagree. Sinkov discloses determining, based on comparing the first plurality of volumes and the second plurality of volumes, that a segment of the audio content is contributed by the first user (Parag. [0009] and Parag. [0017-0021]; (The art teaches determining which of a plurality of specific personal audio input audio devices correspond to which specific meeting participants, measuring volume levels at each of the personal audio input devices in response to each of the meeting participants speaking, identifying that a first particular one of the participants is speaking based on relative volume levels at each of the personal audio input devices. The art teaches that once the current speaker is identified, a particular smartphone of the speaker is marked by the system as a principal recording device and the system tracks a corresponding fragment of audio recording by that particular smartphone until a sufficiently long pause when the speaker either stopped talking to change the subject or for another reason, or until the current speaker is replaced by another speaker. In either case, the fragment is picked by the system and added to the current channel of the speaker. Each channel of each speaker may therefore include subsequent fragments by a single speaker, uninterrupted by others and separated by pauses (such fragments may of course be merged during post-processing of the meeting recording) or fragments separated in time by audio fragments from other speakers recorded in their channels)).
Sinkov doesn’t explicitly disclose determining that an action item comprising a task to be completed is associated with a first user of the first client device by determining that a segment of the audio content that includes a description of the action item is contributed by the first user.
However, Diamant discloses determining that an action item comprising a task to be completed is associated with a first user of the first client device by determining that a segment of the audio content that includes a description of the action item is contributed by the first user (Parag. [0052-0053], Parag. [0109], Parag. [0115], and Parag. [0138]; (The art teaches that FIG. 6 is a visual representation of an example output of diarization machine. In FIG. 6, a vertical axis is used to denote WHO (e.g., Bob) is speaking (i.e., identifying the user of the client device); the horizontal axis denotes WHEN (e.g., 30.01 s-34.87 s) that speaker is speaking; and the depth axis denotes from WHERE (e.g., 23°) that speaker is speaking. Diarization machine 132 uses this WHO/WHEN/WHERE information to label corresponding segments 604 of the audio signal(s) 606 under analysis with labels 608. The segments 604 and/or corresponding labels may be output from the diarization machine 132 in any suitable format. The output effectively associates speech with a particular speaker (i.e., identifying the user of the client device) during a conversation among N speakers, and allows the audio signal corresponding to each speech utterance (with WHO/WHEN/WHERE labeling/metadata) to be used for myriad downstream operations. One nonlimiting downstream operation is conversation transcription. FIG. 1B teaches that a computerized conference assistant 106 may include a speech recognition machine 130. As shown in FIG. 8, the speech recognition machine 130 may be configured to translate an audio signal of recorded speech (e.g., signals 112, beamformed signal 150, signal 606, and/or segments 604) into text 800. 
The art also teaches that in some examples, transcribed speech and/or speaker identity information may be gathered by computerized intelligent assistant 1300 in real time, in order to build the transcript in real time, and/or in order to provide notifications to conference participants about the transcribed speech in real time. In some examples, computerized intelligent assistant 1300 may be configured, for a stream of speech audio captured by a microphone, to identify a current speaker and to analyze the speech audio in order to transcribe speech text, substantially in parallel and/or in real time, so that speaker identity and transcribed speech text may be independently available. Accordingly, computerized intelligent assistant 1300 may be able to provide notifications to the conference participants in real time (e.g., for display at companion devices) indicating that another conference participant is currently speaking and including transcribed speech of the other conference participant, even before the other conference participant has finished speaking. The art also teaches that the transcript may be analyzed using any suitable machine learning (ML) and/or artificial intelligence (AI) techniques, wherein such analysis may include, for raw audio observed during a conference, recognizing text corresponding to the raw audio, and recognizing one or more salient features of the text and/or raw audio. Non-limiting examples of salient features that may be recognized by ML and/or AI techniques include 1) an intent (e.g., an intended task of a conference participant), 2) a context (e.g., a task currently being performed by a conference participant), 3) a topic and/or 4) an action item or commitment (e.g., a task that a conference participant promises to perform))).
Therefore, the combination of Sinkov and Diamant discloses determining that an action item comprising a task to be completed is associated with a first user of the first client device by determining, based on comparing the first plurality of volumes and the second plurality of volumes, that a segment of the audio content that includes a description of the action item is contributed by the first user. 

Applicant’s argument 2:
Sinkov, whether considered singly or in combination with the other cited references, further fails to describe, teach, or suggest each limitation recited by currently amended dependent claim 9. In particular, Sinkov, whether considered singly or in combination with the other cited references, fails to describe, teach, or suggest "associating the action item with the first user comprises: generating an action item prompt to complete the action item; and providing the action item prompt for display on the first client device." As previously mentioned, the Office Action relies on portions of Diamant to cover identification of an action item. While the referenced portions of Diamant describe identifying an action item, they fail to discuss functions performed in response to identifying the action item. Indeed, they fail to teach or suggest "generating an action item prompt to complete the action item" and "providing the action item prompt for display on the first client device." Thus, the combination of Sinkov and Diamant fails to teach or suggest every limitation of currently amended dependent claim 9. 

Examiners’ response to argument 2:
The examiners respectfully disagree. Diamant discloses generating an action item prompt to complete the action item and providing the action item prompt for display on the first client device (Parag. [0109-0111]; (The art teaches that a reviewable transcript may be analyzed using any suitable machine learning (ML) and/or artificial intelligence (AI) techniques, wherein such analysis may include, for raw audio observed during a conference, recognizing text corresponding to the raw audio, and recognizing one or more salient features of the text and/or raw audio. Non-limiting examples of salient features that may be recognized by ML and/or AI techniques include 1) an intent (e.g., an intended task of a conference participant), 2) a context (e.g., a task currently being performed by a conference participant), 3) a topic and/or 4) an action item or commitment (e.g., a task that a conference participant promises to perform). The art teaches that a reviewable transcript may be provided to other individuals instead of or in addition to providing the reviewable transcript to conference participants (i.e., including the first client device). In an example, a reviewable transcript may be provided to a supervisor, colleague, or employee of a conference participant. In an example, the conference leader or any other suitable member of an organization associated with the conference may restrict sharing of the reviewable transcript (e.g., so that the conference leader's permission is needed for sharing, or so that the reviewable transcript can only be shared within the organization, in accordance with security and/or privacy policies of the organization). The reviewable transcript may be shared in an unabridged and/or edited form, e.g., the conference leader may initially review the reviewable transcript in order to redact sensitive information, before sharing the redacted transcript with any suitable individuals. 
The reviewable transcript may be filtered to focus on content of interest (e.g., name mentions and action items) for any individual receiving the reviewable transcript (i.e., the action item prompt is generated and displayed))).

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1-4, 8-21, and 23 are rejected under 35 U.S.C. 103 as being unpatentable over Sinkov et al. (Pub. No. US 2019/0200121), hereinafter Sinkov; in view of Diamant (Pub. No. US 2019/0341050), hereinafter Diamant.

Claim 1. 	Sinkov discloses a computer-implemented method comprising: 
receiving, by a digital content management system, a first set of audio data from a first client device, the first set of audio data comprising audio content corresponding to speech from a plurality of participants of a meeting captured in a first plurality of volumes (Parag. [0009] and Parag. [0017-0021]; (The art teaches determining which of a plurality of specific personal audio input audio devices correspond to which specific meeting participants, measuring volume levels at each of the personal audio input devices in response to each of the meeting participants speaking, identifying that a first particular one of the participants is speaking based on relative volume levels at each of the personal audio input devices. Recording audio information from a meeting may also include simultaneously recording audio input (i.e., first set of audio data) at the first one of the personal audio input audio devices and on the first channel and audio input at the second one of the personal audio input audio devices and on the second channel in response to the first and second meeting participants speaking at the same time. Recording audio information from a meeting may also include filtering the audio input at the first channel and the second channel to separate speech by the first participant from speech by the second participant. Filtering the audio input may be based on a distance related volume weakening coefficient, signal latency between the personal audio input devices, and/or ambient noise. In the event of double-talk when two or more speakers talk simultaneously for a period of time, the system may initially identify each speaker, and record double-talk on all principal smartphones owned by current speakers. After a double talk episode has ended, the system may attempt clearing each recorded fragment from double-talk by non-owners prior to placing it into the corresponding speaker channel. 
Such clearing may be facilitated by simultaneous processing of recorded fragments from all principal phones engaged in the double-talk)); 
receiving, by the digital content management system, a second set of audio data from a second client device, the second set of audio data comprising the audio content corresponding to the speech from the plurality of participants of the meeting captured in a second plurality of volumes (Parag. [0009] and Parag. [0017-0021]; (The art teaches determining which of a plurality of specific personal audio input audio devices correspond to which specific meeting participants, measuring volume levels at each of the personal audio input devices in response to each of the meeting participants speaking, identifying that a first particular one of the participants is speaking based on relative volume levels at each of the personal audio input devices. Recording audio information from a meeting may also include simultaneously recording audio input at the first one of the personal audio input audio devices and on the first channel and audio input (i.e., second set of audio data) at the second one of the personal audio input audio devices and on the second channel in response to the first and second meeting participants speaking at the same time. Recording audio information from a meeting may also include filtering the audio input at the first channel and the second channel to separate speech by the first participant from speech by the second participant. Filtering the audio input may be based on a distance related volume weakening coefficient, signal latency between the personal audio input devices, and/or ambient noise. In the event of double-talk when two or more speakers talk simultaneously for a period of time, the system may initially identify each speaker, and record double-talk on all principal smartphones owned by current speakers. After a double talk episode has ended, the system may attempt clearing each recorded fragment from double-talk by non-owners prior to placing it into the corresponding speaker channel. 
Such clearing may be facilitated by simultaneous processing of recorded fragments from all principal phones engaged in the double-talk)); 
comparing, by the digital content management system, the first plurality of volumes and the second plurality of volumes (Parag. [0009] and Parag. [0018-0021]; (The art teaches determining which of a plurality of specific personal audio input audio devices correspond to which specific meeting participants, measuring volume levels at each of the personal audio input devices in response to each of the meeting participants speaking, identifying that a first particular one of the participants is speaking based on relative volume levels at each of the personal audio input devices. Recording audio information from a meeting may also include simultaneously recording audio input at the first one of the personal audio input audio devices and on the first channel and audio input at the second one of the personal audio input audio devices and on the second channel in response to the first and second meeting participants speaking at the same time. Recording audio information from a meeting may also include filtering the audio input at the first channel and the second channel to separate speech by the first participant from speech by the second participant. Filtering the audio input may be based on a distance related volume weakening coefficient, signal latency between the personal audio input devices, and/or ambient noise. In the event of double-talk when two or more speakers talk simultaneously for a period of time, the system may initially identify each speaker, and record double-talk on all principal smartphones owned by current speakers. The availability of two symmetric cross-recordings may facilitate assessing the coefficients (after an initial cancellation of ambient noises) and filtering out the weaker components using, for example, echo cancellation technique. 
Even if the double-talk suppression process has not fully succeeded, each channel unambiguously represents a corresponding speaker and any mix of speaker voices may be instantly identified in a full record by referring to the simultaneous recording by other principal phone(s), i.e. by switching channels of simultaneous speakers)); 
determining, based on comparing the first plurality of volumes and the second plurality of volumes, that a segment of the audio content is contributed by the first user (Parag. [0009] and Parag. [0017-0021]; (The art teaches determining which of a plurality of specific personal audio input audio devices correspond to which specific meeting participants, measuring volume levels at each of the personal audio input devices in response to each of the meeting participants speaking, identifying that a first particular one of the participants is speaking based on relative volume levels at each of the personal audio input devices. The art teaches that once the current speaker is identified, a particular smartphone of the speaker is marked by the system as a principal recording device and the system tracks a corresponding fragment of audio recording by that particular smartphone until a sufficiently long pause when the speaker either stopped talking to change the subject or for another reason, or until the current speaker is replaced by another speaker. In either case, the fragment is picked by the system and added to the current channel of the speaker. Each channel of each speaker may therefore include subsequent fragments by a single speaker, uninterrupted by others and separated by pauses (such fragments may of course be merged during post-processing of the meeting recording) or fragments separated in time by audio fragments from other speakers recorded in their channels)).
Sinkov doesn’t explicitly disclose determining that an action item comprising a task to be completed is associated with a first user of the first client device by determining that a segment of the audio content that includes a description of the action item is contributed by the first user; and associating, by the digital content management system, the action item with the first user.
However, Diamant discloses:
determining that an action item comprising a task to be completed is associated with a first user of the first client device by determining that a segment of the audio content that includes a description of the action item is contributed by the first user (Parag. [0052-0053], Parag. [0109], Parag. [0115], and Parag. [0138]; (The art teaches that FIG. 6 is a visual representation of an example output of diarization machine. In FIG. 6, a vertical axis is used to denote WHO (e.g., Bob) is speaking (i.e., identifying the user of the client device); the horizontal axis denotes WHEN (e.g., 30.01 s-34.87 s) that speaker is speaking; and the depth axis denotes from WHERE (e.g., 23°) that speaker is speaking. Diarization machine 132 uses this WHO/WHEN/WHERE information to label corresponding segments 604 of the audio signal(s) 606 under analysis with labels 608. The segments 604 and/or corresponding labels may be output from the diarization machine 132 in any suitable format. The output effectively associates speech with a particular speaker (i.e., identifying the user of the client device) during a conversation among N speakers, and allows the audio signal corresponding to each speech utterance (with WHO/WHEN/WHERE labeling/metadata) to be used for myriad downstream operations. One nonlimiting downstream operation is conversation transcription. FIG. 1B teaches that a computerized conference assistant 106 may include a speech recognition machine 130. As shown in FIG. 8, the speech recognition machine 130 may be configured to translate an audio signal of recorded speech (e.g., signals 112, beamformed signal 150, signal 606, and/or segments 604) into text 800. 
The art also teaches that in some examples, transcribed speech and/or speaker identity information may be gathered by computerized intelligent assistant 1300 in real time, in order to build the transcript in real time, and/or in order to provide notifications to conference participants about the transcribed speech in real time. In some examples, computerized intelligent assistant 1300 may be configured, for a stream of speech audio captured by a microphone, to identify a current speaker and to analyze the speech audio in order to transcribe speech text, substantially in parallel and/or in real time, so that speaker identity and transcribed speech text may be independently available. Accordingly, computerized intelligent assistant 1300 may be able to provide notifications to the conference participants in real time (e.g., for display at companion devices) indicating that another conference participant is currently speaking and including transcribed speech of the other conference participant, even before the other conference participant has finished speaking. The art also teaches that the transcript may be analyzed using any suitable machine learning (ML) and/or artificial intelligence (AI) techniques, wherein such analysis may include, for raw audio observed during a conference, recognizing text corresponding to the raw audio, and recognizing one or more salient features of the text and/or raw audio. Non-limiting examples of salient features that may be recognized by ML and/or AI techniques include 1) an intent (e.g., an intended task of a conference participant), 2) a context (e.g., a task currently being performed by a conference participant), 3) a topic and/or 4) an action item or commitment (e.g., a task that a conference participant promises to perform))); and 
associating, by the digital content management system, the action item with the first user (Parag. [0052-0053], Parag. [0109], Parag. [0115], and Parag. [0138]; (The art teaches that the transcript may be analyzed using any suitable machine learning (ML) and/or artificial intelligence (AI) techniques, wherein such analysis may include, for raw audio observed during a conference, recognizing text corresponding to the raw audio, and recognizing one or more salient features of the text and/or raw audio. Non-limiting examples of salient features that may be recognized by ML and/or AI techniques include 1) an intent (e.g., an intended task of a conference participant), 2) a context (e.g., a task currently being performed by a conference participant), 3) a topic and/or 4) an action item or commitment (e.g., a task that a conference participant promises to perform) (i.e., action item))).  
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Sinkov to incorporate the teaching of Diamant. Doing so would make it convenient to coordinate the conference by providing a transcript of the conference to conference participants for subsequent review, tracking arrivals and departures of conference participants, providing cues to conference participants during the conference, and/or analyzing the information in order to summarize one or more aspects of the conference for subsequent review (Parag. [0024]).
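
The volume-based attribution mapped to claim 1 above can be illustrated with a minimal sketch. It assumes time-aligned, per-frame volume levels from two client devices and attributes each stretch of audio to the device that records the loudest volume. The names `Segment`, `attribute_segments`, and the device identifiers are hypothetical and do not appear in Sinkov or Diamant; the sketch does not model the filtering, latency, or double-talk handling described in the cited paragraphs.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds from the start of the meeting
    end: float
    device: str   # hypothetical identifier of the loudest device


def attribute_segments(volumes, frame_seconds=1.0):
    """Group consecutive frames by the device that recorded the loudest volume.

    `volumes` maps a device id to a list of per-frame volume levels. All
    lists are assumed to be time-aligned and of equal length.
    """
    devices = list(volumes)
    n_frames = len(volumes[devices[0]])
    segments = []
    for i in range(n_frames):
        # Compare the volumes captured by each device for this frame.
        loudest = max(devices, key=lambda d: volumes[d][i])
        start = i * frame_seconds
        if segments and segments[-1].device == loudest:
            # Same device is still loudest: extend the current segment.
            segments[-1].end = start + frame_seconds
        else:
            segments.append(Segment(start, start + frame_seconds, loudest))
    return segments


# Illustrative (made-up) volume levels for two devices over four frames.
volumes = {
    "first_client_device": [0.9, 0.8, 0.2, 0.1],
    "second_client_device": [0.3, 0.2, 0.7, 0.8],
}
segments = attribute_segments(volumes)
```

Under these assumptions, a segment attributed to the first client device would be treated as contributed by the first user, and any action item described in that segment could then be associated with that user.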

Claim 2. 	Sinkov in view of Diamant discloses the computer-implemented method of claim 1, 
Sinkov further discloses wherein the first set of audio data further comprises a time-based record of the first plurality of volumes captured by the first client device; and further comprising analyzing the first set of audio data to determine a primary speaking volume associated with the first client device by analyzing the time-based record of the first plurality of volumes to determine the primary speaking volume (Parag. [0009] and Parag. [0017-0021]; (The art teaches determining which of a plurality of specific personal audio input audio devices correspond to which specific meeting participants, measuring volume levels at each of the personal audio input devices in response to each of the meeting participants speaking, identifying that a first particular one of the participants is speaking based on relative volume levels at each of the personal audio input devices. Recording audio information from a meeting may also include simultaneously recording audio input (i.e., first set of audio data) at the first one of the personal audio input audio devices and on the first channel and audio input at the second one of the personal audio input audio devices and on the second channel in response to the first and second meeting participants speaking at the same time. Recording audio information from a meeting may also include filtering the audio input at the first channel and the second channel to separate speech by the first participant from speech by the second participant. Filtering the audio input may be based on a distance related volume weakening coefficient, signal latency between the personal audio input devices, and/or ambient noise. In the event of double-talk when two or more speakers talk simultaneously for a period of time, the system may initially identify each speaker, and record double-talk on all principal smartphones owned by current speakers)).  
 
Claim 3. 	Sinkov in view of Diamant discloses the computer-implemented method of claim 1,  
Sinkov further discloses the computer-implemented method further comprising determining a primary speaking volume associated with the first client device based on comparing th(Parag. [0009] and Parag. [0018-0021]; (The art teaches determining which of a plurality of specific personal audio input audio devices correspond to which specific meeting participants, measuring volume levels at each of the personal audio input devices in response to each of the meeting participants speaking, identifying that a first particular one of the participants is speaking based on relative volume levels at each of the personal audio input devices. Recording audio information from a meeting may also include simultaneously recording audio input at the first one of the personal audio input audio devices and on the first channel and audio input at the second one of the personal audio input audio devices and on the second channel in response to the first and second meeting participants speaking at the same time. Recording audio information from a meeting may also include filtering the audio input at the first channel and the second channel to separate speech by the first participant from speech by the second participant. Filtering the audio input may be based on a distance related volume weakening coefficient, signal latency between the personal audio input devices, and/or ambient noise. In the event of double-talk when two or more speakers talk simultaneously for a period of time, the system may initially identify each speaker, and record double-talk on all principal smartphones owned by current speakers. The availability of two symmetric cross-recordings may facilitate assessing the coefficients (after an initial cancellation of ambient noises) and filtering out the weaker components using, for example, echo cancellation technique. 
Even if the double-talk suppression process has not fully succeeded, each channel unambiguously represents a corresponding speaker and any mix of speaker voices may be instantly identified in a full record by referring to the simultaneous recording by other principal phone(s), i.e. by switching channels of simultaneous speakers)).  
Claim 4. 	Sinkov in view of Diamant discloses the computer-implemented method of claim 1,  
Sinkov doesn’t explicitly disclose the computer-implemented method further comprising receiving, from a computer application installed on the first client device, an authentication of the first user, wherein determining that the segment of the audio content is contributed by the first user with the segment of the audio content is further based on the authentication of the first user.   
However, Diamant discloses:  
receiving, from a computer application installed on the first client device, an authentication of the first user (Parag. [0095]; (The art teaches that the computerized intelligent assistant may be configured to track the arrival of a remote participant based on the remote participant logging in to a remote conferencing program (e.g., a messaging application, voice and/or video chat application, or any other suitable interface for remote interaction))), wherein determining that the segment of the audio content is contributed by the first user with the segment of the audio content is further based on the authentication of the first user (Parag. [0052-0053], Parag. [0109], Parag. [0115], and Parag. [0138]; (The art teaches that FIG. 6 is a visual representation of an example output of diarization machine. In FIG. 6, a vertical axis is used to denote WHO (e.g., Bob) is speaking; the horizontal axis denotes WHEN (e.g., 30.01 s-34.87 s) that speaker is speaking; and the depth axis denotes from WHERE (e.g., 23°) that speaker is speaking. Diarization machine 132 uses this WHO/WHEN/WHERE information to label corresponding segments 604 of the audio signal(s) 606 under analysis with labels 608)).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Sinkov to incorporate the teaching of Diamant. Doing so would make it convenient to coordinate the conference by providing a transcript of the conference to conference participants for subsequent review, tracking arrivals and departures of conference participants, providing cues to conference participants during the conference, and/or analyzing the information in order to summarize one or more aspects of the conference for subsequent review (Parag. [0024]).

Claim 8. 	Sinkov in view of Diamant discloses the computer-implemented method of claim 1,  
Sinkov doesn’t explicitly disclose the computer-implemented method further comprising: generating an identification tag corresponding to the first user; and modifying a meeting transcript comprising a text representation of the audio content by associating the identification tag with the segment of the audio content.  
However, Diamant discloses generating an identification tag corresponding to the first user; and modifying a meeting transcript comprising a text representation of the audio content by associating the identification tag with the segment of the audio content (Parag. [0052-0053], Parag. [0109], Parag. [0115], and Parag. [0138]; (The art teaches that FIG. 6 is a visual representation of an example output of diarization machine. In FIG. 6, a vertical axis is used to denote WHO (e.g., Bob) is speaking; the horizontal axis denotes WHEN (e.g., 30.01 s-34.87 s) that speaker is speaking; and the depth axis denotes from WHERE (e.g., 23°) that speaker is speaking. Diarization machine 132 uses this WHO/WHEN/WHERE information to label corresponding segments 604 of the audio signal(s) 606 under analysis with labels 608. The segments 604 and/or corresponding labels may be output from the diarization machine 132 in any suitable format. The output effectively associates speech with a particular speaker during a conversation among N speakers, and allows the audio signal corresponding to each speech utterance (with WHO/WHEN/WHERE labeling/metadata) to be used for myriad downstream operations. One nonlimiting downstream operation is conversation transcription. The art also teaches that the transcript may be analyzed using any suitable machine learning (ML) and/or artificial intelligence (AI) techniques, wherein such analysis may include, for raw audio observed during a conference, recognizing text corresponding to the raw audio, and recognizing one or more salient features of the text and/or raw audio.
Non-limiting examples of salient features that may be recognized by ML and/or AI techniques include 1) an intent (e.g., an intended task of a conference participant), 2) a context (e.g., a task currently being performed by a conference participant), 3) a topic and/or 4) an action item or commitment (e.g., a task that a conference participant promises to perform) (i.e., meeting item). Also, in an example, the art teaches that the machine learning classifier may be configured to receive any other suitable transcript data automatically recorded at 211, e.g., transcribed speech audio in the form of text. The transcription machine may be configured to analyze the transcript to detect words having a predefined sentiment (e.g., positive, negative, “happy”, or any other suitable sentiment), in order to present a sentiment analysis summary at a companion device of a conference participant (i.e., first user), indicating a frequency of utterance of words having the predefined sentiment)).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Sinkov to incorporate the teaching of Diamant. This would be convenient to coordinate the conference, by providing a transcript of the conference to conference participants for subsequent review, tracking arrivals and departures of conference participants, providing cues to conference participants during the conference, and/or analyzing the information in order to summarize one or more aspects of the conference for subsequent review (Parag. [0024]).

Claim 9. 	Sinkov in view of Diamant discloses the computer-implemented method of claim 1,   
Sinkov doesn’t explicitly disclose wherein: associating the action item with the first user comprises: generating an action item prompt to complete the action item; and providing the action item prompt for display on the first client device.  
However, Diamant discloses associating the action item with the first user comprises: generating an action item prompt to complete the action item; and providing the action item prompt for display on the first client device (Parag. [0109-0111]; (The art teaches that a reviewable transcript may be analyzed using any suitable machine learning (ML) and/or artificial intelligence (AI) techniques, wherein such analysis may include, for raw audio observed during a conference, recognizing text corresponding to the raw audio, and recognizing one or more salient features of the text and/or raw audio. Non-limiting examples of salient features that may be recognized by ML and/or AI techniques include 1) an intent (e.g., an intended task of a conference participant), 2) a context (e.g., a task currently being performed by a conference participant), 3) a topic and/or 4) an action item or commitment (e.g., a task that a conference participant promises to perform). The art teaches that a reviewable transcript may be provided to other individuals instead of or in addition to providing the reviewable transcript to conference participants (i.e., including the first client device). In an example, a reviewable transcript may be provided to a supervisor, colleague, or employee of a conference participant. In an example, the conference leader or any other suitable member of an organization associated with the conference may restrict sharing of the reviewable transcript (e.g., so that the conference leader's permission is needed for sharing, or so that the reviewable transcript can only be shared within the organization, in accordance with security and/or privacy policies of the organization). The reviewable transcript may be shared in an unabridged and/or edited form, e.g., the conference leader may initially review the reviewable transcript in order to redact sensitive information, before sharing the redacted transcript with any suitable individuals.
The reviewable transcript may be filtered to focus on content of interest (e.g., name mentions and action items) for any individual receiving the reviewable transcript (i.e., the action item prompt is generated and displayed))).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Sinkov to incorporate the teaching of Diamant. This would be convenient to coordinate the conference, by providing a transcript of the conference to conference participants for subsequent review, tracking arrivals and departures of conference participants, providing cues to conference participants during the conference, and/or analyzing the information in order to summarize one or more aspects of the conference for subsequent review (Parag. [0024]).
Claim 10. 	Sinkov in view of Diamant discloses the computer-implemented method of claim 1,  
Sinkov doesn’t explicitly disclose the computer-implemented method further comprising generating a transcript of the audio content based on at least one of the first set of audio data or the second set of audio data.
However, Diamant discloses generating a transcript of the audio content based on at least one of the first set of audio data or the second set of audio data (Parag. [0060]; (The art teaches that labeled and/or partially labeled audio segments may be used to not only determine which of a plurality of N speakers is responsible for an utterance, but also translate the utterance into a textual representation for downstream operations, such as transcription)).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Sinkov to incorporate the teaching of Diamant. This would be convenient to coordinate the conference, by providing a transcript of the conference to conference participants for subsequent review, tracking arrivals and departures of conference participants, providing cues to conference participants during the conference, and/or analyzing the information in order to summarize one or more aspects of the conference for subsequent review (Parag. [0024]).

Claim 11. 	Sinkov discloses a non-transitory computer readable storage medium comprising instructions that, when executed by at least one processor (Parag. [0009-0010]), cause a computing device to: 
receive a first set of audio data from a first client device, the first set of audio data comprising audio content corresponding to speech from a plurality of participants of a meeting captured in a first plurality of volumes (Parag. [0009] and Parag. [0017-0021]; (The art teaches determining which of a plurality of specific personal audio input audio devices correspond to which specific meeting participants, measuring volume levels at each of the personal audio input devices in response to each of the meeting participants speaking, identifying that a first particular one of the participants is speaking based on relative volume levels at each of the personal audio input devices. Recording audio information from a meeting may also include simultaneously recording audio input (i.e., first set of audio data) at the first one of the personal audio input audio devices and on the first channel and audio input at the second one of the personal audio input audio devices and on the second channel in response to the first and second meeting participants speaking at the same time. Recording audio information from a meeting may also include filtering the audio input at the first channel and the second channel to separate speech by the first participant from speech by the second participant. Filtering the audio input may be based on a distance related volume weakening coefficient, signal latency between the personal audio input devices, and/or ambient noise. In the event of double-talk when two or more speakers talk simultaneously for a period of time, the system may initially identify each speaker, and record double-talk on all principal smartphones owned by current speakers. After a double talk episode has ended, the system may attempt clearing each recorded fragment from double-talk by non-owners prior to placing it into the corresponding speaker channel. 
Such clearing may be facilitated by simultaneous processing of recorded fragments from all principal phones engaged in the double-talk));  
receive a second set of audio data from a second client device, the second set of audio data comprising the audio content corresponding to the speech from the plurality of participants of the meeting captured in a second plurality of volumes (Parag. [0009] and Parag. [0017-0021]; (The art teaches determining which of a plurality of specific personal audio input audio devices correspond to which specific meeting participants, measuring volume levels at each of the personal audio input devices in response to each of the meeting participants speaking, identifying that a first particular one of the participants is speaking based on relative volume levels at each of the personal audio input devices. Recording audio information from a meeting may also include simultaneously recording audio input at the first one of the personal audio input audio devices and on the first channel and audio input (i.e., second set of audio data) at the second one of the personal audio input audio devices and on the second channel in response to the first and second meeting participants speaking at the same time. Recording audio information from a meeting may also include filtering the audio input at the first channel and the second channel to separate speech by the first participant from speech by the second participant. Filtering the audio input may be based on a distance related volume weakening coefficient, signal latency between the personal audio input devices, and/or ambient noise. In the event of double-talk when two or more speakers talk simultaneously for a period of time, the system may initially identify each speaker, and record double-talk on all principal smartphones owned by current speakers. After a double talk episode has ended, the system may attempt clearing each recorded fragment from double-talk by non-owners prior to placing it into the corresponding speaker channel. 
Such clearing may be facilitated by simultaneous processing of recorded fragments from all principal phones engaged in the double-talk));
compare the first plurality of volumes and the second plurality of volumes (Parag. [0009] and Parag. [0018-0021]; (The art teaches determining which of a plurality of specific personal audio input audio devices correspond to which specific meeting participants, measuring volume levels at each of the personal audio input devices in response to each of the meeting participants speaking, identifying that a first particular one of the participants is speaking based on relative volume levels at each of the personal audio input devices. Recording audio information from a meeting may also include simultaneously recording audio input at the first one of the personal audio input audio devices and on the first channel and audio input at the second one of the personal audio input audio devices and on the second channel in response to the first and second meeting participants speaking at the same time. Recording audio information from a meeting may also include filtering the audio input at the first channel and the second channel to separate speech by the first participant from speech by the second participant. Filtering the audio input may be based on a distance related volume weakening coefficient, signal latency between the personal audio input devices, and/or ambient noise. In the event of double-talk when two or more speakers talk simultaneously for a period of time, the system may initially identify each speaker, and record double-talk on all principal smartphones owned by current speakers. The availability of two symmetric cross-recordings may facilitate assessing the coefficients (after an initial cancellation of ambient noises) and filtering out the weaker components using, for example, echo cancellation technique. 
Even if the double-talk suppression process has not fully succeeded, each channel unambiguously represents a corresponding speaker and any mix of speaker voices may be instantly identified in a full record by referring to the simultaneous recording by other principal phone(s), i.e. by switching channels of simultaneous speakers)); 
determining, based on comparing the first plurality of volumes and the second plurality of volumes, that a segment of the audio content is contributed by the first user (Parag. [0009] and Parag. [0017-0021]; (The art teaches determining which of a plurality of specific personal audio input audio devices correspond to which specific meeting participants, measuring volume levels at each of the personal audio input devices in response to each of the meeting participants speaking, identifying that a first particular one of the participants is speaking based on relative volume levels at each of the personal audio input devices. The art teaches that once the current speaker is identified, a particular smartphone of the speaker is marked by the system as a principal recording device and the system tracks a corresponding fragment of audio recording by that particular smartphone until a sufficiently long pause when the speaker either stopped talking to change the subject or for other reason or until the current speaker is replaced by another speaker. In either case, the fragment is picked by the system and added to the current channel of the speaker. Each channel of each speaker may therefore include subsequent fragments by a single speaker, uninterrupted by others and separated by pauses (such fragments may of course be merged during post-processing of the meeting recording) or fragments separated in time by audio fragments from other speakers recorded in their channels)).
Sinkov doesn’t explicitly disclose determine that an action item comprising a task to be completed is associated with a first user of the first client device by determining that a segment of the audio content that includes a description of the action item is contributed by the first user; and associate the action item with the first user.
However, Diamant discloses:
determine that an action item comprising a task to be completed is associated with a first user of the first client device by determining that a segment of the audio content that includes a description of the action item is contributed by the first user (Parag. [0052-0053], Parag. [0109], Parag. [0115], and Parag. [0138]; (The art teaches that FIG. 6 is a visual representation of an example output of diarization machine. In FIG. 6, a vertical axis is used to denote WHO (e.g., Bob) is speaking (i.e., identifying the user of the client device); the horizontal axis denotes WHEN (e.g., 30.01 s-34.87 s) that speaker is speaking; and the depth axis denotes from WHERE (e.g., 23°) that speaker is speaking. Diarization machine 132 uses this WHO/WHEN/WHERE information to label corresponding segments 604 of the audio signal(s) 606 under analysis with labels 608. The segments 604 and/or corresponding labels may be output from the diarization machine 132 in any suitable format. The output effectively associates speech with a particular speaker (i.e., identifying the user of the client device) during a conversation among N speakers, and allows the audio signal corresponding to each speech utterance (with WHO/WHEN/WHERE labeling/metadata) to be used for myriad downstream operations. One nonlimiting downstream operation is conversation transcription. FIG. 1B teaches a computerized conference assistant 106 may include a speech recognition machine 130. As shown in FIG. 8, the speech recognition machine 130 may be configured to translate an audio signal of recorded speech (e.g., signals 112, beamformed signal 150, signal 606, and/or segments 604) into text 800.
The art also teaches that in some examples, transcribed speech and/or speaker identity information may be gathered by computerized intelligent assistant 1300 in real time, in order to build the transcript in real time, and/or in order to provide notifications to conference participants about the transcribed speech in real time. In some examples, computerized intelligent assistant 1300 may be configured, for a stream of speech audio captured by a microphone, to identify a current speaker and to analyze the speech audio in order to transcribe speech text, substantially in parallel and/or in real time, so that speaker identity and transcribed speech text may be independently available. Accordingly, computerized intelligent assistant 1300 may be able to provide notifications to the conference participants in real time (e.g., for display at companion devices) indicating that another conference participant is currently speaking and including transcribed speech of the other conference participant, even before the other conference participant has finished speaking. The art also teaches that the transcript may be analyzed using any suitable machine learning (ML) and/or artificial intelligence (AI) techniques, wherein such analysis may include, for raw audio observed during a conference, recognizing text corresponding to the raw audio, and recognizing one or more salient features of the text and/or raw audio. Non-limiting examples of salient features that may be recognized by ML and/or AI techniques include 1) an intent (e.g., an intended task of a conference participant), 2) a context (e.g., a task currently being performed by a conference participant), 3) a topic and/or 4) an action item or commitment (e.g., a task that a conference participant promises to perform))); and 
associate the action item with the first user (Parag. [0052-0053], Parag. [0109], Parag. [0115], and Parag. [0138]; (The art teaches that the transcript may be analyzed using any suitable machine learning (ML) and/or artificial intelligence (AI) techniques, wherein such analysis may include, for raw audio observed during a conference, recognizing text corresponding to the raw audio, and recognizing one or more salient features of the text and/or raw audio. Non-limiting examples of salient features that may be recognized by ML and/or AI techniques include 1) an intent (e.g., an intended task of a conference participant), 2) a context (e.g., a task currently being performed by a conference participant), 3) a topic and/or 4) an action item or commitment (e.g., a task that a conference participant promises to perform) (i.e., action item))).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Sinkov to incorporate the teaching of Diamant. This would be convenient to coordinate the conference, by providing a transcript of the conference to conference participants for subsequent review, tracking arrivals and departures of conference participants, providing cues to conference participants during the conference, and/or analyzing the information in order to summarize one or more aspects of the conference for subsequent review (Parag. [0024]).
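
For illustration only, the relative-volume comparison described in the cited passages of Sinkov can be sketched as follows. This is a minimal sketch under assumed data shapes, not Sinkov's implementation: the function name `attribute_segments`, the per-frame volume lists, and the device-to-user mapping are all hypothetical.

```python
from typing import Dict, List, Tuple


def attribute_segments(
    volumes: Dict[str, List[float]],
    device_owner: Dict[str, str],
) -> List[Tuple[str, int, int]]:
    """Return (speaker, start_frame, end_frame) runs, end exclusive.

    `volumes` maps a device id to per-frame volume levels of the same
    audio content; all lists are assumed to have equal length
    (a hypothetical input format chosen for this sketch).
    """
    devices = list(volumes)
    n_frames = len(volumes[devices[0]]) if devices else 0
    segments: List[Tuple[str, int, int]] = []
    for frame in range(n_frames):
        # Assumption: the device that captures the frame at the highest
        # volume belongs to the participant who is speaking.
        loudest = max(devices, key=lambda d: volumes[d][frame])
        speaker = device_owner[loudest]
        if segments and segments[-1][0] == speaker:
            # Same speaker as the previous frame: extend the current run.
            segments[-1] = (speaker, segments[-1][1], frame + 1)
        else:
            segments.append((speaker, frame, frame + 1))
    return segments


if __name__ == "__main__":
    vols = {
        "phone_a": [0.9, 0.8, 0.2, 0.1],
        "phone_b": [0.3, 0.4, 0.7, 0.6],
    }
    owners = {"phone_a": "first user", "phone_b": "second user"}
    print(attribute_segments(vols, owners))
```

The sketch ignores pauses, double-talk, and noise filtering, all of which the cited passages address separately.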

Claim 12. 	Sinkov in view of Diamant discloses the non-transitory computer readable storage medium of claim 11, 
Sinkov further discloses wherein the first set of audio data further comprises volume data corresponding to th[...], further comprising instructions that, when executed by the at least one processor, cause the computing device to analyze the first set of audio data to determine a primary speaking volume associated with the first client device by analyzing the volume data of the first set of audio data to determine the primary speaking volume (Parag. [0009] and Parag. [0017-0021]; (The art teaches determining which of a plurality of specific personal audio input audio devices correspond to which specific meeting participants, measuring volume levels at each of the personal audio input devices in response to each of the meeting participants speaking, identifying that a first particular one of the participants is speaking based on relative volume levels at each of the personal audio input devices. Recording audio information from a meeting may also include simultaneously recording audio input (i.e., first set of audio data) at the first one of the personal audio input audio devices and on the first channel and audio input at the second one of the personal audio input audio devices and on the second channel in response to the first and second meeting participants speaking at the same time. Recording audio information from a meeting may also include filtering the audio input at the first channel and the second channel to separate speech by the first participant from speech by the second participant. Filtering the audio input may be based on a distance related volume weakening coefficient, signal latency between the personal audio input devices, and/or ambient noise. In the event of double-talk when two or more speakers talk simultaneously for a period of time, the system may initially identify each speaker, and record double-talk on all principal smartphones owned by current speakers)).
 
Claim 13. 	Sinkov in view of Diamant discloses the non-transitory computer readable storage medium of claim 11,   
Sinkov doesn’t explicitly disclose the non-transitory computer readable storage medium further comprising instructions that, when executed by the at least one processor, cause the computing device to: track participation data corresponding to the first user based on the segment of the audio content; and generate a participation report based on the participation data.  
However, Diamant discloses further comprising instructions that, when executed by the at least one processor, cause the computing device to: track participation data corresponding to the first user based on the segment of the audio content; and generate a participation report based on the participation data (Parag. [0004], Parag. [0023-0024], Parag. [0082], Parag. [0109], and Parag. [0138]; (The art teaches that the conference transcript can be used by participants for reviewing various multi-modal interactions and other events of interest that happened in the conference. The conference transcript can be analyzed to provide conference participants with feedback regarding their own participation in the conference, other participants, and team/organizational trends. The art also teaches that the transcript may be analyzed using any suitable machine learning (ML) and/or artificial intelligence (AI) techniques, wherein such analysis may include, for raw audio observed during a conference, recognizing text corresponding to the raw audio, and recognizing one or more salient features of the text and/or raw audio. Non-limiting examples of salient features that may be recognized by ML and/or AI techniques include 1) an intent (e.g., an intended task of a conference participant), 2) a context (e.g., a task currently being performed by a conference participant), 3) a topic and/or 4) an action item or commitment (e.g., a task that a conference participant promises to perform) (i.e., meeting item). Also, in an example, the art teaches that the machine learning classifier may be configured to receive any other suitable transcript data automatically recorded at 211, e.g., transcribed speech audio in the form of text. 
The transcription machine may be configured to analyze the transcript to detect words having a predefined sentiment (e.g., positive, negative, “happy”, or any other suitable sentiment), in order to present a sentiment analysis summary at a companion device of a conference participant, indicating a frequency of utterance of words having the predefined sentiment)).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Sinkov to incorporate the teaching of Diamant. This would be convenient to coordinate the conference, by providing a transcript of the conference to conference participants for subsequent review, tracking arrivals and departures of conference participants, providing cues to conference participants during the conference, and/or analyzing the information in order to summarize one or more aspects of the conference for subsequent review (Parag. [0024]).
Claim 14. 	Sinkov in view of Diamant discloses the non-transitory computer readable storage medium of claim 13,   
Sinkov doesn’t explicitly disclose wherein the participation data includes at least one of a length of time spoken by the first user or a number of interruptions by the first user. 
However, Diamant discloses wherein the participation data includes at least one of a length of time spoken by the first user or a number of interruptions by the first user (Parag. [0052-0053], Parag. [0109], Parag. [0115], and Parag. [0138]; (The art teaches that FIG. 6 is a visual representation of an example output of diarization machine. In FIG. 6, a vertical axis is used to denote WHO (e.g., Bob) is speaking; the horizontal axis denotes WHEN (e.g., 30.01 s-34.87 s) that speaker is speaking; and the depth axis denotes from WHERE (e.g., 23°) that speaker is speaking)).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Sinkov to incorporate the teaching of Diamant. This would be convenient to coordinate the conference, by providing a transcript of the conference to conference participants for subsequent review, tracking arrivals and departures of conference participants, providing cues to conference participants during the conference, and/or analyzing the information in order to summarize one or more aspects of the conference for subsequent review (Parag. [0024]).
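
For illustration only, the following sketch shows one way a length of time spoken and a number of interruptions could be derived from WHO/WHEN-labeled segments of the kind described in the cited passages. It is a sketch under stated assumptions, not a computation disclosed by Diamant: the function name `participation`, the `(who, start_s, end_s)` tuple format, and the rule that an interruption is a segment starting before a different speaker's preceding segment has ended are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical segment format: (who, start in seconds, end in seconds).
Segment = Tuple[str, float, float]


def participation(segments: List[Segment]) -> Dict[str, Dict[str, float]]:
    """Summarize per-speaker speaking time and interruption count."""
    report = defaultdict(lambda: {"seconds_spoken": 0.0, "interruptions": 0})
    ordered = sorted(segments, key=lambda s: s[1])
    for i, (who, start, end) in enumerate(ordered):
        report[who]["seconds_spoken"] += end - start
        if i > 0:
            prev_who, _, prev_end = ordered[i - 1]
            # Assumed rule: starting to speak before a different speaker's
            # preceding segment has ended counts as one interruption.
            if prev_who != who and start < prev_end:
                report[who]["interruptions"] += 1
    return {who: dict(stats) for who, stats in report.items()}


if __name__ == "__main__":
    labeled = [
        ("Bob", 30.01, 34.87),
        ("Alice", 34.0, 40.0),
        ("Bob", 41.0, 43.0),
    ]
    for who, stats in participation(labeled).items():
        print(who, stats)
```

Under this assumed rule, Alice's segment overlaps the end of Bob's first segment and is counted as an interruption by Alice.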

Claim 15. 	Sinkov in view of Diamant discloses the non-transitory computer readable storage medium of claim 13,   
Sinkov doesn’t explicitly disclose further comprising instructions that, when executed by the at least one processor, cause the computing device to provide the participation report for display on the first client device.  
However, Diamant discloses further comprising instructions that, when executed by the at least one processor, cause the computing device to provide the participation report for display on the first client device (Parag. [0052-0053], Parag. [0109], Parag. [0115], and Parag. [0138]; (The art teaches that FIG. 6 is a visual representation of an example output of diarization machine. In FIG. 6, a vertical axis is used to denote WHO (e.g., Bob) is speaking; the horizontal axis denotes WHEN (e.g., 30.01 s-34.87 s) that speaker is speaking; and the depth axis denotes from WHERE (e.g., 23°) that speaker is speaking. Diarization machine 132 uses this WHO/WHEN/WHERE information to label corresponding segments 604 of the audio signal(s) 606 under analysis with labels 608. The segments 604 and/or corresponding labels may be output from the diarization machine 132 in any suitable format. The output effectively associates speech with a particular speaker during a conversation among N speakers, and allows the audio signal corresponding to each speech utterance (with WHO/WHEN/WHERE labeling/metadata) to be used for myriad downstream operations. One nonlimiting downstream operation is conversation transcription. The art also teaches that the transcript may be analyzed using any suitable machine learning (ML) and/or artificial intelligence (AI) techniques, wherein such analysis may include, for raw audio observed during a conference, recognizing text corresponding to the raw audio, and recognizing one or more salient features of the text and/or raw audio.
Non-limiting examples of salient features that may be recognized by ML and/or AI techniques include 1) an intent (e.g., an intended task of a conference participant), 2) a context (e.g., a task currently being performed by a conference participant), 3) a topic and/or 4) an action item or commitment (e.g., a task that a conference participant promises to perform) (i.e., meeting item). Also, in an example, the art teaches that the machine learning classifier may be configured to receive any other suitable transcript data automatically recorded at 211, e.g., transcribed speech audio in the form of text. The transcription machine may be configured to analyze the transcript to detect words having a predefined sentiment (e.g., positive, negative, “happy”, or any other suitable sentiment), in order to present a sentiment analysis (i.e., displayed for the user) summary at a companion device of a conference participant (i.e., first user), indicating a frequency of utterance of words having the predefined sentiment)).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Sinkov to incorporate the teaching of Diamant. This would be convenient to coordinate the conference, by providing a transcript of the conference to conference participants for subsequent review, tracking arrivals and departures of conference participants, providing cues to conference participants during the conference, and/or analyzing the information in order to summarize one or more aspects of the conference for subsequent review (Parag. [0024]).


Claim 16. 	Sinkov in view of Diamant discloses the non-transitory computer readable storage medium of claim 11,   
Sinkov doesn’t explicitly disclose the non-transitory computer readable storage medium further comprising instructions that, when executed by the at least one processor, cause the computing device to receive, from a computer application installed on the first client device, an authentication of the first user generated by submission of one or more login credentials by the first user via the first client device, wherein determining that the segment of the audio content is contributed by the first user is further based on the authentication of the first user. 
However, Diamant discloses instructions that, when executed by the at least one processor, cause the computing device to receive, from a computer application installed on the first client device, an authentication of the first user generated by submission of one or more login credentials by the first user via the first client device (Parag. [0095]; (The art teaches that Computerized intelligent assistant may be configured to track the arrival of a remote participant based on the remote participant logging in to a remote conferencing program (e.g., a messaging application, voice and/or video chat application, or any other suitable interface for remote interaction))), wherein determining that the segment of the audio content is contributed by the first user is further based on the authentication of the first user (Parag. [0052-0053], Parag. [0109], Parag. [0115], and Parag. [0138]; (The art teaches that FIG. 7 is a visual representation of an example output of diarization machine. In FIG. 6, a vertical axis is used to denote WHO (e.g., Bob) is speaking; the horizontal axis denotes WHEN (e.g., 30.01 s-34.87 s) that speaker is speaking; and the depth axis denotes from WHERE (e.g., 23°) that speaker is speaking. Diarization machine 132 uses this WHO/WHEN/WHERE information to label corresponding segments 604 of the audio signal(s) 606 under analysis with labels 608)).
It would be obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Sinkov to incorporate the teaching of Diamant. This would be convenient to coordinate the conference, by providing a transcript of the conference to conference participants for subsequent review, tracking arrivals and departures of conference participants, providing cues to conference participants during the conference, and/or analyzing the information in order to summarize one or more aspects of the conference for subsequent review (Parag. [0024]).

Claim 17. 	Sinkov discloses a system comprising: at least one processor; and a non-transitory computer readable storage medium comprising instructions that, when executed by the at least one processor (Parag. [0009-0010]), cause the system to:
receive a first set of audio data from a first client device, the first set of audio data comprising audio content corresponding to speech from a plurality of participants of a meeting captured in a first plurality of volumes (Parag. [0009] and Parag. [0017-0021]; (The art teaches determining which of a plurality of specific personal audio input audio devices correspond to which specific meeting participants, measuring volume levels at each of the personal audio input devices in response to each of the meeting participants speaking, identifying that a first particular one of the participants is speaking based on relative volume levels at each of the personal audio input devices. Recording audio information from a meeting may also include simultaneously recording audio input (i.e., first set of audio data) at the first one of the personal audio input audio devices and on the first channel and audio input at the second one of the personal audio input audio devices and on the second channel in response to the first and second meeting participants speaking at the same time. Recording audio information from a meeting may also include filtering the audio input at the first channel and the second channel to separate speech by the first participant from speech by the second participant. Filtering the audio input may be based on a distance related volume weakening coefficient, signal latency between the personal audio input devices, and/or ambient noise. In the event of double-talk when two or more speakers talk simultaneously for a period of time, the system may initially identify each speaker, and record double-talk on all principal smartphones owned by current speakers. After a double talk episode has ended, the system may attempt clearing each recorded fragment from double-talk by non-owners prior to placing it into the corresponding speaker channel. 
Such clearing may be facilitated by simultaneous processing of recorded fragments from all principal phones engaged in the double-talk));  
receive a second set of audio data from a second client device, the second set of audio data comprising the audio content corresponding to the speech from the plurality of participants of the meeting captured in a second plurality of volumes (Parag. [0009] and Parag. [0017-0021]; (The art teaches determining which of a plurality of specific personal audio input audio devices correspond to which specific meeting participants, measuring volume levels at each of the personal audio input devices in response to each of the meeting participants speaking, identifying that a first particular one of the participants is speaking based on relative volume levels at each of the personal audio input devices. Recording audio information from a meeting may also include simultaneously recording audio input at the first one of the personal audio input audio devices and on the first channel and audio input (i.e., second set of audio data) at the second one of the personal audio input audio devices and on the second channel in response to the first and second meeting participants speaking at the same time. Recording audio information from a meeting may also include filtering the audio input at the first channel and the second channel to separate speech by the first participant from speech by the second participant. Filtering the audio input may be based on a distance related volume weakening coefficient, signal latency between the personal audio input devices, and/or ambient noise. In the event of double-talk when two or more speakers talk simultaneously for a period of time, the system may initially identify each speaker, and record double-talk on all principal smartphones owned by current speakers. After a double talk episode has ended, the system may attempt clearing each recorded fragment from double-talk by non-owners prior to placing it into the corresponding speaker channel. 
Such clearing may be facilitated by simultaneous processing of recorded fragments from all principal phones engaged in the double-talk));   
compare the first plurality of volumes and the second plurality of volumes (Parag. [0009] and Parag. [0018-0021]; (The art teaches determining which of a plurality of specific personal audio input audio devices correspond to which specific meeting participants, measuring volume levels at each of the personal audio input devices in response to each of the meeting participants speaking, identifying that a first particular one of the participants is speaking based on relative volume levels at each of the personal audio input devices. Recording audio information from a meeting may also include simultaneously recording audio input at the first one of the personal audio input audio devices and on the first channel and audio input at the second one of the personal audio input audio devices and on the second channel in response to the first and second meeting participants speaking at the same time. Recording audio information from a meeting may also include filtering the audio input at the first channel and the second channel to separate speech by the first participant from speech by the second participant. Filtering the audio input may be based on a distance related volume weakening coefficient, signal latency between the personal audio input devices, and/or ambient noise. In the event of double-talk when two or more speakers talk simultaneously for a period of time, the system may initially identify each speaker, and record double-talk on all principal smartphones owned by current speakers. The availability of two symmetric cross-recordings may facilitate assessing the coefficients (after an initial cancellation of ambient noises) and filtering out the weaker components using, for example, echo cancellation technique. 
Even if the double-talk suppression process has not fully succeeded, each channel unambiguously represents a corresponding speaker and any mix of speaker voices may be instantly identified in a full record by referring to the simultaneous recording by other principal phone(s), i.e. by switching channels of simultaneous speakers));
determining, based on comparing the first plurality of volumes and the second plurality of volumes, that a segment of the audio content is contributed by the first user (Parag. [0009] and Parag. [0017-0021]; (The art teaches determining which of a plurality of specific personal audio input audio devices correspond to which specific meeting participants, measuring volume levels at each of the personal audio input devices in response to each of the meeting participants speaking, identifying that a first particular one of the participants is speaking based on relative volume levels at each of the personal audio input devices. The art teaches that once the current speaker is identified, a particular smartphone of the speaker is marked by the system as a principal recording device and the system tracks a corresponding fragment of audio recording by that particular smartphone until a sufficiently long pause when the speaker either stopped talking to change the subject or for other reason or until the current speaker is replaced by another speaker. In either case, the fragment is picked by the system and added to the current channel of the speaker. Each channel of each speaker may therefore include subsequent fragments by a single speaker, uninterrupted by others and separated by pauses (such fragments may of course may be merged during post-processing of the meeting recording) or fragments separated in time by audio fragments from other speakers recorded in their channels))
Sinkov doesn’t explicitly disclose determine that an action item comprising a task to be completed is associated with a first user of the first client device by determining that a segment of the audio content that includes a description of the action item is contributed by the first user; and associate the action item with the first user.
However, Diamant discloses:
determine that an action item comprising a task to be completed is associated with a first user of the first client device by determining that a segment of the audio content that includes a description of the action item is contributed by the first user (Parag. [0052-0053], Parag. [0109], Parag. [0115], and Parag. [0138]; (The art teaches that FIG. 7 is a visual representation of an example output of diarization machine. In FIG. 6, a vertical axis is used to denote WHO (e.g., Bob) is speaking (i.e., identifying the user of the client device); the horizontal axis denotes WHEN (e.g., 30.01 s-34.87 s) that speaker is speaking; and the depth axis denotes from WHERE (e.g., 23°) that speaker is speaking. Diarization machine 132 uses this WHO/WHEN/WHERE information to label corresponding segments 604 of the audio signal(s) 606 under analysis with labels 608. The segments 604 and/or corresponding labels may be output from the diarization machine 132 in any suitable format. The output effectively associates speech with a particular speaker (i.e., identifying the user of the client device) during a conversation among N speakers, and allows the audio signal corresponding to each speech utterance (with WHO/WHEN/WHERE labeling/metadata) to be used for myriad downstream operations. One nonlimiting downstream operation is conversation transcription. FIG. 1B teaches a computerized conference assistant 106 may include a speech recognition machine 130. As shown in FIG. 8, the speech recognition machine 130 may be configured to translate an audio signal of recorded speech (e.g., signals 112, beamformed signal 150, signal 606, and/or segments 604) into text 800. 
The art also teaches that in some examples, transcribed speech and/or speaker identity information may be gathered by computerized intelligent assistant 1300 in real time, in order to build the transcript in real time, and/or in order to provide notifications to conference participants about the transcribed speech in real time. In some examples, computerized intelligent assistant 1300 may be configured, for a stream of speech audio captured by a microphone, to identify a current speaker and to analyze the speech audio in order to transcribe speech text, substantially in parallel and/or in real time, so that speaker identity and transcribed speech text may be independently available. Accordingly, computerized intelligent assistant 1300 may be able to provide notifications to the conference participants in real time (e.g., for display at companion devices) indicating that another conference participant is currently speaking and including transcribed speech of the other conference participant, even before the other conference participant has finished speaking. The art also teaches that the transcript may be analyzed using any suitable machine learning (ML) and/or artificial intelligence (AI) techniques, wherein such analysis may include, for raw audio observed during a conference, recognizing text corresponding to the raw audio, and recognizing one or more salient features of the text and/or raw audio. Non-limiting examples of salient features that may be recognized by ML and/or AI techniques include 1) an intent (e.g., an intended task of a conference participant), 2) a context (e.g., a task currently being performed by a conference participant), 3) a topic and/or 4) an action item or commitment (e.g., a task that a conference participant promises to perform))); and 
associate the action item with the first user (Parag. [0052-0053], Parag. [0109], Parag. [0115], and Parag. [0138]; (The art teaches that the transcript may be analyzed using any suitable machine learning (ML) and/or artificial intelligence (AI) techniques, wherein such analysis may include, for raw audio observed during a conference, recognizing text corresponding to the raw audio, and recognizing one or more salient features of the text and/or raw audio. Non-limiting examples of salient features that may be recognized by ML and/or AI techniques include 1) an intent (e.g., an intended task of a conference participant), 2) a context (e.g., a task currently being performed by a conference participant), 3) a topic and/or 4) an action item or commitment (e.g., a task that a conference participant promises to perform) (i.e., action item))).
It would be obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Sinkov to incorporate the teaching of Diamant. This would be convenient to coordinate the conference, by providing a transcript of the conference to conference participants for subsequent review, tracking arrivals and departures of conference participants, providing cues to conference participants during the conference, and/or analyzing the information in order to summarize one or more aspects of the conference for subsequent review (Parag. [0024]).
 
Claim 18. 	Sinkov in view of Diamant discloses the system of claim 17, 
Sinkov doesn’t explicitly disclose the system further comprising instructions that, when executed by the at least one processor, cause the system to generate a transcript of the audio content based on at least one of the first set of audio data or the second set of audio data. 
However, Diamant discloses instructions that, when executed by the at least one processor, cause the system to generate a transcript of the audio content based on at least one of the first set of audio data or the second set of audio data (Parag. [0060]; (The art teaches that Labeled and/or partially labelled audio segments may be used to not only determine which of a plurality of N speakers is responsible for an utterance, but also translate the utterance into a textural representation for downstream operations, such as transcription)).
It would be obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Sinkov to incorporate the teaching of Diamant. This would be convenient to coordinate the conference, by providing a transcript of the conference to conference participants for subsequent review, tracking arrivals and departures of conference participants, providing cues to conference participants during the conference, and/or analyzing the information in order to summarize one or more aspects of the conference for subsequent review (Parag. [0024]).
 
Claim 19. 	Sinkov in view of Diamant discloses the system of claim 17, 
Sinkov further discloses wherein the instructions, when executed by the at least one processor, causes the system to receive the first set of audio data from th(Parag. [0009] and Parag. [0017-0021]; (The art teaches determining which of a plurality of specific personal audio input audio devices correspond to which specific meeting participants, measuring volume levels at each of the personal audio input devices in response to each of the meeting participants speaking, identifying that a first particular one of the participants is speaking based on relative volume levels at each of the personal audio input devices. Recording audio information from a meeting may also include simultaneously recording audio input (i.e., first set of audio data) at the first one of the personal audio input audio devices and on the first channel and audio input at the second one of the personal audio input audio devices and on the second channel in response to the first and second meeting participants speaking at the same time. Recording audio information from a meeting may also include filtering the audio input at the first channel and the second channel to separate speech by the first participant from speech by the second participant. Filtering the audio input may be based on a distance related volume weakening coefficient, signal latency between the personal audio input devices, and/or ambient noise. In the event of double-talk when two or more speakers talk simultaneously for a period of time, the system may initially identify each speaker, and record double-talk on all principal smartphones owned by current speakers. After a double talk episode has ended, the system may attempt clearing each recorded fragment from double-talk by non-owners prior to placing it into the corresponding speaker channel. Such clearing may be facilitated by simultaneous processing of recorded fragments from all principal phones engaged in the double-talk)). 

Claim 20. 	Sinkov in view of Diamant discloses the system of claim 17, 
Sinkov further discloses further comprising instructions that, when executed by the at least one processor, cause the system to determine a primary speaking volume associated with the first client device by: determining a second primary speaking volume associated with the second client device based on comparing the first plurality of volumes and the second plurality of volumes, and determining the primary speaking volume associated with the first client device based on the second primary speaking volume associated with the second client device (Parag. [0009] and Parag. [0017-0021]; (The art teaches determining which of a plurality of specific personal audio input audio devices correspond to which specific meeting participants, measuring volume levels at each of the personal audio input devices in response to each of the meeting participants speaking, identifying that a first particular one of the participants is speaking based on relative volume levels at each of the personal audio input devices. Recording audio information from a meeting may also include simultaneously recording audio input at the first one of the personal audio input audio devices and on the first channel and audio input at the second one of the personal audio input audio devices and on the second channel in response to the first and second meeting participants speaking at the same time. Recording audio information from a meeting may also include filtering the audio input at the first channel and the second channel to separate speech by the first participant from speech by the second participant. Filtering the audio input may be based on a distance related volume weakening coefficient, signal latency between the personal audio input devices, and/or ambient noise. In the event of double-talk when two or more speakers talk simultaneously for a period of time, the system may initially identify each speaker, and record double-talk on all principal smartphones owned by current speakers. 
The availability of two symmetric cross-recordings may facilitate assessing the coefficients (after an initial cancellation of ambient noises) and filtering out the weaker components using, for example, echo cancellation technique. Even if the double-talk suppression process has not fully succeeded, each channel unambiguously represents a corresponding speaker and any mix of speaker voices may be instantly identified in a full record by referring to the simultaneous recording by other principal phone(s), i.e. by switching channels of simultaneous speakers. The art teaches that once the current speaker is identified, a particular smartphone of the speaker is marked by the system as a principal recording device and the system tracks a corresponding fragment of audio recording by that particular smartphone until a sufficiently long pause when the speaker either stopped talking to change the subject or for other reason or until the current speaker is replaced by another speaker. In either case, the fragment is picked by the system and added to the current channel of the speaker. Each channel of each speaker may therefore include subsequent fragments by a single speaker, uninterrupted by others and separated by pauses (such fragments may of course may be merged during post-processing of the meeting recording) or fragments separated in time by audio fragments from other speakers recorded in their channels)).

Claim 21. 	Sinkov in view of Diamant discloses the computer-implemented method of claim 1,  
Sinkov further discloses wherein: receiving the first set of audio data comprising the audio content captured in the first plurality of volumes comprises receiving the first set of audio data comprising a first segment of speech from a first meeting participant captured in a first volume by the first client device; and receiving the second set of audio data comprising the audio content captured in the second plurality of volumes comprises receiving the second set of audio data comprising the first segment of speech from the first meeting participant captured in a second volume by the second client device. (Parag. [0009] and Parag. [0017-0021]; (The art teaches determining which of a plurality of specific personal audio input audio devices correspond to which specific meeting participants, measuring volume levels at each of the personal audio input devices in response to each of the meeting participants speaking, identifying that a first particular one of the participants is speaking based on relative volume levels at each of the personal audio input devices. Recording audio information from a meeting may also include simultaneously recording audio input (i.e., first set of audio data) at the first one of the personal audio input audio devices and on the first channel and audio input at the second one of the personal audio input audio devices and on the second channel in response to the first and second meeting participants speaking at the same time. Recording audio information from a meeting may also include filtering the audio input at the first channel and the second channel to separate speech by the first participant from speech by the second participant. Filtering the audio input may be based on a distance related volume weakening coefficient, signal latency between the personal audio input devices, and/or ambient noise. 
In the event of double-talk when two or more speakers talk simultaneously for a period of time, the system may initially identify each speaker, and record double-talk on all principal smartphones owned by current speakers. After a double talk episode has ended, the system may attempt clearing each recorded fragment from double-talk by non-owners prior to placing it into the corresponding speaker channel. Such clearing may be facilitated by simultaneous processing of recorded fragments from all principal phones engaged in the double-talk)).

Claim 23. 	Sinkov in view of Diamant discloses the computer-implemented method of claim 21,  
Sinkov further discloses wherein determining, based on comparing the first plurality of volumes and the second plurality of volumes, that the segment of the audio content is contributed by the first user comprises: determining, based on comparing the first plurality of volumes and the second plurality of volumes, that the first volume captured by the first client device corresponds to a primary speaking volume for the first client device; determining that the first volume corresponds to speech from the first user of the first client device (Parag. [0009] and Parag. [0017-0021]; (The art teaches determining which of a plurality of specific personal audio input audio devices correspond to which specific meeting participants, measuring volume levels at each of the personal audio input devices in response to each of the meeting participants speaking, identifying that a first particular one of the participants is speaking based on relative volume levels at each of the personal audio input devices. The art teaches that once the current speaker is identified, a particular smartphone of the speaker is marked by the system as a principal recording device and the system tracks a corresponding fragment of audio recording by that particular smartphone until a sufficiently long pause when the speaker either stopped talking to change the subject or for other reason or until the current speaker is replaced by another speaker. In either case, the fragment is picked by the system and added to the current channel of the speaker. Each channel of each speaker may therefore include subsequent fragments by a single speaker, uninterrupted by others and separated by pauses (such fragments may of course may be merged during post-processing of the meeting recording) or fragments separated in time by audio fragments from other speakers recorded in their channels)).
Sinkov doesn’t explicitly disclose that the segment of the audio content includes the description of the action item; and determining that the segment of the audio content that includes the description of the action item is captured by the first client device in the first volume.
However, Diamant discloses that the segment of the audio content includes the description of the action item; and determining that the segment of the audio content that includes the description of the action item is captured by the first client device in the first volume (Parag. [0052-0053], Parag. [0109], Parag. [0115], and Parag. [0138]; (The art teaches that FIG. 7 is a visual representation of an example output of diarization machine. In FIG. 6, a vertical axis is used to denote WHO (e.g., Bob) is speaking (i.e., identifying the user of the client device); the horizontal axis denotes WHEN (e.g., 30.01 s-34.87 s) that speaker is speaking; and the depth axis denotes from WHERE (e.g., 23°) that speaker is speaking. Diarization machine 132 uses this WHO/WHEN/WHERE information to label corresponding segments 604 of the audio signal(s) 606 under analysis with labels 608. The segments 604 and/or corresponding labels may be output from the diarization machine 132 in any suitable format. The output effectively associates speech with a particular speaker (i.e., identifying the user of the client device) during a conversation among N speakers, and allows the audio signal corresponding to each speech utterance (with WHO/WHEN/WHERE labeling/metadata) to be used for myriad downstream operations. One nonlimiting downstream operation is conversation transcription. FIG. 1B teaches a computerized conference assistant 106 may include a speech recognition machine 130. As shown in FIG. 8, the speech recognition machine 130 may be configured to translate an audio signal of recorded speech (e.g., signals 112, beamformed signal 150, signal 606, and/or segments 604) into text 800. 
The art also teaches that in some examples, transcribed speech and/or speaker identity information may be gathered by computerized intelligent assistant 1300 in real time, in order to build the transcript in real time, and/or in order to provide notifications to conference participants about the transcribed speech in real time. In some examples, computerized intelligent assistant 1300 may be configured, for a stream of speech audio captured by a microphone, to identify a current speaker and to analyze the speech audio in order to transcribe speech text, substantially in parallel and/or in real time, so that speaker identity and transcribed speech text may be independently available. Accordingly, computerized intelligent assistant 1300 may be able to provide notifications to the conference participants in real time (e.g., for display at companion devices) indicating that another conference participant is currently speaking and including transcribed speech of the other conference participant, even before the other conference participant has finished speaking. The art also teaches that the transcript may be analyzed using any suitable machine learning (ML) and/or artificial intelligence (AI) techniques, wherein such analysis may include, for raw audio observed during a conference, recognizing text corresponding to the raw audio, and recognizing one or more salient features of the text and/or raw audio. Non-limiting examples of salient features that may be recognized by ML and/or AI techniques include 1) an intent (e.g., an intended task of a conference participant), 2) a context (e.g., a task currently being performed by a conference participant), 3) a topic and/or 4) an action item or commitment (e.g., a task that a conference participant promises to perform))).
It would be obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Sinkov to incorporate the teaching of Diamant. This would be convenient to coordinate the conference, by providing a transcript of the conference to conference participants for subsequent review, tracking arrivals and departures of conference participants, providing cues to conference participants during the conference, and/or analyzing the information in order to summarize one or more aspects of the conference for subsequent review (Parag. [0024]).

Claim 22 is rejected under 35 U.S.C. 103 as being unpatentable over Sinkov et al. (Pub. No. US 2019/0200121), hereinafter Sinkov; in view of Diamant (Pub. No. US 2019/0341050), hereinafter Diamant, and in view of Gleim (Pub. No. US 2015/0063553).

Claim 22. 	Sinkov in view of Diamant discloses the computer-implemented method of claim 21,  
The combination doesn’t explicitly disclose wherein: the first volume captured by the first client device corresponds to a first distance between the first meeting participant and the first client device; and the second volume captured by the second client device corresponds to a second distance between the first meeting participant and the second client device.  
However, Gleim discloses wherein: the first volume captured by the first client device corresponds to a first distance between the first meeting participant and the first client device; and the second volume captured by the second client device corresponds to a second distance between the first meeting participant and the second client device (Parag. [0051]; (The art teaches that the volume level can tell us how loud someone is talking, but it also tells us how far a speaker is from their physical microphone. For 3D sound conferencing, we intentionally level the sound to remove the information about how far the speaker is from their physical microphone so that we can then use an attenuator to intentionally and negative or positive volume information that communicates the distance between the speaker (speaking participant) and the listener (listening participant) in the mapped room. i.e., the volume corresponds to the distance between the meeting participant and the user device, as equivalent to the applicant’s definition)).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify the combination to incorporate the teaching of Gleim. Doing so would enhance a virtual learning system in which the participant can feel that he or she is really experiencing an actual classroom environment, with each user or participant having the ability to distinguish between multiple voices (Parag. [0002]).

Conclusion
		The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. Zhao et al. (US 2015/0201082) – Related art in the area of conferencing systems, (Parag. [0003], as the distance between participants using the conference device and the conference device increases, it can become increasingly difficult for remote participants to hear participants using the conference device that are a greater distance from the communication device. Moreover, an apparent volume of participants' voices sharing the conference unit to the remote participant(s) can vary according to a distance each local participant is to the conference device). 

Any inquiry concerning this communication or earlier communications from the examiner should be directed to ABDELBASST TALIOUA whose telephone number is (571)272-4061.  The examiner can normally be reached on Monday-Thursday 7:30 am - 5:30 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, William Trost, can be reached at 571-272-7872.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/A.T./Examiner, Art Unit 2442
/WILLIAM G TROST IV/Supervisory Patent Examiner, Art Unit 2442