DETAILED ACTION

Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Information Disclosure Statement
The information disclosure statements (IDS) submitted on 10/07/2021, 11/11/2021, 12/24/2021, and 01/20/2022 are in compliance with the provisions of 37 CFR 1.97. Accordingly, the information disclosure statements are being considered by the examiner.

Response to Amendment
The amendments filed on November 16, 2021 have been entered.
Claims 1, 9-11, 17-18, and 20 have been amended.

Response to Arguments
Applicant’s arguments filed on November 16, 2021 have been considered but are moot in view of the new grounds of rejection set forth in the current Office action.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Sinkov et al. (Pub. No. US 2019/0200121), hereinafter Sinkov; in view of Diamant (Pub. No. US 2019/0341050), hereinafter Diamant.

Claim 1. 	Sinkov discloses a computer-implemented method comprising: 
		receiving, by a digital content management system, a first set of audio data from a first client device, the first set of audio data comprising audio content corresponding to speech from a plurality of participants of a meeting captured in a first plurality of volumes (Parag. [0009] and Parag. [0017-0021]; (The art teaches determining which of a plurality of specific personal audio input audio devices correspond to which specific meeting participants, measuring volume levels at each of the personal audio input devices in response to each of the ; 
receiving, by the digital content management system, a second set of audio data from a second client device, the second set of audio data comprising the audio content corresponding to the speech from the plurality of participants of the meeting captured in a second plurality of volumes (Parag. [0009] and Parag. [0017-0021]; (The art teaches determining which of a plurality of specific personal audio input audio devices correspond to which specific meeting participants, measuring volume levels at each of the personal audio input devices in response to each of the meeting participants speaking, identifying that a first particular one of the participants is speaking based on relative volume levels at each of the personal audio input devices. Recording audio information from a meeting may also include simultaneously recording audio input at the first one of the personal audio input audio devices and on the first channel and audio input (i.e., second set of audio data) at the second one of the personal audio input audio devices and on the second channel in response to the first and second meeting participants speaking at the same time. Recording audio information from a meeting may also include filtering the audio input at the first channel and the second channel to separate speech by the first participant from speech by the second participant. Filtering the audio input may be based ; 
determining a segment of the audio content contributed by a first user of the first client device by analyzing, by the digital content management system, the first set of audio data and the second set of audio data (Parag. [0017-0021]; (The art teaches recording audio information from a meeting may also include simultaneously recording audio input at the first one of the personal audio input audio devices and on the first channel and audio input at the second one of the personal audio input audio devices and on the second channel in response to the first and second meeting participants speaking at the same time. Recording audio information from a meeting may also include filtering the audio input at the first channel and the second channel to separate speech by the first participant from speech by the second participant. Filtering the audio input may be based on a distance related volume weakening coefficient, signal latency between the personal audio input devices, and/or ambient noise. In the event of double-talk when two or more speakers talk simultaneously for a period of time, the system may initially identify each speaker, and record double-talk on all principal smartphones owned by current speakers. Once the current speaker is identified, a particular smartphone of the speaker is marked by the system as a principal recording device and the system tracks a corresponding fragment of audio recording by that particular smartphone until a sufficiently long pause when the speaker either stopped talking to change the subject or for other reason or until the current speaker is replaced by another speaker. In either case, the fragment is picked by the system and added to the current channel of the speaker)) to:  
		determine a primary speaking volume associated with the first client device by comparing the first plurality of volumes and the second plurality of volumes (Parag. [0009] and Parag. [0018-0021]; (The art teaches determining which of a plurality of specific personal audio input audio devices correspond to which specific meeting participants, measuring volume ; 
		identify the segment of the audio content based on the primary speaking volume associated with the first client device; and associat (Parag. [0017-0021]; (The art teaches that once the current speaker is identified, a particular smartphone of the speaker is marked by the system as a principal recording device and the system tracks a corresponding fragment of audio recording by that particular smartphone until a sufficiently long pause when the speaker either stopped talking to change the subject or for other reason or until the current speaker is replaced by another speaker. In either case, the fragment is picked by the system and added to the current channel of the speaker. Each channel of each speaker may therefore include subsequent fragments by a single speaker, uninterrupted by others and separated by pauses (such fragments may of course may be merged during post-processing of the meeting recording) or fragments separated in time by audio fragments from other speakers recorded in their channels. In addition to volume .
Sinkov does not explicitly disclose analyzing a transcript of the audio content comprising text representing the speech from the plurality of participants to identify a subset of text representing speech from the segment of the audio content; generating, by the digital content management system, a digital meeting item based on the subset of text representing the speech from the segment of the audio content; and associating, by the digital content management system, the digital meeting item with the first user based on associating the first user with the segment of the audio content.
However, Diamant discloses:
analyzing a transcript of the audio content comprising text representing the speech from the plurality of participants to identify a subset of text representing speech from the segment of the audio content (Parag. [0052-0053], Parag. [0109], Parag. [0115], and Parag. [0138]; (The art teaches that FIG. 7 is a visual representation of an example output of diarization machine. In FIG. 6, a vertical axis is used to denote WHO (e.g., Bob) is speaking; the horizontal axis denotes WHEN (e.g., 30.01 s-34.87 s) that speaker is speaking; and the depth axis denotes from WHERE (e.g., 23°) that speaker is speaking. Diarization machine 132 uses this WHO/WHEN/WHERE information to label corresponding segments 604 of the audio signal(s) 606 under analysis with labels 608. The segments 604 and/or corresponding labels may be output from the diarization machine 132 in any suitable format. The output effectively associates speech with a particular speaker during a conversation among N speakers, and allows the audio signal corresponding to each speech utterance (with WHO/WHEN/WHERE labeling/metadata) to be used for myriad downstream operations. One nonlimiting downstream operation is conversation transcription. FIG. 1B teaches a computerized conference assistant 106 may include a speech recognition machine 130. As shown in FIG. 8, the speech recognition machine 130 may be configured to translate an audio signal of recorded speech (e.g., 112, beamformed signal 150, signal 606, and/or segments 604) into text 800. The art also teaches that in some examples, transcribed speech and/or speaker identity information may be gathered by computerized intelligent assistant 1300 in real time, in order to build the transcript in real time, and/or in order to provide notifications to conference participants about the transcribed speech in real time. 
In some examples, computerized intelligent assistant 1300 may be configured, for a stream of speech audio captured by a microphone, to identify a current speaker and to analyze the speech audio in order to transcribe speech text, substantially in parallel and/or in real time, so that speaker identity and transcribed speech text may be independently available. Accordingly, computerized intelligent assistant 1300 may be able to provide notifications to the conference participants in real time (e.g., for display at companion devices) indicating that another conference participant is currently speaking and including transcribed speech of the other conference participant, even before the other conference participant has finished speaking. The art also teaches that the transcript may be analyzed using any suitable machine learning (ML) and/or artificial intelligence (AI) techniques, wherein such analysis may include, for raw audio observed during a conference, recognizing text corresponding to the raw audio, and recognizing one or more salient features of the text and/or raw audio. Non-limiting examples of salient features that may be recognized by ML and/or AI techniques include 1) an intent (e.g., an intended task of a conference participant), 2) a context (e.g., a task currently being performed by a conference participant), 3) a topic and/or 4) an action item or commitment (e.g., a task that a conference participant promises to perform))); 
generating, by the digital content management system, a digital meeting item based on the subset of text representing the speech from the segment of the audio content (Parag. [0004], Parag. [0024], Parag. [0082], Parag. [0109], and Parag. [0138]; (The art teaches that the conference transcript can be used by participants for reviewing various multi-modal interactions and other events of interest that happened in the conference. The conference transcript can be analyzed to provide conference participants with feedback regarding their own participation in the conference, other participants, and team/organizational trends. The art also teaches that the transcript may be analyzed using any suitable machine learning (ML) and/or artificial intelligence (AI) techniques, wherein such analysis may include, for raw audio observed during a conference, recognizing text corresponding to the raw audio, and recognizing one or more salient features of the text and/or raw audio. Non-limiting examples of salient features that may be recognized by ML and/or AI techniques include 1) an intent (e.g., an intended task of a conference participant), 2) a context (e.g., a task currently being performed by a conference participant), 3) a topic and/or 4) an action item or commitment (e.g., a task that a conference participant promises to perform) (i.e., meeting item). Also, in an example, the art teaches that the machine learning classifier may be configured to receive any other suitable transcript data automatically recorded at 211, e.g., transcribed speech audio in the form of text. The transcription machine may be configured to analyze the transcript to detect words having a predefined sentiment (e.g., positive, negative, “happy”, or any other suitable sentiment), in order to present a sentiment analysis summary at a companion device of a conference participant, indicating a frequency of utterance of words having the predefined sentiment)); and
associating, by the digital content management system, the digital meeting item with the first user based on associating the first user with the segment of the audio content (Parag. [0052-0053], Parag. [0109], Parag. [0115], and Parag. [0138]; (The art teaches that FIG. 7 is a visual representation of an example output of diarization machine. In FIG. 6, a vertical axis is used to denote WHO (e.g., Bob) is speaking; the horizontal axis denotes WHEN (e.g., 30.01 s-34.87 s) that speaker is speaking; and the depth axis denotes from WHERE (e.g., 23°) that speaker is speaking. Diarization machine 132 uses this WHO/WHEN/WHERE information to label corresponding segments 604 of the audio signal(s) 606 under analysis with labels 608. The segments 604 and/or corresponding labels may be output from the diarization machine 132 in any suitable format. The output effectively associates speech with a particular speaker during a conversation among N speakers, and allows the audio signal corresponding to each speech utterance (with WHO/WHEN/WHERE labeling/metadata) to be used for myriad downstream operations. One nonlimiting downstream operation is conversation transcription. The art also teaches that the transcript may be analyzed using any suitable machine learning (ML) and/or artificial intelligence (AI) techniques, wherein such analysis may include, for raw audio observed during a conference, recognizing text corresponding to the raw audio, and recognizing one or more salient features of the text and/or raw audio. Non-limiting examples of salient features that may be recognized by ML and/or AI techniques include 1) an intent (e.g., an intended task of a conference participant), 2) a context (e.g., a task currently being performed by a conference participant), 3) a topic and/or 4) an action item or commitment (e.g., a task that a conference participant promises to perform) (i.e., meeting item). 
Also, in an example, the art teaches that the machine learning classifier may be configured to receive any other suitable transcript data automatically recorded at 211, e.g., transcribed speech audio in the form of text. The transcription machine may be configured to analyze the transcript to detect words having a predefined sentiment (e.g., positive, negative, “happy”, or any other suitable sentiment), in order to present a sentiment analysis summary at a companion device of a conference participant (i.e., first user), indicating a frequency of utterance of words having the predefined sentiment)).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Sinkov to incorporate the teaching of Diamant. This would make it convenient to coordinate the conference by providing a transcript of the conference to conference participants for subsequent review, tracking arrivals and departures of conference participants, providing cues to conference participants during the conference, and/or analyzing the information in order to summarize one or more aspects of the conference for subsequent review (Parag. [0024]).
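
For illustration only, the volume-comparison approach relied upon above (identifying which participant is speaking from the relative volume levels measured at each personal audio input device, then tracking the corresponding fragment) can be sketched as follows. This is a minimal sketch under stated assumptions and is not the implementation of Sinkov or of the claimed invention: the function name attribute_segments, the device identifiers, and the representation of each recording as a list of per-window volume readings are all hypothetical.

```python
from typing import Dict, List, Tuple


def attribute_segments(volumes: Dict[str, List[float]]) -> List[Tuple[str, int, int]]:
    """Group consecutive time windows by the device that recorded the loudest signal.

    `volumes` maps a hypothetical device id to its per-window volume readings.
    Returns (device_id, start_window, end_window) tuples, end exclusive.
    """
    lengths = {len(v) for v in volumes.values()}
    if len(lengths) != 1:
        raise ValueError("all devices must report the same number of windows")
    window_count = lengths.pop()

    segments: List[Tuple[str, int, int]] = []
    for i in range(window_count):
        # Assumption: the loudest device in a window belongs to the current speaker.
        loudest = max(volumes, key=lambda device: volumes[device][i])
        if segments and segments[-1][0] == loudest:
            device, start, _ = segments[-1]
            segments[-1] = (device, start, i + 1)  # extend the running fragment
        else:
            segments.append((loudest, i, i + 1))  # speaker changed: new fragment
    return segments
```

The sketch deliberately omits the filtering that the cited paragraphs describe (distance-related weakening coefficients, signal latency, ambient noise, double-talk handling); it only shows the comparison of two or more sets of volumes to pick a primary device per window.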

Claim 2. 	Sinkov in view of Diamant discloses the computer-implemented method of claim 1, 
Sinkov further discloses wherein: the first set of audio data further comprises a time-based record of the first plurality of volumes captured by the first client device; and analyzing the first set of audio data to determine the primary speaking volume associated with the first client device comprises analyzing the time-based record of the first plurality of volumes to determine the primary speaking volume (Parag. [0009] and Parag. [0017-0021]; (The art teaches determining which of a plurality of specific personal audio input audio devices correspond to which specific meeting participants, measuring volume levels at each of the personal audio input devices in response to each of the meeting participants speaking, identifying that a first particular one of the participants is speaking based on relative volume levels at each of the personal audio input devices. Recording audio information from a meeting may also include simultaneously recording audio input (i.e., first set of audio data) at the first one of the personal audio input audio devices and on the first channel and audio input at the second one of the personal audio input audio devices and on the second channel in response to the first and second meeting participants speaking at the same time. Recording audio information from a meeting may also include filtering the audio input at the first channel and the second .  
 
Claim 3. 	Sinkov in view of Diamant discloses the computer-implemented method of claim 1,  
Sinkov further discloses wherein analyzing the first set of audio data and the second set of audio data to determine the primary speaking volume associated with the first client device comprises: identifying a highest speaking volume as the primary speaking volume based on comparing th(Parag. [0009] and Parag. [0018-0021]; (The art teaches determining which of a plurality of specific personal audio input audio devices correspond to which specific meeting participants, measuring volume levels at each of the personal audio input devices in response to each of the meeting participants speaking, identifying that a first particular one of the participants is speaking based on relative volume levels at each of the personal audio input devices. Recording audio information from a meeting may also include simultaneously recording audio input at the first one of the personal audio input audio devices and on the first channel and audio input at the second one of the personal audio input audio devices and on the second channel in response to the first and second meeting participants speaking at the same time. Recording audio information from a meeting may also include filtering the audio input at the first channel and the second channel to separate speech by the first participant from speech by the second participant. Filtering the audio input may be based on a distance related volume weakening coefficient, signal latency between the personal audio input devices, and/or ambient noise. In the event of double-talk when two or more speakers talk simultaneously for a period of time, the system may initially identify each speaker, and record double-talk on all principal smartphones owned by current speakers. The availability of two symmetric cross-recordings may facilitate assessing the coefficients (after an initial cancellation of ambient noises) and filtering out the weaker components using, for example, echo cancellation technique. Even if the double-talk suppression .  
Claim 4. 	Sinkov in view of Diamant discloses the computer-implemented method of claim 1,  
Sinkov does not explicitly disclose the computer-implemented method further comprising receiving, from a computer application installed on the first client device, an authentication of the first user, wherein associating the first user with the segment of the audio content is further based on the authentication of the first user.
However, Diamant discloses:  
receiving, from a computer application installed on the first client device, an authentication of the first user (Parag. [0095]; (The art teaches that Computerized intelligent assistant may be configured to track the arrival of a remote participant based on the remote participant logging in to a remote conferencing program (e.g., a messaging application, voice and/or video chat application, or any other suitable interface for remote interaction))), wherein associating the first user with the segment of the audio content is further based on the authentication of the first user (Parag. [0052-0053], Parag. [0109], Parag. [0115], and Parag. [0138]; (The art teaches that FIG. 7 is a visual representation of an example output of diarization machine. In FIG. 6, a vertical axis is used to denote WHO (e.g., Bob) is speaking; the horizontal axis denotes WHEN (e.g., 30.01 s-34.87 s) that speaker is speaking; and the depth axis denotes from WHERE (e.g., 23°) that speaker is speaking. Diarization machine 132 uses this WHO/WHEN/WHERE information to label corresponding segments 604 of the audio signal(s) 606 under analysis with labels 608)).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Sinkov to incorporate the teaching of Diamant. This would make it convenient to coordinate the conference by providing a transcript of the conference to conference participants for subsequent review, tracking arrivals and departures of conference participants, providing cues to conference participants during the conference, and/or analyzing the information in order to summarize one or more aspects of the conference for subsequent review (Parag. [0024]).

Claim 5. 	Sinkov in view of Diamant discloses the computer-implemented method of claim 1,  
Sinkov does not explicitly disclose further comprising: receiving video data from th
However, Diamant discloses:   
receiving video data from th(Parag. [0004], Parag. [0023], and Parag. [0031]; (The art teaches receiving a digital video and a computer-readable audio signal. A face recognition machine is operated to recognize a face of a first conference participant in the digital video, and a speech recognition machine is operated to translate the computer-readable audio signal into a first text. The art teaches that computerized conference assistant 106 includes a face location machine 124 and a face identification machine 126. As shown in FIG. 4, face location machine 124 is configured to find candidate faces 166 in digital video 114. As an example, FIG. 4 shows face location machine 124 finding candidate FACE(1) at 23°, candidate FACE(2) at 178°, and candidate FACE(3) at 303°)); and 
analyzing the video content to identify the first user, wherein associating the first user with the segment of the audio content is further based on analyzing the video content to identify the first user (Parag. [0032] and Parag. [0052-0053]; (The art teaches that face identification machine 126 optionally may be configured to determine an identity 168 of each candidate face 166 by analyzing just the portions of the digital video 114 where candidate faces 166 have been found. In other implementations, the face location step may be omitted, and the face identification machine may analyze a larger portion of the digital video 114 to identify faces. FIG. 5 shows an example in which face identification machine 126 identifies candidate FACE(1) as “Bob,” candidate FACE(2) as “Charlie,” and candidate FACE(3) as “Alice.” While not shown, each identity 168 may have an associated confidence value, and two or more identities 168 having different confidence values may be found for the same face (e.g., Bob (88%), Bert (33%)). Further, the art teaches that FIG. 7 is a visual representation of an example output of diarization machine. In FIG. 6, a vertical axis is used to denote WHO (e.g., Bob) is speaking; the horizontal axis denotes WHEN (e.g., 30.01 s-34.87 s) that speaker is speaking; and the depth axis denotes from WHERE (e.g., 23°) that speaker is speaking. Diarization machine 132 uses this WHO/WHEN/WHERE information to label corresponding segments 604 of the audio signal(s) 606 under analysis with labels 608)).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Sinkov to incorporate the teaching of Diamant. This would make it convenient to coordinate the conference by providing a transcript of the conference to conference participants for subsequent review, tracking arrivals and departures of conference participants, providing cues to conference participants during the conference, and/or analyzing the information in order to summarize one or more aspects of the conference for subsequent review (Parag. [0024]).

Claim 6. 	Sinkov in view of Diamant discloses the computer-implemented method of claim 5,  
Sinkov does not explicitly disclose wherein analyzing the video content to identify the first user comprises utilizing a facial recognition model to determine an identity of the first user based on the video content.
However, Diamant discloses wherein analyzing the video content to identify the first user comprises utilizing a facial recognition model to determine an identity of the first user based on the video content (Parag. [0004], Parag. [0023], and Parag. [0031]; (The art teaches receiving a digital video and a computer-readable audio signal. A face recognition machine is operated to recognize a face of a first conference participant in the digital video, and a speech recognition machine is operated to translate the computer-readable audio signal into a first text. The art teaches that computerized conference assistant 106 includes a face location machine 124 and a face identification machine 126. As shown in FIG. 4, face location machine 124 is configured to find candidate faces 166 in digital video 114. As an example, FIG. 4 shows face location machine 124 finding candidate FACE(1) at 23°, candidate FACE(2) at 178°, and candidate FACE(3) at 303°)). 


Claim 7. 	Sinkov in view of Diamant discloses the computer-implemented method of claim 1,   
Sinkov does not explicitly disclose wherein the digital meeting item comprises at least one of: a meeting transcript of the audio content associated with the meeting; a participation report comprising participation details corresponding to one or more users associated with the meeting; an action item; a message; a notification; or a calendar item.
However, Diamant discloses wherein the digital meeting item comprises at least one of: a meeting transcript of the audio content associated with the meeting; a participation report comprising participation details corresponding to one or more users associated with the meeting; an action item; a message; a notification; or a calendar item (Parag. [0004], Parag. [0024], Parag. [0082], Parag. [0109], and Parag. [0138]; (The art teaches that the conference transcript can be used by participants for reviewing various multi-modal interactions and other events of interest that happened in the conference. The conference transcript can be analyzed to provide conference participants with feedback regarding their own participation in the conference, other participants, and team/organizational trends. The art also teaches that the transcript may be analyzed using any suitable machine learning (ML) and/or artificial intelligence (AI) techniques, wherein such analysis may include, for raw audio observed during a conference, recognizing text corresponding to the raw audio, and recognizing one or more salient features of the text and/or raw audio. Non-limiting examples of salient features that may be recognized by ML and/or AI techniques include 1) an intent (e.g., an intended task of a conference participant), 2) a context (e.g., a task currently being performed by a conference participant), 3) a topic and/or 4) an action item or commitment (e.g., a task that a conference participant promises to perform) (i.e., meeting item). Also, in an example, the art teaches that the machine learning classifier may be configured to receive any other suitable transcript data automatically recorded at 211, e.g., transcribed speech audio in the form of text.
The transcription machine may be configured to analyze the transcript to detect words having a predefined sentiment (e.g., positive, negative, “happy”, or any other suitable sentiment), in order to present a sentiment analysis summary at a companion device of a conference participant, indicating a frequency of utterance of words having the predefined sentiment)).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Sinkov to incorporate the teaching of Diamant. This would make it convenient to coordinate the conference by providing a transcript of the conference to conference participants for subsequent review, tracking arrivals and departures of conference participants, providing cues to conference participants during the conference, and/or analyzing the information in order to summarize one or more aspects of the conference for subsequent review (Parag. [0024]).

Claim 8. 	Sinkov in view of Diamant discloses the computer-implemented method of claim 7,  
Sinkov does not explicitly disclose wherein: the digital meeting item comprises the meeting transcript; and associating the digital meeting item with the first user comprises: generating an identification tag corresponding to the first user; and modifying the meeting transcript by associating the identification tag with the segment of the audio content.
However, Diamant discloses wherein:  
the digital meeting item comprises the meeting transcript; and associating the digital meeting item with the first user comprises: generating an identification tag corresponding to the first user; and modifying the meeting transcript by associating the identification tag with the segment of the audio content (Parag. [0052-0053], Parag. [0109], Parag. [0115], and Parag. [0138]; (The art teaches that FIG. 7 is a visual representation of an example output of diarization machine. In FIG. 6, a vertical axis is used to denote WHO (e.g., Bob) is speaking; the horizontal axis denotes WHEN (e.g., 30.01 s-34.87 s) that speaker is speaking; and the depth axis denotes from WHERE (e.g., 23°) that speaker is speaking. Diarization machine 132 uses this WHO/WHEN/WHERE information to label corresponding segments 604 of the audio signal(s) 606 under analysis with labels 608. The segments 604 and/or corresponding labels may be output from the diarization machine 132 in any suitable format. The output effectively associates speech with a particular speaker during a conversation among N speakers, and allows the audio signal corresponding to each speech utterance (with WHO/WHEN/WHERE labeling/metadata) to be used for myriad downstream operations. One nonlimiting downstream operation is conversation transcription. The art also teaches that the transcript may be analyzed using any suitable machine learning (ML) and/or artificial intelligence (AI) techniques, wherein such analysis may include, for raw audio observed during a conference, recognizing text corresponding to the raw audio, and recognizing one or more salient features of the text and/or raw audio.
Non-limiting examples of salient features that may be recognized by ML and/or AI techniques include 1) an intent (e.g., an intended task of a conference participant), 2) a context (e.g., a task currently being performed by a conference participant), 3) a topic and/or 4) an action item or commitment (e.g., a task that a conference participant promises to perform) (i.e., meeting item). Also, in an example, the art teaches that the machine learning classifier may be configured to receive any other suitable transcript data automatically recorded at 211, e.g., transcribed speech audio in the form of text. The transcription machine may be configured to analyze the transcript to detect words having a predefined sentiment (e.g., positive, negative, “happy”, or any other suitable sentiment), in order to present a sentiment analysis summary at a companion device of a conference participant (i.e., first user), indicating a frequency of utterance of words having the predefined sentiment)).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Sinkov to incorporate the teaching of Diamant. The modification would be convenient for coordinating the conference by providing a transcript of the conference to conference participants for subsequent review, tracking arrivals and departures of conference participants, providing cues to conference participants during the conference, and/or analyzing the information in order to summarize one or more aspects of the conference for subsequent review (Parag. [0024]).

Claim 9. 	Sinkov in view of Diamant discloses the computer-implemented method of claim 7,   
Sinkov doesn’t explicitly disclose wherein: the digital meeting item comprises the action item including content discussed during the meeting; and associating the digital meeting item with the first user comprises: generating an action item prompt to complete the action item; and providing the action item prompt for display on the first client device. 
However, Diamant discloses that the digital meeting item comprises the action item including content discussed during the meeting; and associating the digital meeting item with the first user comprises: generating an action item prompt to complete the action item; and providing the action item prompt for display on the first client device (Parag. [0052-0053], Parag. [0109], Parag. [0115], and Parag. [0138]; (The art teaches that FIG. 7 is a visual representation of an example output of diarization machine. In FIG. 6, a vertical axis is used to denote WHO (e.g., Bob) is speaking; the horizontal axis denotes WHEN (e.g., 30.01 s-34.87 s) that speaker is speaking; and the depth axis denotes from WHERE (e.g., 23°) that speaker is speaking. Diarization machine 132 uses this WHO/WHEN/WHERE information to label corresponding segments 604 of the audio signal(s) 606 under analysis with labels 608. The segments 604 and/or corresponding labels may be output from the diarization machine 132 in any suitable format. The output effectively associates speech with a particular speaker during a conversation among N speakers, and allows the audio signal corresponding to each speech utterance (with WHO/WHEN/WHERE labeling/metadata) to be used for myriad downstream operations. One nonlimiting downstream operation is conversation transcription. The art also teaches that the transcript may be analyzed using any suitable machine learning (ML) and/or artificial intelligence (AI) techniques, wherein such analysis may include, for raw audio observed during a conference, recognizing text corresponding to the raw audio, and recognizing one or more salient features of the text and/or raw audio. 
Non-limiting examples of salient features that may be recognized by ML and/or AI techniques include 1) an intent (e.g., an intended task of a conference participant), 2) a context (e.g., a task currently being performed by a conference participant), 3) a topic and/or 4) an action item or commitment (e.g., a task that a conference participant promises to perform) (i.e., meeting item))). 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Sinkov to incorporate the teaching of Diamant. The modification would be convenient for coordinating the conference by providing a transcript of the conference to conference participants for subsequent review, tracking arrivals and departures of conference participants, providing cues to conference participants during the conference, and/or analyzing the information in order to summarize one or more aspects of the conference for subsequent review (Parag. [0024]).

Claim 10. 	Sinkov in view of Diamant discloses the computer-implemented method of claim 1,  
Sinkov doesn’t explicitly disclose the computer-implemented method further comprising generating the transcript of the audio content based on at least one of the first set of audio data or the second set of audio data.
However, Diamant discloses further comprising generating the transcript of the audio content based on at least one of the first set of audio data or the second set of audio data (Parag. [0060]; (The art teaches that Labeled and/or partially labelled audio segments may be used to not only determine which of a plurality of N speakers is responsible for an utterance, but also translate the utterance into a textural representation for downstream operations, such as transcription)).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Sinkov to incorporate the teaching of Diamant. The modification would be convenient for coordinating the conference by providing a transcript of the conference to conference participants for subsequent review, tracking arrivals and departures of conference participants, providing cues to conference participants during the conference, and/or analyzing the information in order to summarize one or more aspects of the conference for subsequent review (Parag. [0024]).

Claim 11. 	Sinkov discloses a non-transitory computer readable storage medium comprising instructions that, when executed by at least one processor (Parag. [0009-0010]), cause a computing device to: 
receive a first set of audio data from a first client device, the first set of audio data comprising audio content corresponding to speech from a plurality of participants of a meeting captured in a first plurality of volumes (Parag. [0009] and Parag. [0017-0021]; (The art teaches determining which of a plurality of specific personal audio input audio devices correspond to which specific meeting participants, measuring volume levels at each of the personal audio input devices in response to each of the meeting participants speaking, identifying ;  
receive a second set of audio data from a second client device, the second set of audio data comprising the audio content corresponding to the speech from the plurality of participants of the meeting captured in a second plurality of volumes (Parag. [0009] and Parag. [0017-0021]; (The art teaches determining which of a plurality of specific personal audio input audio devices correspond to which specific meeting participants, measuring volume levels at each of the personal audio input devices in response to each of the meeting participants speaking, identifying that a first particular one of the participants is speaking based on relative volume levels at each of the personal audio input devices. Recording audio information from a meeting may also include simultaneously recording audio input at the first one of the personal audio input audio devices and on the first channel and audio input (i.e., second set of audio data) at the second one of the personal audio input audio devices and on the second channel in response to the first and second meeting participants speaking at the same time. Recording audio information from a meeting may also include filtering the audio input at the first channel and the second channel to separate speech by the first participant from speech by the second participant. Filtering the audio input may be based on a distance related volume weakening coefficient, , 
determine a segment of the audio content contributed by a first user of the first client device by analyzing the first set of audio data and the second set of audio data (Parag. [0017-0021]; (The art teaches recording audio information from a meeting may also include simultaneously recording audio input at the first one of the personal audio input audio devices and on the first channel and audio input at the second one of the personal audio input audio devices and on the second channel in response to the first and second meeting participants speaking at the same time. Recording audio information from a meeting may also include filtering the audio input at the first channel and the second channel to separate speech by the first participant from speech by the second participant. Filtering the audio input may be based on a distance related volume weakening coefficient, signal latency between the personal audio input devices, and/or ambient noise. In the event of double-talk when two or more speakers talk simultaneously for a period of time, the system may initially identify each speaker, and record double-talk on all principal smartphones owned by current speakers. Once the current speaker is identified, a particular smartphone of the speaker is marked by the system as a principal recording device and the system tracks a corresponding fragment of audio recording by that particular smartphone until a sufficiently long pause when the speaker either stopped talking to change the subject or for other reason or until the current speaker is replaced by another speaker. In either case, the fragment is picked by the system and added to the current channel of the speaker)) to:  
determine a primary speaking volume associated with the first client device by comparing the first plurality of volumes and the second plurality of volumes (Parag. [0009] and Parag. [0018-0021]; (The art teaches determining which of a plurality of specific personal audio input audio devices correspond to which specific meeting participants, measuring volume levels at each of the personal audio input devices in response to each of the meeting participants ;  
identify the segment of the audio content based on the primary speaking volume associated with the first client device; and associat(Parag. [0017-0021]; (The art teaches that once the current speaker is identified, a particular smartphone of the speaker is marked by the system as a principal recording device and the system tracks a corresponding fragment of audio recording by that particular smartphone until a sufficiently long pause when the speaker either stopped talking to change the subject or for other reason or until the current speaker is replaced by another speaker. In either case, the fragment is picked by the system and added to the current channel of the speaker. Each channel of each speaker may therefore include subsequent fragments by a single speaker, uninterrupted by others and separated by pauses (such fragments may of course may be merged during post-processing of the meeting recording) or fragments separated in time by audio fragments from other speakers recorded in their channels. In addition to volume characteristics, latency of signal reception and explicit voice ID or audio profile of each speaker, ; and
		analyze a transcript of the segment of the audio content to identify text representing speech from the segment of the audio content (Parag. [0010], Parag. [0025], and Parag. [0040]; (The art teaches that At least a portion of the storyboard may be transcribed using voice-to-text transcription)).



Sinkov doesn’t explicitly disclose analyze a transcript of the audio content comprising text representing the speech from the plurality of participants to identify a subset of text representing speech from the segment of the audio content; generate a digital meeting item based on the subset of text representing the speech from the segment of the audio content; and associate the digital meeting item with the first user based on associating the first user with the segment of the audio content.  
However, Diamant discloses:  
analyze a transcript of the audio content comprising text representing the speech from the plurality of participants to identify a subset of text representing speech from the segment of the audio content (Parag. [0052-0053], Parag. [0109], Parag. [0115], and Parag. [0138]; (The art teaches that FIG. 7 is a visual representation of an example output of diarization machine. In FIG. 6, a vertical axis is used to denote WHO (e.g., Bob) is speaking; the horizontal axis denotes WHEN (e.g., 30.01 s-34.87 s) that speaker is speaking; and the depth axis denotes from WHERE (e.g., 23°) that speaker is speaking. Diarization machine 132 uses this WHO/WHEN/WHERE information to label corresponding segments 604 of the audio signal(s) 606 under analysis with labels 608. The segments 604 and/or corresponding labels may be output from the diarization machine 132 in any suitable format. The output effectively associates speech with a particular speaker during a conversation among N speakers, and allows the audio signal corresponding to each speech utterance (with WHO/WHEN/WHERE labeling/metadata) to be used for myriad downstream operations. One nonlimiting downstream operation is conversation transcription. FIG. 1B teaches a computerized conference 106 may include a speech recognition machine 130. As shown in FIG. 8, the speech recognition machine 130 may be configured to translate an audio signal of recorded speech (e.g., signals 112, beamformed signal 150, signal 606, and/or segments 604) into text 800. The art also teaches that in some examples, transcribed speech and/or speaker identity information may be gathered by computerized intelligent assistant 1300 in real time, in order to build the transcript in real time, and/or in order to provide notifications to conference participants about the transcribed speech in real time. 
In some examples, computerized intelligent assistant 1300 may be configured, for a stream of speech audio captured by a microphone, to identify a current speaker and to analyze the speech audio in order to transcribe speech text, substantially in parallel and/or in real time, so that speaker identity and transcribed speech text may be independently available. Accordingly, computerized intelligent assistant 1300 may be able to provide notifications to the conference participants in real time (e.g., for display at companion devices) indicating that another conference participant is currently speaking and including transcribed speech of the other conference participant, even before the other conference participant has finished speaking. The art also teaches that the transcript may be analyzed using any suitable machine learning (ML) and/or artificial intelligence (AI) techniques, wherein such analysis may include, for raw audio observed during a conference, recognizing text corresponding to the raw audio, and recognizing one or more salient features of the text and/or raw audio. Non-limiting examples of salient features that may be recognized by ML and/or AI techniques include 1) an intent (e.g., an intended task of a conference participant), 2) a context (e.g., a task currently being performed by a conference participant), 3) a topic and/or 4) an action item or commitment (e.g., a task that a conference participant promises to perform)));  
generate a digital meeting item based on the subset of text representing the speech from the segment of the audio content (Parag. [0004], Parag. [0024], Parag. [0082], Parag. [0109], and Parag. [0138]; (The art teaches that the conference transcript can be used by participants for reviewing various multi-modal interactions and other events of interest that happened in the conference. The conference transcript can be analyzed to provide conference participants with feedback regarding their own participation in the conference, other participants, and team/organizational trends. The art also teaches that the transcript may be analyzed using any suitable machine learning (ML) and/or artificial intelligence (AI) techniques, wherein such analysis may include, for raw audio observed during a conference, recognizing text 211, e.g., transcribed speech audio in the form of text. The transcription machine may be configured to analyze the transcript to detect words having a predefined sentiment (e.g., positive, negative, “happy”, or any other suitable sentiment), in order to present a sentiment analysis summary at a companion device of a conference participant, indicating a frequency of utterance of words having the predefined sentiment)); and 
associate the digital meeting item with the first user based on associating the first user with the segment of the audio content (Parag. [0052-0053], Parag. [0109], Parag. [0115], and Parag. [0138]; (The art teaches that FIG. 7 is a visual representation of an example output of diarization machine. In FIG. 6, a vertical axis is used to denote WHO (e.g., Bob) is speaking; the horizontal axis denotes WHEN (e.g., 30.01 s-34.87 s) that speaker is speaking; and the depth axis denotes from WHERE (e.g., 23°) that speaker is speaking. Diarization machine 132 uses this WHO/WHEN/WHERE information to label corresponding segments 604 of the audio signal(s) 606 under analysis with labels 608. The segments 604 and/or corresponding labels may be output from the diarization machine 132 in any suitable format. The output effectively associates speech with a particular speaker during a conversation among N speakers, and allows the audio signal corresponding to each speech utterance (with WHO/WHEN/WHERE labeling/metadata) to be used for myriad downstream operations. One nonlimiting downstream operation is conversation transcription. The art also teaches that the transcript may be analyzed using any suitable machine learning (ML) and/or artificial intelligence (AI) techniques, wherein such analysis may include, for raw audio observed during a conference, recognizing text corresponding to the raw audio, and recognizing one or more salient features of the text and/or raw audio. Non-limiting examples of salient features that may be recognized by ML and/or AI techniques include 1) an intent (e.g., an intended task of a conference participant), 2) a context (e.g., a task currently being performed by a conference participant), 3) a topic and/or 4) an action 211, e.g., transcribed speech audio in the form of text. 
The transcription machine may be configured to analyze the transcript to detect words having a predefined sentiment (e.g., positive, negative, “happy”, or any other suitable sentiment), in order to present a sentiment analysis summary at a companion device of a conference participant (i.e., first user), indicating a frequency of utterance of words having the predefined sentiment)).   
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Sinkov to incorporate the teaching of Diamant. The modification would be convenient for coordinating the conference by providing a transcript of the conference to conference participants for subsequent review, tracking arrivals and departures of conference participants, providing cues to conference participants during the conference, and/or analyzing the information in order to summarize one or more aspects of the conference for subsequent review (Parag. [0024]).
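
For illustration only, the volume-based attribution relied upon from Sinkov (identifying who is speaking from relative volume levels at each personal audio input device, then tracking the fragment recorded by that device) can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of Sinkov or of the claimed invention: the function names, the per-frame volume lists, and the device labels are hypothetical, and the sketch ignores signal latency, voice ID, ambient noise, and double-talk filtering.

```python
from typing import Dict, List, Tuple


def primary_device_per_frame(volumes: Dict[str, List[float]]) -> List[str]:
    """For each time frame, return the device that measured the highest volume.

    Assumption: every device reports one volume value per frame for the same
    meeting audio; ties go to the device listed first.
    """
    devices = list(volumes)
    n_frames = min(len(v) for v in volumes.values())
    return [max(devices, key=lambda d: volumes[d][t]) for t in range(n_frames)]


def segments_for_device(primary: List[str], device: str) -> List[Tuple[int, int]]:
    """Collapse consecutive frames where `device` is loudest into
    (start_frame, end_frame) ranges, with end_frame exclusive."""
    segments: List[Tuple[int, int]] = []
    start = None
    for t, d in enumerate(primary):
        if d == device and start is None:
            start = t
        elif d != device and start is not None:
            segments.append((start, t))
            start = None
    if start is not None:
        segments.append((start, len(primary)))
    return segments


# Hypothetical volume levels measured at two client devices for five frames.
volumes = {
    "device_a": [0.9, 0.8, 0.2, 0.1, 0.7],
    "device_b": [0.3, 0.4, 0.9, 0.8, 0.2],
}
primary = primary_device_per_frame(volumes)
segments_a = segments_for_device(primary, "device_a")
print(primary)
print(segments_a)
```

In this sketch the frames in which "device_a" is loudest are treated as the segment contributed by that device's user; a real system would add the filtering steps described in the cited paragraphs.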

Claim 12. 	Sinkov in view of Diamant discloses the non-transitory computer readable storage medium of claim 11, 
Sinkov further discloses wherein: the first set of audio data further comprises volume data corresponding to th(Parag. [0009] and Parag. [0017-0021]; (The art teaches determining which of a plurality of specific personal audio input audio devices correspond to which specific meeting participants, measuring volume levels at each of the personal audio input devices in response to each of the meeting participants speaking, identifying that a first particular one of the participants is speaking based on relative volume levels at each of the personal audio input devices. Recording audio information from a meeting may also include simultaneously recording audio input (i.e., first set of audio data) at the first one of the personal audio input audio devices and on the first channel and audio input at the second one of the personal audio input audio devices and on the second channel in response to the first and .  
 
Claim 13. 	Sinkov in view of Diamant discloses the non-transitory computer readable storage medium of claim 11,   
Sinkov doesn’t explicitly disclose the non-transitory computer readable storage medium further comprising instructions that, when executed by the at least one processor, cause the computing device to track participation data corresponding to the first user based on the segment of the audio content, wherein the instructions, when executed by the at least one processor, cause the computing device to generate the digital meeting item by generating a participation report based on the participation data.  
However, Diamant discloses instructions that, when executed by the at least one processor, cause the computing device to track participation data corresponding to the first user based on the segment of the audio content, wherein the instructions, when executed by the at least one processor, cause the computing device to generate the digital meeting item by generating a participation report based on the participation data (Parag. [0004], Parag. [0023-0024], Parag. [0082], Parag. [0109], and Parag. [0138]; (The art teaches that the conference transcript can be used by participants for reviewing various multi-modal interactions and other events of interest that happened in the conference. The conference transcript can be analyzed to provide conference participants with feedback regarding their own participation in the conference, other participants, and team/organizational trends. The art also teaches that the transcript may be analyzed using any suitable machine learning (ML) and/or artificial intelligence (AI) techniques, wherein such analysis may include, for raw audio observed during a conference, recognizing text corresponding to the raw audio, and recognizing one or more salient features of the text and/or raw audio. Non-limiting examples of salient features that may be recognized by ML and/or AI techniques include 1) an intent (e.g., an intended task of a 211, e.g., transcribed speech audio in the form of text. The transcription machine may be configured to analyze the transcript to detect words having a predefined sentiment (e.g., positive, negative, “happy”, or any other suitable sentiment), in order to present a sentiment analysis summary at a companion device of a conference participant, indicating a frequency of utterance of words having the predefined sentiment)))). 
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Sinkov to incorporate the teaching of Diamant. The modification would be convenient for coordinating the conference by providing a transcript of the conference to conference participants for subsequent review, tracking arrivals and departures of conference participants, providing cues to conference participants during the conference, and/or analyzing the information in order to summarize one or more aspects of the conference for subsequent review (Parag. [0024]).

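
For illustration only, the sentiment analysis summary cited from Diamant (detecting words having a predefined sentiment and indicating their frequency of utterance per conference participant) can be sketched as follows. This is a hedged sketch, not Diamant's implementation: the word list, the function name, the transcript tuples, and the participant "Alice" are hypothetical.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical set of words treated as having a predefined (positive) sentiment.
POSITIVE_WORDS: Set[str] = {"great", "good", "happy"}


def sentiment_word_counts(
    transcript: Iterable[Tuple[str, str]], sentiment_words: Set[str]
) -> Dict[str, Counter]:
    """Count, per speaker, how often each predefined sentiment word is uttered.

    `transcript` is assumed to be a sequence of (speaker, text) entries that
    already associates each utterance with a speaker.
    """
    counts: Dict[str, Counter] = {}
    for speaker, text in transcript:
        words = re.findall(r"[a-z']+", text.lower())
        hits = Counter(w for w in words if w in sentiment_words)
        counts.setdefault(speaker, Counter()).update(hits)
    return counts


# Hypothetical speaker-attributed transcript entries.
transcript: List[Tuple[str, str]] = [
    ("Bob", "Great work, I am happy with the demo."),
    ("Alice", "Good. I will send the notes."),
    ("Bob", "Good, good."),
]
report = sentiment_word_counts(transcript, POSITIVE_WORDS)
for speaker, counter in report.items():
    print(speaker, dict(counter))
```

The per-speaker counts stand in for a participation report; how such a report is presented at a companion device is outside the scope of this sketch.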
Claim 14. 	Sinkov in view of Diamant discloses the non-transitory computer readable storage medium of claim 13,   
Sinkov doesn’t explicitly disclose wherein the participation data includes at least one of a length of time spoken by the first user or a number of interruptions by the first user. 
However, Diamant discloses wherein the participation data includes at least one of a length of time spoken by the first user or a number of interruptions by the first user (Parag. [0052-0053], Parag. [0109], Parag. [0115], and Parag. [0138]; (The art teaches that FIG. 7 is a visual representation of an example output of diarization machine. In FIG. 6, a vertical axis is used to denote WHO (e.g., Bob) is speaking; the horizontal axis denotes WHEN (e.g., 30.01 s-34.87 s) that speaker is speaking; and the depth axis denotes from WHERE (e.g., 23°) that speaker is speaking)).  
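It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Sinkov to incorporate the teaching of Diamant. The modification would be convenient for coordinating the conference by providing a transcript of the conference to conference participants for subsequent review, tracking arrivals and departures of conference participants, providing cues to conference participants during the conference, and/or analyzing the information in order to summarize one or more aspects of the conference for subsequent review (Parag. [0024]).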
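
For illustration only, the WHO/WHEN labels cited from Diamant could be reduced to the participation data recited in claim 14 (length of time spoken and number of interruptions) along the following lines. This is a sketch under stated assumptions, not a method disclosed by either reference: the segment tuples, the function names, the participant "Alice", and the rule that an interruption is a segment starting while a different speaker's segment is still in progress are all hypothetical.

```python
from typing import Dict, List, Tuple

# (who, start_seconds, end_seconds), mirroring the WHO/WHEN labels.
Segment = Tuple[str, float, float]


def speaking_time(segments: List[Segment]) -> Dict[str, float]:
    """Total length of time spoken per speaker, in seconds."""
    totals: Dict[str, float] = {}
    for who, start, end in segments:
        totals[who] = totals.get(who, 0.0) + (end - start)
    return totals


def interruption_counts(segments: List[Segment]) -> Dict[str, int]:
    """Count interruptions per speaker.

    Assumed rule: a segment counts as one interruption when it starts while
    an earlier-starting segment of a different speaker is still in progress.
    """
    counts: Dict[str, int] = {}
    ordered = sorted(segments, key=lambda s: s[1])
    for i, (who, start, _end) in enumerate(ordered):
        for other, o_start, o_end in ordered[:i]:
            if other != who and o_start < start < o_end:
                counts[who] = counts.get(who, 0) + 1
                break  # count each segment at most once
    return counts


# Hypothetical labeled segments for two speakers.
segments: List[Segment] = [
    ("Bob", 30.0, 35.0),
    ("Alice", 34.0, 40.0),
    ("Bob", 41.0, 43.0),
]
times = speaking_time(segments)
interruptions = interruption_counts(segments)
print(times)
print(interruptions)
```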

Claim 15. 	Sinkov in view of Diamant discloses the non-transitory computer readable storage medium of claim 11,   
Sinkov doesn’t explicitly disclose wherein the instructions, when executed by the at least one processor, cause the computing device to associate the digital meeting item with the first user by providing the digital meeting item for display on the first client device. 
However, Diamant discloses wherein the instructions, when executed by the at least one processor, cause the computing device to associate the digital meeting item with the first user by providing the digital meeting item for display on the first client device (Parag. [0052-0053], Parag. [0109], Parag. [0115], and Parag. [0138]; (The art teaches that FIG. 7 is a visual representation of an example output of diarization machine. In FIG. 6, a vertical axis is used to denote WHO (e.g., Bob) is speaking; the horizontal axis denotes WHEN (e.g., 30.01 s-34.87 s) that speaker is speaking; and the depth axis denotes from WHERE (e.g., 23°) that speaker is speaking. Diarization machine 132 uses this WHO/WHEN/WHERE information to label corresponding segments 604 of the audio signal(s) 606 under analysis with labels 608. The segments 604 and/or corresponding labels may be output from the diarization machine 132 in any suitable format. The output effectively associates speech with a particular speaker during a conversation among N speakers, and allows the audio signal corresponding to each speech utterance (with WHO/WHEN/WHERE labeling/metadata) to be used for myriad downstream operations. One nonlimiting downstream operation is conversation transcription. The art also teaches that the transcript may be analyzed using any suitable machine learning (ML) and/or artificial intelligence (AI) techniques, wherein such analysis may include, for raw audio observed during a conference, recognizing text corresponding to the raw audio, and recognizing one or more salient features of the text and/or raw audio. 
Non-limiting examples of salient features that may be recognized by ML and/or AI techniques include 1) an intent (e.g., an intended task of a conference participant), 2) a context (e.g., a task currently being performed by a conference participant), 3) a topic and/or 4) an action item or commitment (e.g., a task that a conference 211, e.g., transcribed speech audio in the form of text. The transcription machine may be configured to analyze the transcript to detect words having a predefined sentiment (e.g., positive, negative, “happy”, or any other suitable sentiment), in order to present a sentiment analysis summary at a companion device of a conference participant (i.e., first user), indicating a frequency of utterance of words having the predefined sentiment)).  
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Sinkov to incorporate the teaching of Diamant. The modification would be convenient for coordinating the conference by providing a transcript of the conference to conference participants for subsequent review, tracking arrivals and departures of conference participants, providing cues to conference participants during the conference, and/or analyzing the information in order to summarize one or more aspects of the conference for subsequent review (Parag. [0024]).

Claim 16. 	Sinkov in view of Diamant discloses the non-transitory computer readable storage medium of claim 11,   
Sinkov doesn’t explicitly disclose the non-transitory computer readable storage medium further comprising instructions that, when executed by the at least one processor, cause the computing device to receive, from a computer application installed on the first client device, an authentication of the first user generated by submission of one or more login credentials by the first user via the first client device, wherein the instructions, when executed by the at least one processor, cause the computing device to associate the first user with the segment of the audio content further based on the authentication of the first user. 
However, Diamant discloses instructions that, when executed by the at least one processor, cause the computing device to receive, from a computer application installed on the first client device, an authentication of the first user generated by submission of one or more login credentials by the first user via the first client device (Parag. [0095]; (The art teaches that Computerized intelligent assistant may be configured to track the arrival of a remote participant based on the remote participant logging in to a remote conferencing program (e.g., a messaging application, voice and/or video chat application, or any other suitable interface for , wherein the instructions, when executed by the at least one processor, cause the computing device to associate the first user with the segment of the audio content further based on the authentication of the first user (Parag. [0052-0053], Parag. [0109], Parag. [0115], and Parag. [0138]; (The art teaches that FIG. 7 is a visual representation of an example output of diarization machine. In FIG. 6, a vertical axis is used to denote WHO (e.g., Bob) is speaking; the horizontal axis denotes WHEN (e.g., 30.01 s-34.87 s) that speaker is speaking; and the depth axis denotes from WHERE (e.g., 23°) that speaker is speaking. Diarization machine 132 uses this WHO/WHEN/WHERE information to label corresponding segments 604 of the audio signal(s) 606 under analysis with labels 608)).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Sinkov to incorporate the teaching of Diamant. The modification would be convenient for coordinating the conference by providing a transcript of the conference to conference participants for subsequent review, tracking arrivals and departures of conference participants, providing cues to conference participants during the conference, and/or analyzing the information in order to summarize one or more aspects of the conference for subsequent review (Parag. [0024]).

Claim 17. 	Sinkov discloses a system comprising: at least one processor; and a non-transitory computer readable storage medium comprising instructions that, when executed by the at least one processor (Parag. [0009-0010]), cause the system to:  
receive a first set of audio data from a first client device, the first set of audio data comprising audio content corresponding to speech from a plurality of participants of a meeting captured in a first plurality of volumes (Parag. [0009] and Parag. [0017-0021]; (The art teaches determining which of a plurality of specific personal audio input audio devices correspond to which specific meeting participants, measuring volume levels at each of the personal audio input devices in response to each of the meeting participants speaking, identifying that a first particular one of the participants is speaking based on relative volume levels at each of the personal audio input devices. Recording audio information from a meeting may also include simultaneously recording audio input (i.e., first set of audio data) at the first one of the personal audio input audio devices and on the first channel and audio input at the second one of the personal audio input audio devices and on the second channel in response to the first and ;  
receive a second set of audio data from a second client device, the second set of audio data comprising the audio content corresponding to the speech from the plurality of participants of the meeting captured in a second plurality of volumes (Parag. [0009] and Parag. [0017-0021]; (The art teaches determining which of a plurality of specific personal audio input audio devices correspond to which specific meeting participants, measuring volume levels at each of the personal audio input devices in response to each of the meeting participants speaking, identifying that a first particular one of the participants is speaking based on relative volume levels at each of the personal audio input devices. Recording audio information from a meeting may also include simultaneously recording audio input at the first one of the personal audio input audio devices and on the first channel and audio input (i.e., second set of audio data) at the second one of the personal audio input audio devices and on the second channel in response to the first and second meeting participants speaking at the same time. Recording audio information from a meeting may also include filtering the audio input at the first channel and the second channel to separate speech by the first participant from speech by the second participant. Filtering the audio input may be based on a distance related volume weakening coefficient, signal latency between the personal audio input devices, and/or ambient noise. In the event of double-talk when two or more speakers talk simultaneously for a period of time, the system may initially identify each speaker, and record double-talk on all principal smartphones owned by current speakers. After a double talk episode has ended, the system may attempt clearing each recorded fragment from double-talk by non-owners prior to placing it into the corresponding ;   
determine a segment of the audio content contributed by a first user of the first client device by analyzing the first set of audio data and the second set of audio data (Parag. [0017-0021]; (The art teaches recording audio information from a meeting may also include simultaneously recording audio input at the first one of the personal audio input audio devices and on the first channel and audio input at the second one of the personal audio input audio devices and on the second channel in response to the first and second meeting participants speaking at the same time. Recording audio information from a meeting may also include filtering the audio input at the first channel and the second channel to separate speech by the first participant from speech by the second participant. Filtering the audio input may be based on a distance related volume weakening coefficient, signal latency between the personal audio input devices, and/or ambient noise. In the event of double-talk when two or more speakers talk simultaneously for a period of time, the system may initially identify each speaker, and record double-talk on all principal smartphones owned by current speakers. Once the current speaker is identified, a particular smartphone of the speaker is marked by the system as a principal recording device and the system tracks a corresponding fragment of audio recording by that particular smartphone until a sufficiently long pause when the speaker either stopped talking to change the subject or for other reason or until the current speaker is replaced by another speaker. In either case, the fragment is picked by the system and added to the current channel of the speaker)) to: 
determine a primary speaking volume associated with the first client device by comparing the first plurality of volumes and the second plurality of volumes (Parag. [0009] and Parag. [0018-0021]; (The art teaches determining which of a plurality of specific personal audio input audio devices correspond to which specific meeting participants, measuring volume levels at each of the personal audio input devices in response to each of the meeting participants speaking, identifying that a first particular one of the participants is speaking based on relative volume levels at each of the personal audio input devices. Recording audio information from a meeting may also include simultaneously recording audio input at the first one of the personal audio input audio devices and on the first channel and audio input at the second one of the personal audio input audio devices and on the second channel in response to the first and second meeting participants speaking at the same time));
identify the segment of the audio content based on the primary speaking volume associated with the first client device; and associate the segment of the audio content with the first user of the first client device (Parag. [0017-0021]; (The art teaches that once the current speaker is identified, a particular smartphone of the speaker is marked by the system as a principal recording device and the system tracks a corresponding fragment of audio recording by that particular smartphone until a sufficiently long pause when the speaker either stopped talking to change the subject or for other reason or until the current speaker is replaced by another speaker. In either case, the fragment is picked by the system and added to the current channel of the speaker. Each channel of each speaker may therefore include subsequent fragments by a single speaker, uninterrupted by others and separated by pauses (such fragments may of course be merged during post-processing of the meeting recording) or fragments separated in time by audio fragments from other speakers recorded in their channels. In addition to volume characteristics, latency of signal reception and explicit voice ID or audio profile of each speaker, generated by voice identification or voice recognition systems and stored in the system or on individual smartphones, may be used to further verify speaker identity and improve diarization. Note that other smartphones may remain in permanent recording modes at all times and therefore record the audio stream of each speaker, albeit with lower volume and clarity)).
Sinkov doesn’t explicitly disclose analyze a transcript of the audio content comprising text representing the speech from the plurality of participants to identify a subset of text representing speech from the segment of the audio content; generate a digital meeting item based on the subset of text representing the speech from the segment of the audio content; and associate the digital meeting item with the first user based on associating the first user with the segment of the audio content.
However, Diamant discloses:   
analyze a transcript of the audio content comprising text representing the speech from the plurality of participants to identify a subset of text representing speech from the segment of the audio content (Parag. [0052-0053], Parag. [0109], Parag. [0115], and Parag. [0138]; (The art teaches that FIG. 7 is a visual representation of an example output of diarization machine. In FIG. 6, a vertical axis is used to denote WHO (e.g., Bob) is speaking; the horizontal axis denotes WHEN (e.g., 30.01 s-34.87 s) that speaker is speaking; and the depth axis denotes from WHERE (e.g., 23°) that speaker is speaking. Diarization machine 132 uses this WHO/WHEN/WHERE information to label corresponding segments 604 of the audio signal(s) 606 under analysis with labels 608. The segments 604 and/or corresponding labels may be output from the diarization machine 132 in any suitable format. The output effectively associates speech with a particular speaker during a conversation among N speakers, and allows the audio signal corresponding to each speech utterance (with WHO/WHEN/WHERE labeling/metadata) to be used for myriad downstream operations. One nonlimiting downstream operation is conversation transcription. FIG. 1B teaches a computerized conference assistant 106 may include a speech recognition machine 130. As shown in FIG. 8, the speech recognition machine 130 may be configured to translate an audio signal of recorded speech (e.g., signals 112, beamformed signal 150, signal 606, and/or segments 604) into text 800. The art also teaches that in some examples, transcribed speech and/or speaker identity information may be gathered by computerized intelligent assistant 1300 in real time, in order to build the transcript in real time, and/or in order to provide notifications to conference participants about the transcribed speech in real time. 
In some examples, computerized intelligent assistant 1300 may be configured, for a stream of speech audio captured by a microphone, to identify a current speaker 1300 may be able to provide notifications to the conference participants in real time (e.g., for display at companion devices) indicating that another conference participant is currently speaking and including transcribed speech of the other conference participant, even before the other conference participant has finished speaking. The art also teaches that the transcript may be analyzed using any suitable machine learning (ML) and/or artificial intelligence (AI) techniques, wherein such analysis may include, for raw audio observed during a conference, recognizing text corresponding to the raw audio, and recognizing one or more salient features of the text and/or raw audio. Non-limiting examples of salient features that may be recognized by ML and/or AI techniques include 1) an intent (e.g., an intended task of a conference participant), 2) a context (e.g., a task currently being performed by a conference participant), 3) a topic and/or 4) an action item or commitment (e.g., a task that a conference participant promises to perform)));  
generate a digital meeting item based on the subset of text representing the speech from the segment of the audio content (Parag. [0004], Parag. [0024], Parag. [0082], Parag. [0109], and Parag. [0138]; (The art teaches that the conference transcript can be used by participants for reviewing various multi-modal interactions and other events of interest that happened in the conference. The conference transcript can be analyzed to provide conference participants with feedback regarding their own participation in the conference, other participants, and team/organizational trends. The art also teaches that the transcript may be analyzed using any suitable machine learning (ML) and/or artificial intelligence (AI) techniques, wherein such analysis may include, for raw audio observed during a conference, recognizing text corresponding to the raw audio, and recognizing one or more salient features of the text and/or raw audio. Non-limiting examples of salient features that may be recognized by ML and/or AI techniques include 1) an intent (e.g., an intended task of a conference participant), 2) a context (e.g., a task currently being performed by a conference participant), 3) a topic and/or 4) an action item or commitment (e.g., a task that a conference participant promises to perform) (i.e., meeting item). Also, in an example, the art teaches that the machine learning classifier may be configured to receive any other suitable transcript data automatically recorded at 211, e.g., transcribed speech audio in the form of text. The transcription machine may be configured to analyze the transcript to detect words having a predefined sentiment (e.g., positive, negative, “happy”, or any other suitable sentiment), in order to present a sentiment analysis summary at a companion device of a conference participant, indicating a frequency of utterance of words having the predefined sentiment)); and
associate the digital meeting item with the first user based on associating the first user with the segment of the audio content (Parag. [0052-0053], Parag. [0109], Parag. [0115], and Parag. [0138]; (The art teaches that FIG. 7 is a visual representation of an example output of diarization machine. In FIG. 6, a vertical axis is used to denote WHO (e.g., Bob) is speaking; the horizontal axis denotes WHEN (e.g., 30.01 s-34.87 s) that speaker is speaking; and the depth axis denotes from WHERE (e.g., 23°) that speaker is speaking. Diarization machine 132 uses this WHO/WHEN/WHERE information to label corresponding segments 604 of the audio signal(s) 606 under analysis with labels 608. The segments 604 and/or corresponding labels may be output from the diarization machine 132 in any suitable format. The output effectively associates speech with a particular speaker during a conversation among N speakers, and allows the audio signal corresponding to each speech utterance (with WHO/WHEN/WHERE labeling/metadata) to be used for myriad downstream operations. One nonlimiting downstream operation is conversation transcription. The art also teaches that the transcript may be analyzed using any suitable machine learning (ML) and/or artificial intelligence (AI) techniques, wherein such analysis may include, for raw audio observed during a conference, recognizing text corresponding to the raw audio, and recognizing one or more salient features of the text and/or raw audio. Non-limiting examples of salient features that may be recognized by ML and/or AI techniques include 1) an intent (e.g., an intended task of a conference participant), 2) a context (e.g., a task currently being performed by a conference participant), 3) a topic and/or 4) an action item or commitment (e.g., a task that a conference participant promises to perform) (i.e., meeting item). 
Also, in an example, the art teaches that the machine learning classifier may be configured to receive any other suitable transcript data automatically recorded at 211, e.g., transcribed speech audio in the form of text. The transcription machine may be configured to analyze the transcript to detect words having a predefined sentiment (e.g., positive, negative, “happy”, or any other suitable sentiment), in order to present a sentiment analysis summary at a companion device of a conference participant (i.e., first user), indicating a frequency of utterance of words having the predefined sentiment)).  
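For illustration only, the combination mapped above (speaker-labeled WHO/WHEN segments, a transcript, and an action item associated with the speaker of a segment) may be sketched as follows. This is a minimal sketch under stated assumptions and is not the implementation of Sinkov, Diamant, or the present application: the names Segment, Word, text_for_segment, meeting_items, and ACTION_CUES are hypothetical, word-level timestamps are assumed to be available, and a simple keyword match stands in for the ML/AI analysis described in the cited paragraphs.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str   # WHO label produced by diarization
    start: float   # WHEN the segment begins, in seconds
    end: float     # WHEN the segment ends, in seconds


@dataclass
class Word:
    text: str
    time: float    # assumed word-level timestamp, in seconds


# Hypothetical cue phrases standing in for a trained classifier.
ACTION_CUES = ("i will", "i'll")


def text_for_segment(words, segment):
    """Return the subset of transcript text that falls inside the segment."""
    return " ".join(
        w.text for w in words if segment.start <= w.time < segment.end
    )


def meeting_items(words, segments):
    """Create one item per segment whose text contains an action cue,
    and associate it with the speaker already tied to that segment."""
    items = []
    for seg in segments:
        text = text_for_segment(words, seg)
        if any(cue in text.lower() for cue in ACTION_CUES):
            items.append({"owner": seg.speaker, "item": text})
    return items


if __name__ == "__main__":
    words = [Word("I", 30.1), Word("will", 30.5), Word("send", 31.0),
             Word("notes", 31.5), Word("Thanks", 35.2)]
    segments = [Segment("Bob", 30.01, 34.87), Segment("Alice", 34.87, 36.0)]
    print(meeting_items(words, segments))
```

In this sketch the item inherits its owner from the segment label, so the association between user and meeting item follows directly from the earlier association between user and audio segment.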

 
Claim 18. 	Sinkov in view of Diamant discloses the system of claim 17, 
Sinkov doesn’t explicitly disclose the system further comprising instructions that, when executed by the at least one processor, cause the system to generate the transcript of the audio content based on at least one of the first set of audio data or the second set of audio data. 
However, Diamant discloses instructions that, when executed by the at least one processor, cause the system to generate the transcript of the audio content based on at least one of the first set of audio data or the second set of audio data (Parag. [0060]; (The art teaches that labeled and/or partially labelled audio segments may be used to not only determine which of a plurality of N speakers is responsible for an utterance, but also translate the utterance into a textual representation for downstream operations, such as transcription)).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to modify Sinkov to incorporate the teaching of Diamant. Doing so would facilitate coordinating the conference by providing a transcript of the conference to conference participants for subsequent review, tracking arrivals and departures of conference participants, providing cues to conference participants during the conference, and/or analyzing the information in order to summarize one or more aspects of the conference for subsequent review (Diamant, Parag. [0024]).
 
Claim 19. 	Sinkov in view of Diamant discloses the system of claim 18, 
Sinkov further discloses wherein the instructions, when executed by the at least one processor, causes the system to receive the first set of audio data from th(Parag. [0009] and . 

Claim 20. 	Sinkov in view of Diamant discloses the system of claim 17, 
Sinkov further discloses wherein comparing the first plurality of volumes and the second plurality of volumes to determine the primary speaking volume associated with the first client device comprises: determining a second primary speaking volume associated with the second client device, and determining the primary speaking volume associated with the first client device based on the second primary speaking volume associated with the second client device (Parag. [0009] and Parag. [0017-0021]; (The art teaches determining which of a plurality of specific personal audio input audio devices correspond to which specific meeting participants, measuring volume levels at each of the personal audio input devices in response to each of the meeting participants speaking, identifying that a first particular one of the participants is speaking based on relative volume levels at each of the personal audio input devices)).
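For illustration only, the comparison of relative volume levels at two devices may be sketched as follows. This is a minimal sketch under stated assumptions and does not represent the actual implementation of Sinkov or of the present application: the function names are hypothetical, each device is assumed to report one volume level per fixed-length time window, and the "primary speaking volume" of a device is assumed to be its mean level over the windows in which it is at least as loud as the other device.

```python
def dominant_device(volumes_a, volumes_b):
    """Label each time window with the device that recorded the louder level."""
    return ["A" if a >= b else "B" for a, b in zip(volumes_a, volumes_b)]


def primary_speaking_volume(own, other):
    """Mean level of `own` over the windows where it is at least as loud as
    `other` (an assumed definition; ties count toward both devices)."""
    loud = [o for o, x in zip(own, other) if o >= x]
    return sum(loud) / len(loud) if loud else None


def segments_for(labels, device):
    """Collapse consecutive windows labeled `device` into (start, end) window
    index pairs, with `end` exclusive."""
    segs, start = [], None
    for i, lab in enumerate(labels):
        if lab == device and start is None:
            start = i
        elif lab != device and start is not None:
            segs.append((start, i))
            start = None
    if start is not None:
        segs.append((start, len(labels)))
    return segs


if __name__ == "__main__":
    first = [0.9, 0.8, 0.2, 0.1, 0.7]    # levels at the first device
    second = [0.3, 0.4, 0.8, 0.9, 0.2]   # levels at the second device
    labels = dominant_device(first, second)
    print(labels)
    print(segments_for(labels, "A"))
    print(primary_speaking_volume(first, second))
    print(primary_speaking_volume(second, first))
```

Because each device's windows are selected by comparison against the other device, the value computed for the first device necessarily depends on the levels recorded at the second device.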





Conclusion
The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.
Yaghi et al. (US 2015/0003595) – Related art in the area of providing recording of telephone call data (Parag. [0036], distinguishing telephony calls comprising at least one of VoIP or plain old telephone system (POTS) calls from generic audio; processing audio comprising analyzing using a speech-to-text engine comprising at least one of: transcribing audio; translating a language transcription; analyzing an audio transcript for keywords; enabling searches of audio content; or analyzing audio for possible filtering of at least one of unauthorized, or non-consensual recordings).
Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action. Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action.
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ABDELBASST TALIOUA whose telephone number is (571)272-4061.  The examiner can normally be reached on Monday-Thursday 7:30 am - 5:30 pm.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, William Trost can be reached on 571-272-7872.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.


/A.T./Examiner, Art Unit 2442

/WILLIAM G TROST IV/Supervisory Patent Examiner, Art Unit 2442