Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114
A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 7/29/2021 has been entered.

Response to Arguments
Applicant's arguments filed 7/29/2021 have been fully considered but they are not persuasive for the following reasons:
In the last paragraph of page 6, under The Rejection of Claims Under 102, the applicant has amended claim 1 with the addition “wherein at least one of the received audio signals received from the set of multiple distributed devices is from a mobile device of a first user”. Rainisto discloses that the distributed devices can be mobile: “[0026] The term ‘computer’, ‘computing device’, ‘apparatus’ or ‘mobile apparatus’ is used herein to refer to any apparatus … incorporated into many different devices and therefore the terms ‘computer’ and ‘computing device’ each include PCs, servers, mobile telephones (including smart phones), tablet computers, Voice over IP phones, set-top boxes, media players…”. Additionally, Rainisto associates such devices with the user: “[0039] In an embodiment, identifying information may also be read from devices carried by the participants”.
simultaneously, but typically one person, the primary speaker, has the floor. The primary speaker, in case of more than one speaker, may be identified by determining head and eye movements and/or orientation of the participants. In case a single primary speaker is indeterminate, multiple speakers may be simultaneously identified”, and in the following paragraph Rainisto mentions the association of speech with speakers: “[0041] In step 405, the digital text may be associated with the identity of its speaker. Also a point of time of the speech and respective digital text may be detected and obtained”. Shen [0078] states “To compose the final audio/video stream, audio/video fusion module is configured to combine audio and video data using an algorithm based on the diarization result of audio stream while taking into account the intermediate video-based active speaker detection results of the input video streams. The algorithm is based on the co-occurrence of the moments of speaking/non-speaking transitions for the same speaker and change of speakers among audio streams and video streams. An active speaker can be identified using cross-correlation between moments-vectors, assuming in meetings, most of the time only one person speaks. In case that a speaker is never captured by any camera, his/her video may be absent. His/her video may be replaced with other video from other sources, either randomly or following certain rules”.
In paragraph 2 on page 7, the applicant states “there is no teaching in Rainisto that beamforming can be performed using multiple distributed devices.” The claims do not require beamforming to be distributed; in any case, such support is implicit in Rainisto. [0016] recites “In an embodiment, a microphone 203 may be an array microphone capable of beamforming. Beamforming may be used to capture audio from a single speaker from a plurality of speakers”.
The objections raised in paragraph 3 have already been addressed in the first paragraph of page 3 of this Office Action.
In paragraph 4, the applicant states “Shen does not use fusion to identify speakers as claimed”. This was addressed in the Advisory Action and is repeated here. Shen: “[0078] An active speaker can be identified using cross-correlation between moments-vectors, assuming in meetings, most of the time only one person speaks. In case that a speaker is never captured by any camera, his/her video may be absent. His/her video may be replaced with other video from other sources, either randomly or following certain rules.” The statement in that paragraph regarding transcript generation was likewise addressed in the Advisory Action and is repeated here. Shen [0085] recites “In addition, a video-based processing module (e.g., 414, 444) may be a composite module, in which multiple sub-modules can be optionally executed (e.g., FIG. 5). The more such sub-modules are applied, the more meta information (such as users' IDs, users' face expressions, active speaker, etc.) can be obtained. These meta information may be used to annotate the final composed audio/video stream as well as the transcripts, and can be leveraged to better organize the meeting log and provide tags for more efficient meeting log review.” The statement “… Shen only identifies an active speaker and does not use the fusion model to associate respective speech from overlapped speech with specific users as now claimed” is refuted by Shen: “[0020] Some embodiments may utilize an audio/video fusion algorithm to combine matching audio and video signals of the same speaker to compose an output audio/video stream. Some embodiments may link such information as meeting attendees' manual notes and automatically detected meta data to the recorded audio/video content and/or a transcription of the recording generated by automatic speech recognition. In addition, some embodiments may provide a reverse editing feature to support editing of audio/video meeting records.”

In light of the above citations, the examiner does not find the applicant’s arguments persuasive. The rejection of claims 1-20 stands.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Rainisto (US 20170060828 A1) in view of Shen (US 20190341068 A1).

A machine-readable storage device having instructions for execution by a processor ([0027] “Computer executable instructions may be provided using any computer-readable media that are accessible by the device 200”) of a machine to cause the processor to perform operations to perform a method (inherent for device 200);
a processor ([0026] “The term ‘computer’, ‘computing device’, ‘apparatus’ or ‘mobile apparatus’ is used herein to refer to any apparatus with processing capability such that it can execute instructions.”); 
and a memory device coupled to the processor and having a program stored thereon for execution by the processor ([0046] “The methods and functionalities described herein may be performed by software in machine readable form on a tangible storage medium e.g. in the form of a computer program comprising computer program code means adapted to perform all the functions and the steps of any of the methods described herein when the program is run on a computer and where the computer program may be embodied on a computer readable medium”)
to perform operations comprising:
 receiving information streams on a meeting server ([0031] “The multiple cameras 201 and multiple microphones 203 may be coupled to at least one processor 202 and/or at least one storage 204”) from a set of multiple distributed devices included in a meeting ([0031] “… multiple cameras 201 and multiple microphones 203 may be configured throughout a meeting space”);
 receiving audio signals representative of overlapped ([0040] “In a meeting multiple persons may be speaking simultaneously, but typically one person, the primary speaker, has the floor”) speech by at least two users in at least two of the information streams ([0015] “…a wherein at least one of the received audio signals received from the set of multiple distributed devices is from a first mobile device of a first user ([0026] “The term…‘computer’ and ‘computing device’ each include PCs, servers, mobile telephones (including smart phones), tablet computers, Voice over IP phones, set-top boxes, media players…”, and “[0039] In an embodiment, identifying information may also be read from devices carried by the participants”);
receiving at least one video signal of at least one user in the information streams ([0016] “…at least one camera 201 may be configured to capture video of the meeting”);
[[associating specific users with their respective speech in the received audio signals as a function of the received audio and video signals by providing a fusion of the audio signals and video signal to a model to provide audio streams for each user with a user ID]] and 
[[generating a transcript of the meeting with an indication of the specific users associated with the overlapped speech]] 
Rainisto does not teach associating specific users with their respective speech in the received audio signals as a function of the received audio and video signals by providing a fusion of the audio signals and video signal to a model to provide audio streams for each user with a user ID, or generating a transcript of the meeting with an indication of the specific users associated with the overlapped speech.
Shen teaches associating specific users with their respective speech in the received audio signals ([0020] “Some embodiments may utilize an audio/video fusion algorithm to combine matching audio and video signals of the same speaker to compose an output audio/video stream. Some embodiments may link such information as meeting attendees' manual notes and automatically…”) by providing a fusion of the audio signals and video signal to a model to provide audio streams for each user with a user ID ([0087] “The timestamped information (e.g., tags, transcriptions, etc.) may be associated to the final fused audio/video stream and the transcript, through the timestamps”, and [0085] “In addition, a video-based processing module (e.g., 414, 444) may be a composite module, in which multiple sub-modules can be optionally executed (e.g., FIG. 5). The more such sub-modules are applied, the more meta information (such as users' IDs, users' face expressions, active speaker, etc.) can be obtained. These meta information may be used to annotate the final composed audio/video stream as well as the transcripts, and can be leveraged to better organize the meeting log and provide tags for more efficient meeting log review.”); and
generating a transcript of the meeting with an indication of the specific users associated with the overlapped speech ([0085] “In addition, a video-based processing module (e.g., 414, 444) may be a composite module, in which multiple sub-modules can be optionally executed (e.g., FIG. 5). The more such sub-modules are applied, the more meta information (such as users' IDs, users' face expressions, active speaker, etc.) can be obtained. These meta information may be used to annotate the final composed audio/video stream as well as the transcripts, and can be leveraged to better organize the meeting log and provide tags for more efficient meeting log review”, and “[0056] For example, audio processing module 412 may include Na processing channels to process Na audio streams, either in parallel or in series”).
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify the teachings of Rainisto to include the teachings of Shen, 

With respect to claim 2, Rainisto does not teach wherein the multiple distributed devices comprise wireless devices associated with users in the meeting and wherein the model comprises a fusion model.
Shen teaches wherein the multiple distributed devices comprise wireless devices ([0029] “I/O devices 230 may include devices that facilitate the capturing, sending, receiving and consuming of meeting information. I/O devices 230 may include, for example, a camera 232, a microphone 234, a display 238, a keyboard, buttons, switches, a touchscreen panel, and/or a speaker (only camera 232, microphone 234, and display 238 are shown in FIG. 2 for conciseness) … In some embodiments, network interface 236 may include an integrated services digital network (ISDN) card, a cable modem, a satellite modem, or another type of modem used to provide a data communication connection. As another example, network interface 236 may include a local area network (LAN) card to provide a data communication connection to a compatible LAN. Wireless links can also be implemented by client device 200 via I/O devices 230”) associated with users in the meeting and wherein the model comprises a fusion model ([0020] “Some embodiments may utilize an audio/video fusion algorithm to combine matching audio and video signals of the same speaker to compose an output audio/video stream. Some embodiments may link such information as meeting attendees' manual notes and automatically detected meta data to the recorded audio/video content and/or a transcription of the recording generated by automatic speech recognition. In addition, ”) 
It would have been obvious to one of ordinary skill in the art prior to the effective filing date of the invention to modify the teachings of Rainisto to include the teachings of Shen, the motivation being that synchronizing the video of a speaker with audio would help a watcher of the logs to grasp the context of the meeting (Shen [0005]).
With respect to claim 3, Rainisto teaches wherein the first mobile device includes a camera and provides the at least one video signal ([0016] “…at least one camera 201 may be configured to capture video of the meeting.”)
With respect to claim 4, Rainisto teaches wherein the first mobile device processes the at least one video signal provided to identify that a user associated with the first mobile device is speaking ([0019] “The processor 202 may analyze video from the camera 201 and/or audio from the microphone 203 to determine a speaker…”).
With respect to claim 5, Rainisto teaches wherein one of the at least one audio signal that is received from the first mobile device includes a tag identifying the user ([0041] “…a speech recognition profile based upon a speaker's identity may be used for speech to digital 

With respect to claim 6, Rainisto teaches wherein the multiple distributed devices include an ambient device having multiple microphones supported in a fixed configuration, each microphone providing one of the received audio signals ([0016] "... a microphone 203 may be an array microphone capable of beamforming. Beamforming may be used to capture audio from a single speaker or from a plurality of speakers")
With respect to claim 7, Rainisto teaches wherein one of the at least one video signal that is received having a field of view configured to include multiple users in the meeting and provide the at least one video signal ([0016] “…a camera 201 may be a 360° view camera”).
With respect to claim 8, Rainisto teaches wherein the multiple distributed devices include a fixed camera ([0031] “…at least one camera 201 and at least one microphone 203 may be configured so as to be wholly contained within a device 200”) having a view of one or more users ([0017] “A camera 201 may capture a video of participants…”) in the meeting.
With respect to claim 10, Rainisto teaches wherein a fusion model is used on the received audio and video signals to associate the specific user with the speech ([0019] “The processor 202 may analyze video from the camera 201 and/or audio from the microphone 203 to determine a speaker and the speaker's location with respect to other participants.”)  
With respect to claim 11, Rainisto teaches wherein the multiple distributed devices comprise wireless devices ([0026] “… ‘computing device’ each include PCs, servers, mobile telephones…”, Fig. 5, [0034]: “…At least one of the cameras 2011…microphones 2031…may be 
With respect to claim 12, Rainisto teaches wherein a first mobile device includes a camera and provides the at least one video signal ([0016] “…at least one camera 201 may be configured to capture video of the meeting.”)  
With respect to claim 13, Rainisto teaches wherein the first mobile device processes the at least one video signal provided to identify that a user associated with the first mobile device is speaking ([0019] “The processor 202 may analyze video from the camera 201 and/or audio from the microphone 203 to determine a speaker…”)
With respect to claim 14, Rainisto teaches wherein one of the at least one audio signal that is received from the first mobile device includes a tag identifying the user associated with the first mobile device as speaking ([0041] “…a speech recognition profile based upon a speaker's identity may be used for speech to digital text conversion. In step 405, the digital text may be associated with the identity of its speaker.")  
With respect to claim 15, Rainisto teaches wherein the multiple distributed devices include an ambient device having multiple microphones supported in a fixed configuration, each microphone providing one of the received audio signals ([0016] "... a microphone 203 may be an array microphone capable of beamforming. Beamforming may be used to capture audio from a single speaker or from a plurality of speakers.")  
With respect to claim 16, Rainisto teaches wherein one of the at least one video signal that is received having a field of view configured to include multiple users in the meeting and provide the at least one video signal ([0016] “…a camera 201 may be a 360° view camera”).

With respect to claim 18, Rainisto teaches wherein a fusion model is used on the received audio and video signals to associate the specific user with the speech ([0019] “The processor 202 may analyze video from the camera 201 and/or audio from the microphone 203 to determine a speaker and the speaker's location with respect to other participants.”) and wherein the multiple distributed devices comprise wireless devices ([0026] “ ‘computing device’ each include PCs, servers, mobile telephones…”, Fig. 5, [0034]: “…At least one of the cameras 2011…microphones 2031…may be disposed at various locations…”, and [0035]: “…This includes…implementations over a wired or wireless network…”) associated with users in the meeting, wherein a first mobile device includes a camera and provides the at least one video signal ([0016] “…at least one camera 201 may be configured to capture video of the meeting.”), and wherein the first mobile device processes the at least one video signal provided to identify that a user associated with the first mobile device is speaking ([0019] “The processor 202 may analyze video from the camera 201 and/or audio from the microphone 203 to determine a speaker and the speaker's location with respect to other participants.”)
With respect to claim 19, Rainisto teaches wherein one of the at least one audio signal that is received from the first mobile device includes a tag identifying the user associated with the first mobile device as speaking ([0041] “…a speech recognition profile based upon a speaker's identity may be used for speech to digital text conversion. In step 405, the digital text may be associated with the identity of its speaker.")
With respect to claim 20  wherein the multiple distributed devices include an ambient device having multiple microphones supported in a fixed configuration, each microphone 


Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to ATHAR N PASHA whose telephone number is (408)918-7675. The examiner can normally be reached Monday-Thursday and alternate Fridays, 7:30-4:30 PT.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Daniel Washburn, can be reached at (571)272-5551. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.   Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.


/Paras D Shah/Primary Examiner, Art Unit 2659
08/13/2021