Notice of Pre-AIA or AIA Status

The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Continued Examination Under 37 CFR 1.114

A request for continued examination under 37 CFR 1.114, including the fee set forth in 37 CFR 1.17(e), was filed in this application after final rejection.  Since this application is eligible for continued examination under 37 CFR 1.114, and the fee set forth in 37 CFR 1.17(e) has been timely paid, the finality of the previous Office action has been withdrawn pursuant to 37 CFR 1.114.  Applicant's submission filed on 6/28/2022 has been entered.
 
Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.

The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.


Claim(s) 1,2,4-6,9-17,19-21,24-29 are rejected under 35 U.S.C. 103 as being unpatentable over Tangeland (9633270) in view of Kikinis (20030083872).

As per claim 1, Tangeland (9633270) teaches a method for audio-visual multi-speaker speech separation, comprising:
receiving audio signals captured by at least one microphone (as microphone receiving signals – abstract);
receiving video signals captured by at least one camera (as receiving/capturing video – abstract);
providing the audio signals and the video signals to a sync engine configured to:
	derive an audio vector from the audio signals and the video vector from the video signals (as, calculating audio direction and magnitude for the audio – fig. 6; and calculating image direction and angle – fig. 7);
	compute a correlation score by shifting either the audio vector or the video vector and compare the shifted vector against a remaining unshifted vector (as, performing a correlation function between the position of the input audio stream and the position of a captured face – col. 3 lines 46-51; examiner notes that the old and well-known definition of correlation is a shift-multiply-add function of data sets; hence, by definition, the correlation calculation takes a measure between one unshifted data set and a second, shifted data set); and
applying audio-visual separation on the received audio signals and the video signals to provide isolation of sounds from the at least one microphone and the at least one camera (as detecting positions of active audio sources – col. 5 lines 1-5, and then isolating the signal to process, based on speaker position – col. 5 lines 15-27, col. 6 lines 9-25 – controller operates on the sensed audio, which is coming from the active speaker; and as, performing the processing based on whether the talker position and the face position coincide – col. 6 line 66 – col. 7 line 5; wherein the face position is based on an angle to the camera – col. 6 lines 54-63), based on the correlation score by generating an audio output comprising any of:
a time-shifted variant of the audio signal; a time-shifted variant of the video signal (as the correlation score, by definition, measures a signal vs a time-shifted signal).

Tangeland teaches the facial image and audio synchronization as noted above, but does not explicitly teach the audio signal time-shifted to synchronize with lip movement in the video signal; however, Kikinis (20030083872) teaches, in a videoconferencing system (para 0030), tracking image accuracy (via mouth shape – para 0040) and tracking motion points based on lip position (para 0043).  Therefore, it would have been obvious to one of ordinary skill in the art of videoconferencing to improve upon the image attribute tracking of Tangeland (9633270) with lip position and image accuracy, as taught by Kikinis (20030083872), because it would advantageously provide a way to execute image correction, which would give a more accurate result, not only in image accuracy but also in voice/speech recognition accuracy (Kikinis (20030083872) para 0012, para 0042).
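For illustration only, the shift-multiply-add definition of correlation relied upon in the mapping of claim 1 can be sketched as follows. The sketch is not drawn from Tangeland or Kikinis; the function names, the sample tracks, and the lag range are hypothetical, and the two vectors stand in for an audio vector and a video vector of equal sampling rate.

```python
def cross_correlation(unshifted, shifted, max_lag):
    """Score each candidate lag by shifting one vector, multiplying, and adding.

    Returns a dict mapping lag -> correlation score. A positive lag means
    `shifted` trails `unshifted` by that many samples.
    """
    scores = {}
    for lag in range(-max_lag, max_lag + 1):
        total = 0.0
        for i, value in enumerate(unshifted):
            j = i + lag  # index into the shifted vector
            if 0 <= j < len(shifted):
                total += value * shifted[j]  # multiply, then add
        scores[lag] = total
    return scores


def best_lag(unshifted, shifted, max_lag):
    """Return the lag whose score is the peak, i.e. the best 'match'."""
    scores = cross_correlation(unshifted, shifted, max_lag)
    return max(scores, key=scores.get)


# Hypothetical data: a lip-opening track and an audio-energy track that
# carries the same pattern two samples later.
video_vector = [0, 0, 1, 3, 1, 0, 0, 0]
audio_vector = [0, 0, 0, 0, 1, 3, 1, 0]

lag = best_lag(video_vector, audio_vector, max_lag=3)
print(lag)  # 2: the audio trails the video by two samples
```

Under these assumptions, the peak score identifies the shift at which the two data sets coincide, and a time-shifted variant of either signal follows by applying that lag.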

As per claim 2, the combination of Tangeland (9633270) in view of Kikinis (20030083872) teaches the method of claim 1, wherein the at least one microphone includes an array of microphones, and wherein the array of microphones is directed to a specific position in the space based on the angle positions (Tangeland (9633270), as microphone array – col. 2 lines 50-52; in a direction to measure the incoming angle – col. 6 lines 15-27).

As per claim 4, the combination of Tangeland (9633270) in view of Kikinis (20030083872) teaches the method of claim 1, further comprising:
determining, based on gesture recognition, the intention of a speaker to talk (Tangeland (9633270), as, detecting and storing previous facial images and facial positions – col. 11 lines 1-14; col. 9 line 61 – col. 10 line 5, shows closeup image/comparison shows the person is ready to talk, vs a farther/side view).

As per claim 5, the combination of Tangeland (9633270) in view of Kikinis (20030083872) teaches the method of claim 1, further comprising:
separating the audio signals into multiple distinct voice and noise channels by their contents (Tangeland (9633270), as separating signals based on active speaker voice vs background/no noise – col. 4 line 62 – col. 5 line 4).

As per claim 6, the combination of Tangeland (9633270) in view of Kikinis (20030083872) teaches the method of claim 2, further comprising:
applying speech recognition on each separated audio channel (Tangeland (9633270), as analyzing speech characteristics to determine the speaker – col. 5 lines 42-60).

As per claim 9, the combination of Tangeland (9633270) in view of Kikinis (20030083872) teaches the method of claim 1, further comprising performing at least one of: applying echo cancellation on the received audio signals; or applying synchronization correction on the received audio signals (Tangeland (9633270), as tracking time delays to determine the accurate direction of the sound source – col. 6 lines 28-46).
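For illustration only, the relationship between a tracked time delay and the direction of a sound source can be sketched with a two-microphone, far-field model. This model, the speed-of-sound constant, and the function name are assumptions made for the sketch and are not taken from the cited passage of Tangeland.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at roughly 20 degrees C


def arrival_angle_deg(delay_s, mic_spacing_m):
    """Estimate the arrival angle from the delay between two microphones.

    Far-field assumption: sin(angle) = c * delay / spacing, with the angle
    measured from the broadside (perpendicular) direction of the pair.
    """
    ratio = SPEED_OF_SOUND_M_S * delay_s / mic_spacing_m
    ratio = max(-1.0, min(1.0, ratio))  # guard against measurement noise
    return math.degrees(math.asin(ratio))


# Hypothetical measurement: 0.2 m spacing, delay equal to 0.1 m of travel.
angle = arrival_angle_deg(delay_s=0.1 / SPEED_OF_SOUND_M_S, mic_spacing_m=0.2)
print(round(angle, 3))
```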

As per claim 10, the combination of Tangeland (9633270) in view of Kikinis (20030083872) teaches the method of claim 1, wherein applying audio-visual separation further comprises:
detecting faces appearing in the captured video signals; tracking each detected face; and
determining angle positions based on each tracked face, wherein the current speaker is determined by a tracked face (Tangeland (9633270), as determining angle position based on camera to face – col. 6 lines 50-62; and tracking clusters of speakers and a closeup of the face – col. 10 lines 53-62).

As per claim 11, the combination of Tangeland (9633270) in view of Kikinis (20030083872) teaches the method of claim 10, wherein determining angle positions includes analysis of information on at least: image attributes captured by the at least one camera (Tangeland (9633270), image information – via, comparing image to a stored image – col. 10 lines 58-61), intrinsic parameters of the at least one camera (Tangeland (9633270), as closeup vs non-closeup of the camera -- col. 2 lines 25-32), and factors describing the angle and position of the sound source (Tangeland (9633270), and angle/position of the face – col. 6 lines 53-63).

As per claim 12, Tangeland (9633270) teaches the method of claim 11 (as mapped above), but does not explicitly teach wherein the image attributes further include: image coordinates, wherein the image coordinates define at least a position of a set of lips in the captured video, and undistorted image coordinates, wherein the undistorted image coordinates define a position of a set of lips in an undistorted image; however, Kikinis (20030083872) teaches, in a videoconferencing system (para 0030), tracking image accuracy (via mouth shape – para 0040) and tracking motion points based on lip position (para 0043).  Therefore, it would have been obvious to one of ordinary skill in the art of videoconferencing to improve upon the image attribute tracking of Tangeland (9633270) with lip position and image accuracy, as taught by Kikinis (20030083872), because it would advantageously provide a way to execute image correction, which would give a more accurate result, not only in image accuracy but also in voice/speech recognition accuracy (Kikinis (20030083872) para 0012, para 0042).

As per claim 13, the combination of Tangeland (9633270) in view of Kikinis (20030083872), as presented above in claim 12, teaches the method of claim 11 (see claim 11 above, the mapping of the Tangeland reference), further comprising:
converting the image coordinates into world coordinates; and determining a camera mouth angle position in a tracked face relative to the at least one camera based on the world coordinates (Kikinis (20030083872), as generating position data points based on the angle position – para 0043, and the jawline – para 0044, with motion tracking – para 0014; reflecting back on the real-world coordinates of the distance between the face and camera – see Tangeland (9633270) figure 7, figure 8).

As per claim 14, the combination of Tangeland (9633270) in view of Kikinis (20030083872) as presented above in claim 12, teaches the method of claim 11 (see claim 11 above, the mapping of the Tangeland reference), further comprising:
determining a microphone camera angle position of the at least one camera relative to the at least one microphone (Tangeland (9633270) figures 6 and 7) and determining a mouth angle position in a tracked face (Kikinis (20030083872) – mouth position – para 0043, at different angles – para 0053) relative to the at least one microphone based on the camera mouth angle position and the microphone camera angle position (see also, Kikinis considering the camera/microphone position in figure 2, and using Kikinis’ mouth position, and that image, at a particular angle to the microphones as taught in Tangeland – col. 6, line 66 – col. 7 line 19).
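For illustration only, one way the mouth angle relative to the microphone could follow from the camera mouth angle and the microphone camera angle is sketched below in a planar model. The geometry (microphone at the origin, a known camera position and heading, and a known camera-to-mouth range) and all names are assumptions of the sketch; neither Tangeland nor Kikinis is represented as computing the angle this way.

```python
import math


def mouth_angle_from_microphone(camera_xy_m, camera_heading_deg,
                                camera_mouth_angle_deg, mouth_range_m):
    """Return the bearing of the mouth as seen from a microphone at (0, 0).

    camera_xy_m            -- camera position relative to the microphone
    camera_heading_deg     -- camera optical axis, measured from the mic x-axis
    camera_mouth_angle_deg -- mouth angle relative to the camera optical axis
    mouth_range_m          -- assumed distance from the camera to the mouth
    """
    bearing = math.radians(camera_heading_deg + camera_mouth_angle_deg)
    mouth_x = camera_xy_m[0] + mouth_range_m * math.cos(bearing)
    mouth_y = camera_xy_m[1] + mouth_range_m * math.sin(bearing)
    return math.degrees(math.atan2(mouth_y, mouth_x))


# Hypothetical layout: camera 1 m to the side of the microphone, looking
# along the y direction, with the mouth 1 m straight ahead of the camera.
angle = mouth_angle_from_microphone((1.0, 0.0), 90.0, 0.0, 1.0)
print(round(angle, 3))
```

When the camera and microphone are co-located, the result reduces to the sum of the camera heading and the camera mouth angle.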

Claim 15 is a non-transitory computer readable medium claim performing steps that are included in the method claims 1,2,4-6,9-11 above and as such, claim 15 is similar in scope and content to these common claim elements found in claims 1,2,4-6,9-11 above and therefore, claim 15 is rejected under similar rationale as presented against claims 1,2,4-6,9-11 above.  Furthermore, Tangeland (9633270) teaches a computer readable storage medium (col. 4, lines 49-60).

Claims 16,17,19-21, 24-29 are system claims that perform the method steps of claims 1,2,4-6,9-11 above and as such, claims 16,17,19-21, 24-29 are similar in scope and content to method claims 1,2,4-6,9-11 and therefore, claims 16,17,19-21, 24-29 are rejected under similar rationale as presented against claims 1,2,4-6,9-11 above.  Furthermore, Tangeland (9633270) teaches processor, microphones, display (fig. 4, subblocks 444 and associated input/outputs – see MA for microphone array); and cameras – fig. 2, subblocks 112a,b.


Claims 3,18 are rejected under 35 U.S.C. 103 as being unpatentable over Tangeland (9633270) in view of Kikinis (20030083872) in further view of Feng (20180070053).

As per claim 3, the combination of Tangeland (9633270) in view of Kikinis (20030083872) teaches the method of claim 2, as mapped above, but does not explicitly teach beamforming to control the microphone array; however, Feng (20180070053) teaches further comprising:
generating a beamformer control signal to control the aiming direction of the array of microphones (as, in a videoconferencing system tracking the motion of the user – abstract; and in doing so, using beamforming for the microphone array – para 0075).  Therefore, it would have been obvious to one of ordinary skill in the art of audio source tracking to modify the microphone array in the combination of Tangeland (9633270) in view of Kikinis (20030083872) with beamforming weighting, as taught by Feng (20180070053), because it would improve the accuracy of measuring the voice signal based on the physical position of the user, even when the active talker’s head is facing away (Feng (20180070053), para 0075).  The combination of Tangeland (9633270) in view of Kikinis (20030083872) in view of Feng (20180070053) teaches tracking based on facial recognition (Tangeland, col. 11 lines 8-13).
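For illustration only, a beamformer control signal that aims a linear microphone array at an active talker can be sketched as a set of per-microphone steering delays (a delay-and-sum model). The delay-and-sum technique, the linear-array geometry, the sign convention, and the function name are assumptions of the sketch and are not represented as the method of Feng para 0075.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed value for air at roughly 20 degrees C


def steering_delays(mic_positions_m, talker_angle_deg):
    """Return per-microphone delays (seconds) that align a far-field source.

    mic_positions_m  -- microphone positions along the array axis
    talker_angle_deg -- talker direction measured from array broadside
    Delays are offset so that the smallest delay is zero.
    """
    sin_theta = math.sin(math.radians(talker_angle_deg))
    raw = [x * sin_theta / SPEED_OF_SOUND_M_S for x in mic_positions_m]
    earliest = min(raw)
    return [delay - earliest for delay in raw]


# Hypothetical three-element array with 0.1 m spacing, talker at 30 degrees.
print(steering_delays([0.0, 0.1, 0.2], 30.0))
```

A talker directly in front of the array (0 degrees) needs no relative delay, while a talker off to one side yields progressively larger delays across the array.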

Claim 18 is a system claim that performs the method steps of claim 3 above and as such, claim 18 is similar in scope and content to method claim 3 and therefore, claim 18 is rejected under similar rationale as presented against claim 3 above.  Furthermore, Tangeland (9633270) teaches processor, microphones, display (fig. 4, subblocks 444 and associated input/outputs – see MA for microphone array); and cameras – fig. 2, subblocks 112a,b.

Claims 7,8,21,22 are rejected under 35 U.S.C. 103 as being unpatentable over the combination of Tangeland (9633270) in view of Kikinis (20030083872) in view of Kitada (20180365657).

As per claim 7, the combination of Tangeland (9633270) in view of Kikinis (20030083872) teaches the method of claim 6 above (see mapping above), but does not explicitly teach: “further comprising:
applying natural language processing on recognized speech to extract user intention”; however, Kitada (20180365657) teaches receiving/operating on meeting audio/video data/information (abstract, para 0049) using speech recognition/language understanding (para 0074) to derive content/intention – see para 0075, determining from the phrase ‘past meeting’, recognizing the spoken words, and analyzing metadata to determine the actual date/time of the last meeting (para 0075).  Therefore, it would have been obvious to one of ordinary skill in the art of conferencing systems to add to the speech characteristic processing of Tangeland (9633270) speech recognition and context determination, as taught by Kitada (20180365657), because it would advantageously automate meeting details and save time by using speech recognition for the automation (Kitada (20180365657), para 0013).

As per claim 8, the combination of Tangeland (9633270) in view of Kikinis (20030083872) in view of Kitada (20180365657) teaches the method of claim 7, further comprising:
personalizing voice commands based on the separated voice channels (Kitada (20180365657), para 0093, wherein the speech is a voice command).

Claims 21,22 are system claims that perform the method steps of claims 7,8 above and as such, claims 21,22 are similar in scope and content to method claims 7,8 and therefore, claims 21,22 are rejected under similar rationale as presented against claims 7,8 above.  Furthermore, Tangeland (9633270) teaches processor, microphones, display (fig. 4, subblocks 444 and associated input/outputs – see MA for microphone array); and cameras – fig. 2, subblocks 112a,b.

Response to Arguments

Applicant's arguments filed 06/28/2022 have been fully considered but are not persuasive.

As to pp. 8-9 of the response, examiner notes, especially with respect to the recitations to KSR, that the motivation to combine the references comes from the references themselves, as set forth in the rejections above.

On page 10 of the response, applicant reiterates and summarizes portions of the prior art rejection. As to the argument “it is not clear weather Kikinis…lip movement in the video signal”, examiner notes that in para 0043 of Kikinis, during speech (audio), the motion deltas are recorded and tracked with the lip movement of the user.  Examiner further notes that the time synchronization of the audio signal to the video signal is taught by the Tangeland reference; examiner points to the rejection above, which cites a correlation calculation of the Tangeland reference, and notes that, by mathematical definition, a correlation function time-shifts one of the two sets of data relative to the other, multiplies them, and sums the products, and the peak value shows a “match”/correlation between the two data sets.  In summary, the Tangeland reference is used to teach the correlation function (i.e., time shifting) between the audio and video signals, the Kikinis reference is introduced to further teach lip/mouth synchronization with speech/audio signals, and it is the combination of the Tangeland reference in view of the Kikinis reference that meets the current claim scope.

In response to applicant's arguments against the references individually, one cannot show nonobviousness by attacking references individually where the rejections are based on combinations of references.  See In re Keller, 642 F.2d 413, 208 USPQ 871 (CCPA 1981); In re Merck & Co., 800 F.2d 1091, 231 USPQ 375 (Fed. Cir. 1986).

As to the arguments on pp. 10-11 of the response, examiner argues that the beamforming technique in Feng emphasizes/weights the microphones (and hence is ‘steering’ them) in the direction of the active talker (para 0075).

As to applicant's arguments on the last two-thirds of page 11 of the response, examiner argues that the cited para 0075 of Kitada shows the context analysis of “Where did we leave off at the last meeting”, wherein the user's intent is clearly to find out the current discussion, as noted in para 0075, thus showing the ability to discern intent.  Lastly, applicant's arguments provide a generic statement of ‘it is not clear’ without providing a comparison or distinction between the claim scope and the cited sections of the prior art, other than a general allegation – see 37 CFR 1.111(b).
   
Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure.  
As to applicant's claims/spec toward video/audio conference with natural language processing/speech recognition/command recognition:
	Stork et al (5586215) figure 8.
	Huang (20180293988) teaches in a videoconference setting (para 0143), context and command/control speech recognition and understanding (para 0106).
	Ng et al (20170195128) teaches a meeting environment (abstract) wherein voice commands are recognized and utilized (para 0021, 0026).
As to applicant's claims/spec toward video/audio conference with lip/mouth features and angles with respect to the video source:
	Mizumoto (20160064000) teaches videoconferencing (para 0006) using lip detecting/positioning (para 0059) and positions of the mouth (para 0056) to perform sound location of the imaged user (para 0204).

Any inquiry concerning this communication or earlier communications from the examiner should be directed to Michael Opsasnick, telephone number (571)272-7623, who is available Monday-Friday, 9am-5pm. 
If attempts to reach the examiner by telephone are unsuccessful, the examiner's supervisor, Mr. Richemond Dorvil, can be reached at (571)272-7602.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free).

/Michael N Opsasnick/
Primary Examiner, Art Unit 2658
07/12/2022