Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

The IDSs filed on 10/23/2020, 09/03/2021, and 06/07/2022 have been received and entered. Claims 1-20 have been cancelled and claims 21-40 have been added; claims 21-40 remain pending.
Please refer to the action below.

Examiner Notes
The claims in this application are given their broadest reasonable interpretation using the plain meaning of the claim language in light of the specification as it would be understood by one of ordinary skill in the art. However, the claimed subject matter, not the specification, is the measure of the invention. 

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

Claims 21-40 are rejected under 35 U.S.C. 103 as being unpatentable over Wexler et al. (US 2022/0021985 A1) in view of Diamant et al. (US 2019/0341050 A1).

     Regarding claim 21, Wexler teaches a method, by a user device (Wexler teaches, in para. 0227-0228 and 0243, a device for recognizing and identifying a user based on acquired facial information, speech data, relation information, and sound source localization information), comprising: receiving an input from a user (para. 0227-0228 and 0243: receiving voice data from a user and nearby users and streamed images of said users, along with relation information, illustrated in para. 0237 and 0375, of the user to possible other users);
obtaining at least one of audio information, video information, and relation information of the user (audio information and video information of para. 0227-0228, and relation information of the user of para. 0227 and 0237);
identifying the user based on the audio information or the video information of the user and a set of facial recognition results and speech recognition results that is correlated with the user (user identification of para. 0227-0228, 0237, and 0299, based on acquired audio information and/or video information of the user and a set of generated facial recognition results and speech recognition results of para. 0227 and 0299; said recognition results or outputs (emphasis added) are obviously indicative of the claimed embeddings that are correlated with the user);
the set of facial recognition results and speech recognition results being generated using a facial embedding model, a speech embedding model, and a sound source localization model (para. 0230 further implies using sound 2020/2421 to localize a source, in para. 0230 and 0265, by an obvious sound source localization method; the set of facial recognition results and speech recognition results is generated, in one case, by the embedding models of para. 0519 or by the implied facial model and speech model of para. 0375, which, as understood, comprise said facial embedding model, speech embedding model, and sound source localization model); and
performing an action based on the input and at least one of identifying the user and the relation information of the user (para. 0227-0228 and 0237: performing at least one of identifying the user and the relation information of the user; and further, in at least para. 0245-0246, performing selective conditioning of identified users' voices).
     Wexler teaches the claimed limitations as illustrated above except for expressly disclosing facial embeddings and speech embeddings that are correlated with the user, said set of facial embeddings and speech embeddings being generated using said facial embedding model, said speech embedding model, and said sound source localization model.
     Diamant teaches, in at least Figs. 1-6, a beamforming model 122 comprising a sound source localization model, a facial embedding model 0124/0126 (further shown in para. 0041 and 0043), and a speech model 128, analogous to the face model, configured to output vector embedding representations. The system of Diamant uses a set of said models to generate said embeddings for recognizing the user, and/or applies said embeddings to said models to recognize the user and the relation information of the user. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Wexler and Diamant to include facial embeddings and speech embeddings that are correlated with the user, said set of facial embeddings and speech embeddings being generated using said facial embedding model, said speech embedding model, and said sound source localization model, as discussed above. Wexler and Diamant are in the same field of endeavor: using machine models to process received user facial images and audio inputs, and generating with said models facial and/or audio output representations or embeddings for recognizing and identifying the above users. Diamant further complements the embedding models of Wexler by using a speech model that generates vector representation embeddings, a facial model, and a source localization model, at least a set of which is used to generate said embeddings for further recognizing users and their relations to other users from acquired audio information, video information, relation information, and source localization data, in order to perform a predetermined action according to known means. The combination yields predictable results, since known work in one field of endeavor may prompt variations of it for use in either the same field or a different one based on design incentives or other market forces if the variations are predictable to one of ordinary skill in the art; the combination is thus the adaptation of an old idea or invention using newer technology that is commonly available and understood in the art, and is thereby a variation on already known art (see MPEP 2143, KSR Exemplary Rationale F).

     Regarding claim 22 (according to claim 21), Wexler further teaches the method further comprising: generating, using the facial embedding model, the facial embeddings (para. 0232 further teaches detected or generated video/image features or representations indicative of facial embeddings, which, as understood, may be generated by at least facial model 2040 of para. 0225-0228; said representations or embeddings, as further illustrated in at least para. 0232 and 0225-0228, may be used by said facial embedding model 2040 for generating said facial embeddings).

     Regarding claim 23 (according to claim 21), Wexler further teaches the method further comprising: generating, using the speech embedding model, the speech embeddings (para. 0519 further teaches embedding models for processing received audio or speech, which obviously may be used for generating said speech embeddings).
PRELIMINARY AMENDMENT, Attorney Docket No.: A255698, Appln. No.: 17/079,111

     Regarding claim 24 (according to claim 21), Wexler further teaches the method further comprising: identifying a label associated with the user (para. 0493 further teaches identifying a label associated with the user); and correlating the label with the user, based on identifying the label (para. 0493).

     Regarding claim 25 (according to claim 24), Wexler is silent regarding the method further comprising: determining a confidence score associated with the label.
     Diamant further teaches labels such as “face 1” in para. 0031, and further teaches, in at least para. 0032, determined confidence values that are obviously associated with said label. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Wexler and Diamant to include determining a confidence score associated with the label, as discussed above. Wexler and Diamant are in the same field of endeavor: using machine models to process received user facial images and audio inputs, and generating with said models facial and/or audio output representations or embeddings for recognizing and identifying the above users. Diamant further complements the embedding models of Wexler by using a speech model that generates vector representation embeddings, a facial model, and a source localization model, together with associated labels and calculated confidence scores, to more accurately identify said users and to further recognize users and their relations to other users from acquired audio information, video information, relation information, and source localization data, in order to perform a predetermined action according to known means. The combination yields predictable results, since known work in one field of endeavor may prompt variations of it for use in either the same field or a different one based on design incentives or other market forces if the variations are predictable to one of ordinary skill in the art; the combination is thus the adaptation of an old idea or invention using newer technology that is commonly available and understood in the art, and is thereby a variation on already known art (see MPEP 2143, KSR Exemplary Rationale F).

     Regarding claim 26 (according to claim 21), Wexler further teaches the method further comprising: generating, using the facial embedding model, a facial embedding of the user, based on the video information of the user (the generated facial features of para. 0225-0228, as understood, comprise facial embeddings, which may obviously be used by the model 2040 for generating said facial embedding of the user based on the video information of the user); and identifying the user, based on comparing the facial embedding of the user and the set of facial embeddings that is correlated with the user (the system of para. 0225-0228 may, as understood, identify said user based on at least the identified facial features, indicative of a facial embedding of the user, and a set of facial output identifications, indicative of the set of facial embeddings, which may further be correlated to identify said user).

     Regarding claim 27 (according to claim 21), Wexler further teaches the method further comprising: generating, using the speech embedding model, a speech embedding of the user, based on the audio information of the user (using the embedding models of para. 0519 for processing at least audio or speech may obviously generate said speech embedding of the user based on the audio information of the user); comparing the speech embedding of the user and the set of speech embeddings that is correlated with the user (the system of para. 0519 may likewise be obviously adapted for comparing said speech embedding of the user and a previous set of speech embeddings that is correlated with the user); and identifying the user, based on comparing the speech embedding of the user and the set of speech embeddings that is correlated with the user (para. 0519).

     Regarding claim 28, Wexler teaches a user device (Wexler teaches, in para. 0227-0228 and 0243, a device for recognizing and identifying a user based on acquired facial information, speech data, relation information, and sound source localization information) comprising: a memory (para. 0016) configured to store instructions; and a processor (para. 0016) configured to execute the instructions to:
receive an input from a user (para. 0227-0228 and 0243: receiving voice data from a user and nearby users and streamed images of said users, along with relation information, illustrated in para. 0237 and 0375, of the user to possible other users);
obtain at least one of audio information, video information, and relation information of the user (audio information and video information of para. 0227-0228, and relation information of the user of para. 0227 and 0237);
identify the user based on the audio information or the video information of the user and a set of facial recognition results and speech recognition results that is correlated with the user (user identification of para. 0227-0228, 0237, and 0299, based on acquired audio information and/or video information of the user and a set of generated facial recognition results and speech recognition results of para. 0227 and 0299; said recognition results or outputs (emphasis added) are obviously indicative of the claimed embeddings that are correlated with the user);
the set of facial recognition results and speech recognition results being generated using a facial embedding model, a speech embedding model, and a sound source localization model (para. 0230 further implies using sound 2020/2421 to localize a source, in para. 0230 and 0265, by an obvious sound source localization method; the set of facial recognition results and speech recognition results is generated, in one case, by the embedding models of para. 0519 or by the implied facial model and speech model of para. 0375, which, as understood, comprise said facial embedding model, speech embedding model, and sound source localization model); and
perform an action based on the input and at least one of identifying the user and the relation information of the user (para. 0227-0228 and 0237: performing at least one of identifying the user and the relation information of the user; and further, in at least para. 0245-0246, performing selective conditioning of identified users' voices).
     Wexler teaches the claimed limitations as illustrated above except for expressly disclosing facial embeddings and speech embeddings that are correlated with the user, said set of facial embeddings and speech embeddings being generated using said facial embedding model, said speech embedding model, and said sound source localization model.
     Diamant teaches, in at least Figs. 1-6, a beamforming model 122 comprising a sound source localization model, a facial embedding model 0124/0126 (further shown in para. 0041 and 0043), and a speech model 128, analogous to the face model, configured to output vector embedding representations. The system of Diamant uses a set of said models to generate said embeddings for recognizing the user, and/or applies said embeddings to said models to recognize the user and the relation information of the user. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Wexler and Diamant to include facial embeddings and speech embeddings that are correlated with the user, said set of facial embeddings and speech embeddings being generated using said facial embedding model, said speech embedding model, and said sound source localization model, as discussed above. Wexler and Diamant are in the same field of endeavor: using machine models to process received user facial images and audio inputs, and generating with said models facial and/or audio output representations or embeddings for recognizing and identifying the above users. Diamant further complements the embedding models of Wexler by using a speech model that generates vector representation embeddings, a facial model, and a source localization model, at least a set of which is used to generate said embeddings for further recognizing users and their relations to other users from acquired audio information, video information, relation information, and source localization data, in order to perform a predetermined action according to known means. The combination yields predictable results, since known work in one field of endeavor may prompt variations of it for use in either the same field or a different one based on design incentives or other market forces if the variations are predictable to one of ordinary skill in the art; the combination is thus the adaptation of an old idea or invention using newer technology that is commonly available and understood in the art, and is thereby a variation on already known art (see MPEP 2143, KSR Exemplary Rationale F).

     Regarding claim 29 (according to claim 28), Wexler further teaches wherein the processor is further configured to: generate, using the facial embedding model, the facial embeddings (para. 0232 further teaches detected or generated video/image features or representations indicative of facial embeddings, which, as understood, may be generated by at least facial model 2040 of para. 0225-0228; said representations or embeddings, as further illustrated in at least para. 0232 and 0225-0228, may be used by said facial embedding model 2040 for generating said facial embeddings).
 
     Regarding claim 30 (according to claim 29), Wexler further teaches wherein the processor is further configured to: generate, using the speech embedding model, the speech embeddings (para. 0519 further teaches embedding models for processing received audio or speech, which obviously may be used for generating said speech embeddings).

     Regarding claim 31 (according to claim 28), Wexler further teaches wherein the processor is further configured to: identify a label associated with the user (para. 0493 further teaches identifying a label associated with the user); and correlate the label with the user, based on identifying the label (para. 0493).

     Regarding claim 32 (according to claim 31), Wexler is silent regarding wherein the processor is further configured to: determine a confidence score associated with the label.
     Diamant further teaches labels such as “face 1” in para. 0031, and further teaches, in at least para. 0032, determined confidence values that are obviously associated with said label. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Wexler and Diamant to include determining a confidence score associated with the label, as discussed above. Wexler and Diamant are in the same field of endeavor: using machine models to process received user facial images and audio inputs, and generating with said models facial and/or audio output representations or embeddings for recognizing and identifying the above users. Diamant further complements the embedding models of Wexler by using a speech model that generates vector representation embeddings, a facial model, and a source localization model, together with associated labels and calculated confidence scores, to more accurately identify said users and to further recognize users and their relations to other users from acquired audio information, video information, relation information, and source localization data, in order to perform a predetermined action according to known means. The combination yields predictable results, since known work in one field of endeavor may prompt variations of it for use in either the same field or a different one based on design incentives or other market forces if the variations are predictable to one of ordinary skill in the art; the combination is thus the adaptation of an old idea or invention using newer technology that is commonly available and understood in the art, and is thereby a variation on already known art (see MPEP 2143, KSR Exemplary Rationale F).

     Regarding claim 33 (according to claim 28), Wexler further teaches wherein the processor is further configured to: generate, using the facial embedding model, a facial embedding of the user, based on the video information of the user (the generated facial features of para. 0225-0228, as understood, comprise facial embeddings, which may obviously be used by the model 2040 for generating said facial embedding of the user based on the video information of the user); and identify the user, based on comparing the facial embedding of the user and the set of facial embeddings that is correlated with the user (the system of para. 0225-0228 may, as understood, identify said user based on at least the identified facial features, indicative of a facial embedding of the user, and a set of facial output identifications, indicative of the set of facial embeddings, which may further be correlated to identify said user).

     Regarding claim 34 (according to claim 27), Wexler further teaches wherein the processor is further configured to: generate, using the speech embedding model, a speech embedding of the user, based on the audio information of the user (using the embedding models of para. 0519 for processing at least audio or speech may obviously generate said speech embedding of the user based on the audio information of the user); compare the speech embedding of the user and the set of speech embeddings that is correlated with the user (the system of para. 0519 may likewise be obviously adapted for comparing said speech embedding of the user and a previous set of speech embeddings that is correlated with the user); and identify the user, based on comparing the speech embedding of the user and the set of speech embeddings that is correlated with the user (para. 0519).

     Regarding claim 35, Wexler teaches, in at least para. 0016, a non-transitory computer-readable medium storing instructions that, when executed, cause at least one processor of a user device (Wexler teaches, in para. 0227-0228 and 0243, a device for recognizing and identifying a user based on acquired facial information, speech data, relation information, and sound source localization information) to: receive an input from a user (para. 0227-0228 and 0243: receiving voice data from a user and nearby users and streamed images of said users, along with relation information, illustrated in para. 0237 and 0375, of the user to possible other users);
obtain at least one of audio information, video information, and relation information of the user (audio information and video information of para. 0227-0228, and relation information of the user of para. 0227 and 0237);
identify the user based on the audio information or the video information of the user and a set of facial recognition results and speech recognition results that is correlated with the user (user identification of para. 0227-0228, 0237, and 0299, based on acquired audio information and/or video information of the user and a set of generated facial recognition results and speech recognition results of para. 0227 and 0299; said recognition results or outputs (emphasis added) are obviously indicative of the claimed embeddings that are correlated with the user);
the set of facial recognition results and speech recognition results being generated using a facial embedding model, a speech embedding model, and a sound source localization model (para. 0230 further implies using sound 2020/2421 to localize a source, in para. 0230 and 0265, by an obvious sound source localization method; the set of facial recognition results and speech recognition results is generated, in one case, by the embedding models of para. 0519 or by the implied facial model and speech model of para. 0375, which, as understood, comprise said facial embedding model, speech embedding model, and sound source localization model); and
perform an action based on the input and at least one of identifying the user and the relation information of the user (para. 0227-0228 and 0237: performing at least one of identifying the user and the relation information of the user; and further, in at least para. 0245-0246, performing selective conditioning of identified users' voices).
     Wexler teaches the claimed limitations as illustrated above except for expressly disclosing facial embeddings and speech embeddings that are correlated with the user, said set of facial embeddings and speech embeddings being generated using said facial embedding model, said speech embedding model, and said sound source localization model.
     Diamant teaches, in at least Figs. 1-6, a beamforming model 122 comprising a sound source localization model, a facial embedding model 0124/0126 (further shown in para. 0041 and 0043), and a speech model 128, analogous to the face model, configured to output vector embedding representations. The system of Diamant uses a set of said models to generate said embeddings for recognizing the user, and/or applies said embeddings to said models to recognize the user and the relation information of the user. It would have been obvious to one of ordinary skill in the art before the effective filing date of the claimed invention to combine the teachings of Wexler and Diamant to include facial embeddings and speech embeddings that are correlated with the user, said set of facial embeddings and speech embeddings being generated using said facial embedding model, said speech embedding model, and said sound source localization model, as discussed above. Wexler and Diamant are in the same field of endeavor: using machine models to process received user facial images and audio inputs, and generating with said models facial and/or audio output representations or embeddings for recognizing and identifying the above users. Diamant further complements the embedding models of Wexler by using a speech model that generates vector representation embeddings, a facial model, and a source localization model, at least a set of which is used to generate said embeddings for further recognizing users and their relations to other users from acquired audio information, video information, relation information, and source localization data, in order to perform a predetermined action according to known means. The combination yields predictable results, since known work in one field of endeavor may prompt variations of it for use in either the same field or a different one based on design incentives or other market forces if the variations are predictable to one of ordinary skill in the art; the combination is thus the adaptation of an old idea or invention using newer technology that is commonly available and understood in the art, and is thereby a variation on already known art (see MPEP 2143, KSR Exemplary Rationale F).

     Regarding claim 36 (according to claim 35), Wexler further teaches wherein the instructions further cause the at least one processor to: generate, using the facial embedding model, the facial embeddings (para. 0232 further teaches detected or generated video/image features or representations indicative of facial embeddings, which, as understood, may be generated by at least facial model 2040 of para. 0225-0228; said representations or embeddings, as further illustrated in at least para. 0232 and 0225-0228, may be used by said facial embedding model 2040 for generating said facial embeddings).

     Regarding claim 37 (according to claim 29), Wexler further teaches wherein the instructions further cause the at least one processor to: generate, using the speech embedding model, the speech embeddings (para. 0519 further teaches embedding models for processing received audio or speech, which obviously may be used for generating said speech embeddings).

     Regarding claim 38 (according to claim 35), Wexler further teaches wherein the instructions further cause the at least one processor to: identify a label associated with the user (para. 0493 further teaches identifying a label associated with the user); and correlate the label with the user, based on identifying the label (para. 0493).

     Regarding claim 39 (according to claim 35), Wexler further teaches wherein the instructions further cause the at least one processor to: generate, using the facial embedding model, a facial embedding of the user, based on the video information of the user (the generated facial features of para. 0225-0228, as understood, comprise facial embeddings, which may obviously be used by the model 2040 for generating said facial embedding of the user based on the video information of the user); and identify the user, based on comparing the facial embedding of the user and the set of facial embeddings that is correlated with the user (the system of para. 0225-0228 may, as understood, identify said user based on at least the identified facial features, indicative of a facial embedding of the user, and a set of facial output identifications, indicative of the set of facial embeddings, which may further be correlated to identify said user).

    
     Regarding claim 40 (according to claim 35), Wexler further teaches wherein the instructions further cause the at least one processor to: generate, using the speech embedding model, a speech embedding of the user, based on the audio information of the user (using the embedding models of para. 0519 for processing at least audio or speech may obviously generate said speech embedding of the user based on the audio information of the user); compare the speech embedding of the user and the set of speech embeddings that is correlated with the user (the system of para. 0519 may likewise be obviously adapted for comparing said speech embedding of the user and a previous set of speech embeddings that is correlated with the user); and identify the user, based on comparing the speech embedding of the user and the set of speech embeddings that is correlated with the user (para. 0519).

Conclusion
     Any inquiry concerning this communication or earlier communications from the examiner should be directed to MARCELLUS AUGUSTIN, whose telephone number is (571) 270-3384. The examiner can normally be reached 9 AM-5 PM.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, BENNY TIEU, can be reached at 571-272-7490. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/MARCELLUS J AUGUSTIN/
Primary Examiner, Art Unit 2674
07/27/2022