Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

DETAILED ACTION

Claim Rejections - 35 USC § 112
The following is a quotation of 35 U.S.C. 112(b):
(b)  CONCLUSION.—The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the inventor or a joint inventor regards as the invention.


The following is a quotation of 35 U.S.C. 112 (pre-AIA), second paragraph:
The specification shall conclude with one or more claims particularly pointing out and distinctly claiming the subject matter which the applicant regards as his invention.


Claims 1-10 and 20 are rejected under 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, as being indefinite for failing to particularly point out and distinctly claim the subject matter which the inventor or a joint inventor (or for applications subject to pre-AIA 35 U.S.C. 112, the applicant), regards as the invention. Claims 1 and 20 recite the limitation "the data structure" in the last and second-to-last clauses, respectively. There is insufficient antecedent basis for this limitation in the claim. Claims 2-10 are rejected at least for dependence on indefinite claim 1.

Claim Rejections - 35 USC § 103

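In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.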
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Kalinli-Akbacak (US 9,031,293 B2, hereinafter Kal) in view of Wang (US 2018/0314689).

In re Claim 1, Kal discloses a method, data structure, etc. to store speaker-resolved language data (Kal: see FIGS. 1A and 1D; col. 2: ll. 9-39; and cols. 2-3: ll. 40-9) in a data structure in a computer system (Kal: see FIGS. 7-8 and cols. 10-13: ll. 49-57), the method comprising: 
receiving audio data recording speech from one or more speakers (Kal: see FIG. 1A: via sensors 102 and cols. 3-4: ll. 10-56); 
converting the audio data into a linguistic representation comprising words or phonemes corresponding to the recorded speech (Kal: see FIG. 1A: linguistic features 111 and cols. 3-4: ll. 10-56); 
receiving non-linguistic data (Kal: see FIG. 1A: context features 105, acoustic features 107, visual features 109, physical features 113) associated with at least one of the one or more speakers (see cols. 3-4: ll. 10-56); 
via a sensor-fusion machine-learning model (Kal: see FIG. 1D and col. 7: ll. 34-40) trained previously for speaker emotion identification (Kal: see cols. 9-10: ll. 17-48), concurrently processing the words or phonemes of the linguistic representation and the non-linguistic data (id., and see FIG. 1D: via 112’) to distinguish one of the one or more speaker emotions (id.); and
committing to the data structure language data associated with the identified speaker emotion and based on the linguistic representation (Kal: see FIGS. 1A and 1D: via determined emotional state 115 and/or change state 110).

Thus Kal teaches a data structure comprising the necessary structure and parameters to perform the claimed subject matter but does not explicitly teach the data structure operative in a system comprising speaker recognition as claimed.

In a related field of endeavor Wang teaches a system and method comprising: receiving audio data recording speech from one or more speakers (Wang: ¶ 61-65, 85, 86; Fig 1: a virtual personal assistant operable to determine user intent, user emotional state and voice characteristics based on analysis of input user speech and other user parameters); 
converting the audio data into a linguistic representation comprising words or phonemes corresponding to the recorded speech (Wang: ¶ 61-65, 85, 86, 213-216; Fig 1: system accepts audio input and determines words therein, vocal characteristics thereof); 
receiving non-linguistic data associated with at least one of the one or more speakers (id.: e.g., the determination of voice biometrics, characteristics, etc., as well as facial expressions, etc.); 
via a sensor-fusion machine-learning model trained previously for speaker identification, concurrently processing the words or phonemes of the linguistic representation and the non-linguistic data to distinguish the at least one of the one or more speakers from among the one or more speakers (Wang: ¶ 111-113, 213-216, 222, 223, 239-242, 247: multi-modal interpretation component identifies a speaker and determines intent thereof, the speaker identification utilizing linguistic components of the speech such as a passphrase as well as non-linguistic components of the speech such as voice biometrics or other voice characteristics based on pre-trained voice recognition models); and 
committing to a data structure language data associated with the identified speaker and based on the linguistic representation (id.: system accesses previously trained recognition models based on an identified, determined, etc. speaker).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the instant application to augment the Kal taught data structure to accommodate the various modalities (facial, voiceprint, body language, ontology, etc. recognition) of the Wang speaker recognition method. The average skilled practitioner would have been motivated to do so for at least the purpose of: optimizing particular signal processing with respect to an identified speaker, iteratively performing speaker diarization, etc. and would have expected only predictable results therefrom.
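
For illustration only, the sequence of steps recited in claim 1 can be sketched as follows. This is a minimal sketch under stated assumptions and is not drawn from Kal, Wang, or the instant application: the names SpeakerProfile, LanguageRecord, LanguageStore, and identify_speaker are hypothetical, and a hand-written score that combines a linguistic cue (word overlap) with a non-linguistic cue (pitch) stands in for a previously trained sensor-fusion model.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SpeakerProfile:
    speaker_id: str
    passphrase_words: set[str]  # hypothetical linguistic cue
    pitch_hz: float             # hypothetical non-linguistic cue


@dataclass
class LanguageRecord:
    speaker_id: str
    words: list[str]


@dataclass
class LanguageStore:
    """Hypothetical data structure holding speaker-resolved language data."""
    records: list[LanguageRecord] = field(default_factory=list)

    def commit(self, record: LanguageRecord) -> None:
        self.records.append(record)


def identify_speaker(words: list[str], pitch_hz: float,
                     profiles: list[SpeakerProfile]) -> str:
    """Toy stand-in for a trained fusion model: both cues feed one score."""
    def score(profile: SpeakerProfile) -> float:
        overlap = len(profile.passphrase_words & set(words))
        pitch_penalty = abs(profile.pitch_hz - pitch_hz) / 100.0
        return overlap - pitch_penalty

    return max(profiles, key=score).speaker_id


profiles = [
    SpeakerProfile("alice", {"open", "sesame"}, 210.0),
    SpeakerProfile("bob", {"hello", "computer"}, 120.0),
]
store = LanguageStore()

# Assume the audio has already been converted to words and a pitch estimate.
words = ["hello", "computer", "play", "music"]
speaker = identify_speaker(words, 125.0, profiles)
store.commit(LanguageRecord(speaker, words))
```

In this sketch the linguistic and non-linguistic inputs are evaluated together in a single scoring step, and only then is the language data committed to the store under the identified speaker.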

In re Claim 2, Kal in view of Wang teaches or suggests wherein the speaker is identified via a sensor-fusion machine-learning system previously trained to process the linguistic representation and another form of input concurrently (Kal: FIG. 1D); (Wang: ¶ 111-113, 213-223, 239-247).

In re Claim 3, Kal in view of Wang teaches or suggests wherein the speaker is identified based on directional microphony (Kal: col. 3: ll. 34-45 and col. 4: ll. 49-56); (Wang: ¶ 111-113, 213-223, 239-247).

In re Claim 4, Kal in view of Wang teaches or suggests wherein the speaker is identified based on a voiceprint (Kal: col. 3: ll. 34-45 and cols. 9-10: ll. 62-48); (Wang: ¶ 111-113, 213-223, 239-247).

In re Claim 5, Kal in view of Wang teaches or suggests wherein identifying the speaker includes storing the voiceprint of the speaker during a calibration phase and matching the stored voiceprint to a post-calibration voiceprint acquired from the audio data (Kal: cols. 9-10: ll. 17-48); (Wang: ¶ 111-113, 202,  213-223, 239-247).
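
For illustration only, the calibration-then-matching arrangement recited in claim 5 can be sketched as below. This is a hypothetical sketch, not the method of either reference: the VoiceprintRegistry class, the fixed-length feature vectors, the cosine-similarity comparison, and the 0.9 threshold are all assumptions chosen for brevity.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class VoiceprintRegistry:
    """Hypothetical store of voiceprints captured during calibration."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold  # assumed acceptance threshold
        self._enrolled = {}

    def enroll(self, speaker_id, voiceprint):
        # Calibration phase: store the speaker's voiceprint.
        self._enrolled[speaker_id] = list(voiceprint)

    def match(self, voiceprint):
        # Post-calibration: return the best enrolled match, or None.
        best_id, best_score = None, self.threshold
        for speaker_id, stored in self._enrolled.items():
            score = cosine_similarity(stored, voiceprint)
            if score >= best_score:
                best_id, best_score = speaker_id, score
        return best_id


registry = VoiceprintRegistry()
registry.enroll("alice", [1.0, 0.0, 0.2])
registry.enroll("bob", [0.0, 1.0, 0.1])

matched = registry.match([0.9, 0.05, 0.25])   # close to alice's stored print
unmatched = registry.match([0.0, 0.0, 1.0])   # close to neither stored print
```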

In re Claim 6, Kal in view of Wang teaches or suggests wherein the speaker is identified based on face recognition (Kal: col. 3: ll. 46-51); (Wang: ¶ 111-113, 202,  213-223, 239-247: a speaker recognition engine includes facial recognition engine 1322).

In re Claim 7, Kal in view of Wang teaches or suggests wherein the speaker is identified based on posture analysis (Kal: col. 3: ll. 46-51: recognition of body position and motion); (Wang: ¶ 308: system includes a pose recognizer).

In re Claim 8, Kal in view of Wang teaches or suggests wherein the speaker is identified based on semantic analysis of the linguistic representation of the recorded speech (Kal: see col. 4: ll. 15-32: linguistic and semantic context analyzed); (Wang: ¶ 111-113, 202, 213-223, 239-247, 302: system identifies a user based on voice biometrics, etc., the biometrics operable in concert with understanding and interpretation components and operable to determine a particular user as well as the constraints on said user in operation of the system). It would have been obvious to one of ordinary skill in the art before the effective filing date of the instant application to include aspects of the Kal and Wang taught semantic analysis within the speaker recognition method. The average skilled practitioner would have been motivated to do so for at least the purpose of user verification, identification, preference and constraint processing, low compute determination of intent, etc. and would have expected only predictable results therefrom.

In re Claim 9, Kal in view of Wang teaches or suggests wherein converting the audio data includes filtering candidate linguistic representations of the recorded speech based on a corpus associated with the identified speaker (Kal: cols. 3-4: ll. 10-56 and cols. 9-10: ll. 62-48; see also col. 13: ll. 53-57: system maintains a dictionary of recognized utterances); (Wang: ¶ 111-113, 137-142, 202, 213-223, 239-247, 302, 348: speech filtered in concert with a corpus of commands operable in concert with user preferences to allow only particular users to issue commands within the corpus, that is, a list of commands is persisted with respect to a user to whom those commands are relevant; the system also persists a user dialog history, domain ontology and corpus by which the system curates the user intent upon the ontology using the corpus).

In re Claim 10, Kal in view of Wang teaches or suggests wherein the audio data is converted to a natural language linguistic representation via a previously-trained natural language machine (Kal: see cols. 3-4: ll. 10-56 and cols. 9-10: ll. 62-48; see also col. 13: ll. 53-57); (Wang: ¶ 73-79, 91, 110-122, etc.: system operates with, and teaches the utility of, well-known trained natural language models for the processing of input user speech).

In re Claim 11, Kal discloses a method, data structure, etc.  to store semantically resolved language data (Kal: see FIGS. 1A and 1D; col. 2: ll. 9-39; and cols. 2-3: ll. 40-9) in a data structure in a computer system (Kal: see FIGS. 7-8 and cols. 10-13: ll. 49-57), the method comprising: 
receiving audio data recording speech from one or more speakers (Kal: see FIG. 1A: via sensors 102 and cols. 3-4: ll. 10-56); 
converting the audio data into a linguistic representation comprising words or phonemes corresponding to the recorded speech (Kal: see FIG. 1A: linguistic features 111 and cols. 3-4: ll. 10-56);  
receiving non-linguistic data (Kal: see FIG. 1A: context features 105, acoustic features 107, visual features 109, physical features 113) associated with at least one of the one or more speakers (see cols. 3-4: ll. 10-56); 
via a sensor-fusion machine-learning model trained previously for topic detection (Kal: col. 2: ll. 32-47; col. 7: ll. 34-61; cols. 9-10: ll. 62-48; col. 13: ll. 53-57: multimodal analysis system uses linguistic cues such as the meaning of words and other contextual topics such as game state, play, etc. to determine, classify, etc. an underlying user emotional state and the relation of the emotional state to the meaning of the utterance), 
concurrently processing the words or phonemes of the linguistic representation and the non-linguistic data (Kal: id. and see also col. 7: ll. 34-40; FIG. 1A, 1D: via 112’) to distinguish one of the one or more speaker emotions based at least in part on linguistic and semantic analysis, representation, etc. (id.); and 
committing to a data structure language data associated with the detected topic and based on the linguistic representation (Kal: id.: via determined emotional state 115 and/or change state 110 and in concert with linguistic and semantic data).

Thus Kal teaches a data structure comprising the necessary structure and parameters to perform the claimed subject matter but does not explicitly teach the data structure operative in a system trained to detect topics and assign corresponding linguistic representations to detected topics as claimed.

In a related field of endeavor Wang teaches a system and method comprising: receiving audio data recording speech from one or more speakers (Wang: ¶ 61-65, 85, 86; Fig 1: a virtual personal assistant operable to determine user intent, user emotional state and voice characteristics based on analysis of input user speech and other user parameters); 
converting the audio data into a linguistic representation comprising words or phonemes corresponding to the recorded speech (Wang: ¶ 61-65, 85, 86, 213-216; Fig 1: system accepts audio input and determines words therein, vocal characteristics thereof); 
receiving non-linguistic data associated with at least one of the one or more speakers (id.: e.g., the determination of voice biometrics, characteristics, etc., as well as facial expressions, etc.); 
for topic identification, detection, etc., concurrently processing the words or phonemes of the linguistic representation and the non-linguistic data to detect a topic corresponding to the linguistic representation (Wang: ¶ 111-115, 157, 213-216, 222, 223, 239-242, 247, 300, 325: multi-modal interpretation component identifies a speaker and determines intent thereof, the determination of intent utilizing linguistic components of the speech such as a passphrase as well as non-linguistic components of the speech such as voice biometrics or other voice characteristics based on pre-trained voice recognition models, the determination of intent encompassing particular topics, learned rules based thereon and user relations thereto); and 
committing to a data structure language data associated with the detected topic and based on the linguistic representation (id.: system accesses previously trained recognition models with respect to an identified or determined intent, topics, etc.).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the instant application to augment the Kal taught data structure to accommodate the various modalities (facial, voiceprint, body language, ontology, etc. recognition) of the Wang system and method including the determination of intent, topic, etc. The average skilled practitioner would have been motivated to do so for at least the purpose of: optimizing particular signal processing with respect to an identified speaker, iteratively performing speaker diarization, etc. and would have expected only predictable results therefrom.

In re Claim 12, Kal in view of Wang teaches or suggests wherein converting the audio data includes filtering based on semantic comparison of the linguistic representation against the detected topic (Kal: cols. 3-4: ll. 10-56 and cols. 9-10: ll. 62-48; see also col. 13: ll. 53-57); (Wang: ¶ 111-115, 157, 213-216, 222, 223, 239-242, 247, 300, 325).

In re Claim 13, Kal in view of Wang teaches or suggests wherein the topic is detected in a trained machine-learning module by semantic analysis of the linguistic representation (Kal: see col. 4: ll. 15-32: linguistic and semantic context analyzed); (Wang: ¶ 111-113, 202, 213-223, 239-247, 302: system identifies a user based on voice biometrics, etc., the biometrics operable in concert with understanding and interpretation components and operable to determine a particular user as well as the constraints on said user in operation of the system).

In re Claim 14, Kal in view of Wang teaches or suggests further comprising identifying a speech target corresponding to the linguistic representation (Kal: cols. 3-4: ll. 10-56 and cols. 9-10: ll. 62-48); (Wang: ¶ 63, 199, 201, etc.). It would have been obvious to one of ordinary skill in the art before the effective filing date of the instant application to augment the Kal in view of Wang system, method, etc. to include the Wang taught method to resolve a particular target such as that determined using the linguistic, image, etc. analysis of Wang. The average skilled practitioner would have been motivated to do so for the purpose of combining known elements to achieve known results and would have expected only predictable results therefrom.

In re Claim 15, Kal in view of Wang teaches or suggests wherein the speech target is identified based on posture analysis (Kal: col. 3: ll. 46-51: recognition of body position and motion); (Wang: ¶ 63, 199, 201, etc.: the system operates to perform pose analysis and image analysis to determine a potential target). It would have been obvious to one of ordinary skill in the art before the effective filing date of the instant application to utilize the Wang taught posture and pose analyses to resolve a particular target such as that determined using the image analysis of Wang. The average skilled practitioner would have been motivated to do so for the purpose of combining known elements to achieve known results and would have expected only predictable results therefrom.

In re Claim 16, Kal in view of Wang teaches or suggests wherein the speech target is identified based on facial recognition (Kal: see col. 3: ll. 46-51); (Wang: ¶ 111-113, 202, 213-223, 239-247: a speaker recognition engine includes facial recognition engine 1322). Please see claim 15 supra; the claim is rejected based on a rationale similar to that expressed with respect to claim 15.

In re Claim 17, Kal in view of Wang teaches or suggests wherein the speech target includes the computer system (Kal: FIGS. 7-8 and cols. 10-13: ll. 49-57); (Wang: ¶ 111-115, 157, 213-216, 222, 223, 239-242, 247, 300, 325: the speech target determination includes the computer system as well as indicating the computer system of a target, such as the phone thereof). It would have been obvious to one of ordinary skill in the art before the effective filing date of the instant application to utilize the Wang taught computer system on a plurality of computer systems of a plurality of users of the Kal in view of Wang system and method. The average skilled practitioner would have been motivated to do so for the purpose of  distributing tools whereby users might communicate over distances, and/or across geographical, language, contextual, etc. diversities and would have expected only predictable results therefrom.

In re Claim 18, Kal in view of Wang teaches or suggests further comprising backfilling previously unresolved linguistic elements of the data structure based on the identified speech target (Kal: col. 3: ll. 10-33, col. 4: ll. 15-32, cols. 9-10: ll. 17-48, col. 13: ll. 53-57: the system maintains a dictionary based on recognized and/or disambiguated speech); (Wang: ¶ 403, 416, 426, etc.: system resolves ambiguities, requests disambiguation, etc.). It would have been obvious to one of ordinary skill in the art before the effective filing date of the instant application to utilize the Kal or Wang taught methods of disambiguation to maintain the Kal in view of Wang dictionary, data structure, etc. The average skilled practitioner would have been motivated to do so for the purpose of backfilling a dictionary, data structure, etc. based on learned words, contexts, etc. and would have expected predictable results therefrom.

In re Claim 19, Kal in view of Wang teaches or suggests further comprising backfilling previously unresolved linguistic elements of the data structure based on the detected topic (Kal: see col. 3: ll. 10-33, col. 4: ll. 15-32, cols. 9-10: ll. 17-48, col. 13: ll. 53-57); (Wang: ¶ 111-115, 157, 213-216, 222, 223, 239-242, 247, 300, 325, 403, 416, 426). Please see claim 18 supra; the claim is rejected based on a rationale similar to that expressed with respect to claim 18.

In re Claim 20, Kal discloses a method, data structure, etc. to store semantically resolved language data (Kal: see FIGS. 1A and 1D; col. 2: ll. 9-39; and cols. 2-3: ll. 40-9) in a data structure in a computer system (Kal: see FIGS. 7-8 and cols. 10-13: ll. 49-57: one or more context features 105, acoustic features 107, visual features 109, linguistic features 111, and physical features 113 of a user may be derived from signals obtained by one or more sensors 102 to determine an emotional state of the user(s) during issue of instructions to a computer and to augment the response of the computer thereby), and thereby to execute computer-actionable directives in concert with information conveyed in human speech (Kal: FIGS. 1A, 1D; cols. 3-4: ll. 10-56; cols. 9-10: ll. 17-48), the method comprising: 
receiving audio data recording speech from one or more speakers (Kal: see FIG. 1A: via sensors 102 and cols. 3-4: ll. 10-56); 
converting the audio data into a linguistic representation comprising words or phonemes corresponding to the recorded speech (Kal: see FIG. 1A: linguistic features 111 and cols. 3-4: ll. 10-56);  
receiving non-linguistic data (Kal: see FIG. 1A: context features 105, acoustic features 107, visual features 109, physical features 113) associated with at least one of the one or more speakers (Kal: see cols. 3-4: ll. 10-56); 
detecting a targeted emotion class corresponding to the linguistic representation (Kal: cols. 3-4: ll. 10-56 and cols. 9-10: ll. 62-48) via a sensor-fusion machine-learning model trained previously for target emotion identification (Kal: col. 2: ll. 32-47; col. 7: ll. 34-61; cols. 9-10: ll. 62-48; col. 13: ll. 53-57: multimodal analysis system uses linguistic cues such as the meaning of words and other contextual topics such as game state, play, etc. to determine, classify, etc. an underlying user emotional state and the relation of the emotional state to the meaning of the utterance), 
to identify the target state by concurrently processing the words or phonemes of the linguistic representation and the non-linguistic data (Kal: id. and see also col. 7: ll. 34-40; FIGS. 1A, 1D: via 112’) to identify a target emotion corresponding to the linguistic representation (Kal: col. 2: ll. 32-47; col. 7: ll. 34-61; cols. 9-10: ll. 62-48; col. 13: ll. 53-57); and 
committing to the data structure language data associated with the identified target emotion and based on the linguistic representation (Kal: id.: via determined emotional state 115 and/or change state 110 and in concert with linguistic and semantic data).

Kal does not explicitly teach detecting a target corresponding to the linguistic representation via a sensor-fusion machine-learning model trained previously for target identification; to thereby identify a target corresponding to the linguistic representation; committing to the data structure language data associated with the identified target and based on the linguistic representation; parsing the data structure to identify in the language data one or more of the computer actionable directives actionable by a computer identified as the target; and submitting the one or more directives to the computer for processing.

In a related field of endeavor, Wang teaches a method to execute computer-actionable directives conveyed in human speech (Wang: ¶ 61-65, 85, 86; Fig 1: a virtual personal assistant operable to determine user intent, user emotional state and voice characteristics based on analysis of input user speech and other user parameters and to execute commands therein based thereon), 
the method comprising: receiving audio data recording speech from one or more speakers (Wang: id.); 
converting the audio data into a linguistic representation comprising words or phonemes corresponding to the recorded speech (Wang: ¶ 61-65, 85, 86, 213-216; Fig 1: system accepts audio input and determines words therein, vocal characteristics thereof);
receiving non-linguistic data associated with at least one of the one or more speakers; via a sensor-fusion machine-learning model trained previously for target identification (id.: e.g., the determination of voice biometrics, characteristics, etc., as well as facial expressions, etc.), concurrently processing the words or phonemes of the linguistic representation and the non-linguistic data (Wang: ¶ 111-115, 157, 213-216, 222, 223, 239-242, 247, 300, 325: multi-modal interpretation component identifies a speaker and determines intent thereof, the determination of intent utilizing linguistic components of the speech such as a passphrase as well as non-linguistic components of the speech such as voice biometrics or other voice characteristics based on pre-trained voice recognition models, the determination of intent encompassing particular topics, learned rules based thereon and user relations thereto and operative to instruct the computer to perform particular commands) 
to identify a target corresponding to the linguistic representation (Wang: ¶ 63, 113, 199, 201, etc.: system operates to determine an object, person, or other target of a command intention); 
committing to a data structure language data associated with the identified target and based on the linguistic representation (Wang: ¶ 111-115, 157, 199, 213-216, 222, 223, 239-242, 247, 300, 325: system accesses previously trained recognition models with respect to an identified or determined intent, topics, etc., processes user input language based thereon and updates a structure in memory therewith); parsing the data structure to identify in the language data one or more directives actionable by a computer identified as the target (Wang: ¶ 199, 424, 473: system performs an action with respect to a recognized target and computer associated therewith); and submitting the one or more directives to the computer for processing (id.).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the instant application to augment the Kal taught data structure to accommodate the various modalities (facial, voiceprint, body language, ontology, target, etc. recognition, resolution, etc.) of the Wang system and method including the determination of intent, topic, etc., the determination of a command object or target and the execution of an instruction based thereon. The average skilled practitioner would have been motivated to do so for at least the purpose of: optimizing particular signal processing with respect to executing an intent of an identified speaker, with respect to an object, target, etc. thereof and would have expected only predictable results therefrom.
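
For illustration only, the final two steps recited in claim 20 (parsing the data structure for directives actionable by the computer identified as the target, then submitting them) can be sketched as follows. This is a hypothetical sketch and is not taken from Kal, Wang, or the instant application: the Utterance record, the KNOWN_COMMANDS vocabulary, and the handler callback are assumptions, and the target is assumed to have been identified already.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    speaker_id: str
    target: str   # assumed to be resolved by an earlier identification step
    words: list


# Hypothetical vocabulary of verbs the computer can act on.
KNOWN_COMMANDS = {"play", "stop", "call"}


def parse_directives(records, computer_id):
    """Keep only utterances addressed to the computer that start with a known verb."""
    directives = []
    for record in records:
        if record.target != computer_id or not record.words:
            continue
        verb, args = record.words[0], record.words[1:]
        if verb in KNOWN_COMMANDS:
            directives.append((verb, args))
    return directives


def submit(directives, handler):
    """Pass each directive to the computer's handler and collect the results."""
    return [handler(verb, args) for verb, args in directives]


records = [
    Utterance("alice", "bob", ["call", "me", "later"]),    # addressed to a person
    Utterance("alice", "assistant", ["play", "jazz"]),     # addressed to the computer
    Utterance("bob", "assistant", ["nice", "weather"]),    # not a directive
]
directives = parse_directives(records, "assistant")
results = submit(directives, lambda verb, args: f"{verb}:{' '.join(args)}")
```

Here only the second utterance is both addressed to the computer and phrased as a recognized command, so it is the only directive submitted.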

Conclusion

Any inquiry concerning this communication or earlier communications from the examiner should be directed to PAUL C MCCORD whose telephone number is (571)270-3701. The examiner can normally be reached 730-630 M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.

Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/PAUL C MCCORD/               Primary Examiner, Art Unit 2654