Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

DETAILED ACTION

Claim Rejections - 35 USC § 112
Applicant’s amendments to claims 1 and 20, filed 6/15/22, suffice to obviate the 35 U.S.C. 112(b) or 35 U.S.C. 112 (pre-AIA), second paragraph, rejection of claims 1-10 and 20.

Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Kalinli-Akbacak (US 9031293, hereinafter Kal) in view of Wang (20180314689) and further in view of Wheeler (10565989, hereinafter Whee).

In re Claim 1, Kal discloses a method, data structure, etc. to store speaker-resolved language data (Kal: see FIGS. 1A-1D; col. 2: ll. 9-39; and cols. 2-3: ll. 40-9) in a data structure in a computer system (Kal: see FIGS. 7-8 and cols. 10-13: ll. 49-57), the method comprising: 
receiving audio data recording speech from one or more speakers (Kal: see FIG. 1A-1D: via sensors 102 and cols. 3-4: ll. 10-56); 
converting the audio data into a linguistic representation comprising words or phonemes corresponding to the recorded speech (Kal: see FIG. 1A: linguistic features 111 and cols. 3-4: ll. 10-56); 
receiving non-linguistic data (Kal: see FIG. 1A: context features 105, acoustic features 107, visual features 109, physical features 113) associated with at least one of the one or more speakers (see cols. 3-4: ll. 10-56); 
via a sensor-fusion (as per ¶ 154, etc. of the instant specification, sensor fusion is considered a cooperative application of plural sensory or contextual inputs; as such, any two of the inputs disclosed by Kal may comprise the recited “sensor-fusion”) machine-learning model (Kal: see Abstract; FIGS. 1A-1D; col. 1, ll. 13-34; col. 3, l. 10 – col. 4, l. 48; col. 7, ll. 34-40; Table I, II: sensor data such as visual, acoustic, etc. data acquired from cameras, microphones, etc., said acquired data operative in context with additional data and generative of attendant features used to identify one or more emotional categories of a particular user state) trained previously for speaker emotion identification (Kal: see cols. 9-10: ll. 17-48), 
concurrently processing the words or phonemes of the linguistic representation and the non-linguistic data (Id., and see FIG. 1D: via 112’) to distinguish one of the one or more speaker emotions (id.); and 
committing language data based on the linguistic representation to a data structure (Kal: see col. 7, ll. 42-64; cols. 9-10, ll. 17-48; col. 13, ll. 15-58; FIGS. 1A-1D; Table I, II: determined emotional state parameters, etc. added to a machine learning model for classification thereof subsequent dimensional reduction and determination, modification, etc. of a user emotional state; additionally, determined state and other input data used to update a data structure).

Thus Kal teaches a data structure comprising the necessary structure and parameters to perform the claimed subject matter but does not explicitly teach committing language data to a data structure, the language data identifying the at least one of the one or more speakers.
 
In a related field of endeavor Wang teaches a system and method comprising: receiving audio data recording speech from one or more speakers (Wang: ¶ 61-65, 85, 86; Fig 1: a virtual personal assistant operable to determine user intent, user emotional state and voice characteristics based on analysis of input user speech and other user parameters); 
converting the audio data into a linguistic representation comprising words or phonemes corresponding to the recorded speech (Wang: ¶ 61-65, 85, 86, 213-216; Fig 1: system accepts audio input and determines words therein, vocal characteristics thereof); 
receiving non-linguistic data associated with at least one of the one or more speakers (id., e.g., the determination of voice biometrics, characteristics, etc. as well as facial expressions, etc.); 
via a sensor-fusion machine learning model trained previously for speaker identification, concurrently processing the words or phonemes of the linguistic representation and the non-linguistic data to distinguish the at least one of the one or more speakers from among the one or more speakers (Wang: ¶ 85-88, 111-115, 213-216, 222, 223, 239-242, 245-256; Fig 18: multi-modal interpretation component identifies a speaker and determines intent thereof, the speaker identification utilizing linguistic components of the speech such as a passphrase as well as non-linguistic components of the speech such as voice biometrics, image input, and/or other characteristics based on pre-trained voice recognition models, the data acquired from a plurality of input devices such as a microphone, camera, or other inputs and operative to ascertain, verify, etc. a user identity by at least the use of stored joint speaker models; the determination of intent utilizing linguistic components of the speech such as speech audio, passphrase, etc. as well as non-linguistic components of the speech such as voice biometrics, statistics and/or other voice characteristics, components, parameters, etc. based on pre-trained voice recognition models, etc., the determination of intent encompassing particular topics, learned rules based thereon and user relations thereto and operative to instruct the computer to perform particular commands); and 
committing language data to a data structure, the language data identifying the at least one of the one or more speakers (id. and ¶ 3, 4, 19: Analyzer 1800 provides, outputs, etc. speaker identification and input speech content to other devices and/or systems by output of a command and speaker determination at 1838, said command and speaker data stored or persisted in the computer system at least in the form of output to at least a buffer, memory, etc.).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the instant application to augment, update, etc. the Kal taught data structure to additionally accommodate the various modalities and determinations (facial, voiceprint, body language, ontology, etc. recognition) of the Wang speaker recognition method. The average skilled practitioner would have been motivated to do so for at least the purpose of: optimizing particular signal processing with respect to an identified speaker, iteratively performing speaker diarization, etc. and would have expected only predictable results therefrom.

Kal in view of Wang can be considered to strongly suggest the recited subject matter inasmuch as the speaker dependent and speaker adapted models of Wang necessarily bear identifying data of the particular speaker upon which the model depends (see at least Wang ¶ 220-223, etc.) and by which a particular model may be invoked based on input data including audio input analysis, video input analysis and/or tactile analysis of a user input, said input data operative to perform command and speaker determinations with regard to a specific identified user (Wang: ¶ 191-198, 237, 252-256, etc.; fig 11, 18, etc.). As such, Kal in view of Wang implicitly associates a speaker dependent and/or speaker adaptive model with a particular identified speaker but does not explicitly teach the committing, storing, saving, etc. to a data structure of language data identifying at least one of the one or more speakers, merely the reading and output of such data if extant. However, ¶ 322 of the instant specification does not require that a particular speaker identity be saved in an object array, array list, database, etc., merely that “In implementations in which the speaker is detected, the language data may be associated with the detected speaker. Here, speech-recognition method 240 may embody a method to store speaker-resolved language data in a data structure of the computer system. In implementations in which the topic is detected, the language data may be associated with the detected topic.” Thus Kal in view of Wang is not considered to explicitly recite recording language data with respect to an identified user in a data structure; Kal in view of Wang at best shows output of speaker recognition data (see Wang: figures 12-15, 18, etc.).

In a related field of endeavor Whee teaches a system and method for determining parameters of a user utterance the determined parameters including determining of a user identity based on recognition processing of the utterance (Whee: Col. 25, ll. 65 – Col. 26, ll. 33), wherein the system operates to store semantically resolved data in a data structure in a computer system, the method comprising: receiving audio data recording speech from one or more speakers (id.: system receives audio data corresponding to a spoken utterance of a user); 
converting the audio data into a linguistic representation comprising words or phonemes corresponding to the recorded speech (see Whee: Col. 25, ll. 65 – Col. 26, ll. 33; figs 12B, 13B, 14A, etc.: server performs ASR, speech processing, etc. on a received utterance, audio data thereof, to determine a user that spoke the utterance, a profile associated with the user, the identity of a device of the user and a user profile based thereon); 
receiving non-linguistic data associated with at least one of the one or more speakers (see Whee: Col. 25, ll. 65 – Col. 26, ll. 33; figs 12A, 13A, 14B, etc.: server receives input audio from a user speech device and additionally receives linguistic data in the form of content text as well as non-linguistic user metadata from an application server); 
and variously utilizing data structures for detecting, determining, communicating, etc. the user speech, text data; user metadata; topic identification data; target computer data, etc. upon, among, etc. the server, application server, user device, etc. (see Whee: Col. 6, ll. 46-56, Col. 13, ll. 32-46, Col. 20, ll. 22-53, Col. 21, ll. 38-Col. 22, ll. 34, etc.; figs 2, 3, 5, 6, 8, etc.).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the instant application to adopt well-known data structures such as those utilized by Whee within the Kal in view of Wang system and method. The average skilled practitioner would have been motivated to do so for at least the purpose of persisting, communicating, etc. determined data such as the Kal emotional state data, Wang speaker recognition and intent data, Whee speaker intent and target computer data etc. and would have expected only predictable results therefrom.

In re Claim 2, Kal in view of Wang in view of Whee teaches or suggests wherein linguistic representation and the non-linguistic data are channels differing in assessed confidence levels, and wherein the sensor-fusion machine-learning model is configured to weight the channels in dependence on the assessed confidence levels (Kal: Col. 9, ll. 16-45; Fig 1D: system determines a confidence level for a plurality of channels comprising linguistic and non-linguistic representations, and weights individual classifiers to maximize performance). The claim is considered obvious over Kal as modified by Wang and Whee as addressed in the base claim, as it would have been obvious to one of ordinary skill in the art before the effective filing date of the instant application to utilize the Kal taught weighting of outputs of the various modules of the Kal, Wang and Whee modified device, method, etc. for at least the purpose of optimizing performance of the system, method, etc. The average skilled practitioner would have expected only predictable results therefrom.
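
For illustration only, the confidence-dependent channel weighting discussed with respect to claim 2 can be pictured as a weighted average of per-channel class scores. The sketch below is a hypothetical example and is not taken from Kal, Wang, Whee or the instant specification; the function names `fuse_channels` and `pick_class`, the channel names and the numeric values are all assumptions.

```python
from typing import Dict, List


def fuse_channels(scores: Dict[str, List[float]],
                  confidence: Dict[str, float]) -> List[float]:
    """Return the confidence-weighted average of per-channel class scores.

    Hypothetical helper: each channel (e.g. linguistic, acoustic) supplies
    one score per class, and channels with a higher assessed confidence
    contribute more to the fused result.
    """
    total = sum(confidence[name] for name in scores)
    if total <= 0:
        raise ValueError("at least one channel needs a positive confidence")
    n_classes = len(next(iter(scores.values())))
    fused = [0.0] * n_classes
    for name, channel_scores in scores.items():
        weight = confidence[name] / total  # normalise so the weights sum to 1
        for i, score in enumerate(channel_scores):
            fused[i] += weight * score
    return fused


def pick_class(fused: List[float]) -> int:
    """Return the index of the highest fused score."""
    return max(range(len(fused)), key=fused.__getitem__)


# Assumed example: two channels that disagree about a two-class decision.
channel_scores = {"linguistic": [0.2, 0.8], "acoustic": [0.9, 0.1]}
trusted_linguistic = fuse_channels(channel_scores,
                                   {"linguistic": 0.9, "acoustic": 0.1})
trusted_acoustic = fuse_channels(channel_scores,
                                 {"linguistic": 0.1, "acoustic": 0.9})
```

In this assumed example the fused decision follows whichever channel carries the higher assessed confidence.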

In re Claim 3, Kal in view of Wang in view of Whee teaches or suggests wherein the speaker is identified based on directional microphony (Kal: col. 3: ll. 34-45 and col. 4: ll. 49-56); (Wang: ¶ 111-113, 213-223, 239-247). The claim is considered obvious over Kal as modified by Wang and Whee as addressed in the base claim as it would have been obvious to apply the further teaching of Kal, Wang, and/or Whee to the modified device of Kal, Wang and Whee.

In re Claim 4, Kal in view of Wang in view of Whee teaches or suggests wherein the speaker is identified based on a voiceprint (Kal: col. 3: ll. 34-45 and cols. 9-10: ll. 62-48); (Wang: ¶ 111-113, 213-223, 239-247). The claim is considered obvious over Kal as modified by Wang and Whee as addressed in the base claim as it would have been obvious to apply the further teaching of Kal, Wang, and/or Whee to the modified device of Kal, Wang and Whee.

In re Claim 5, Kal in view of Wang in view of Whee teaches or suggests wherein identifying the speaker includes storing the voiceprint of the speaker during a calibration phase and matching the stored voiceprint to a post-calibration voiceprint acquired from the audio data (Kal: cols. 9-10: ll. 17-48); (Wang: ¶ 111-113, 202,  213-223, 239-247). The claim is considered obvious over Kal as modified by Wang and Whee as addressed in the base claim as it would have been obvious to apply the further teaching of Kal, Wang, and/or Whee to the modified device of Kal, Wang and Whee.
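
For illustration only, the calibration-phase storage and post-calibration matching of a voiceprint addressed in claim 5 can be pictured as enrolling a feature vector per speaker and later comparing a new vector against the stored ones. The sketch below is a hypothetical example and is not taken from Kal, Wang, Whee or the instant specification; the class `VoiceprintStore`, the use of cosine similarity, the 0.8 threshold and the three-element vectors are all assumptions.

```python
import math
from typing import Dict, List, Optional


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class VoiceprintStore:
    """Hypothetical store: enroll during calibration, match afterwards."""

    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold  # assumed minimum similarity for a match
        self._enrolled: Dict[str, List[float]] = {}

    def enroll(self, speaker_id: str, voiceprint: List[float]) -> None:
        """Calibration phase: persist the speaker's voiceprint."""
        self._enrolled[speaker_id] = list(voiceprint)

    def identify(self, voiceprint: List[float]) -> Optional[str]:
        """Post-calibration: return the best-matching enrolled speaker, if any."""
        best_id: Optional[str] = None
        best_sim = self.threshold
        for speaker_id, stored in self._enrolled.items():
            sim = cosine_similarity(stored, voiceprint)
            if sim >= best_sim:
                best_id, best_sim = speaker_id, sim
        return best_id


# Assumed example vectors standing in for voiceprints.
store = VoiceprintStore()
store.enroll("alice", [1.0, 0.0, 0.0])
store.enroll("bob", [0.0, 1.0, 0.0])
matched = store.identify([0.9, 0.1, 0.0])
unmatched = store.identify([0.0, 0.0, 1.0])
```

A vector that is not sufficiently similar to any enrolled voiceprint yields no identification in this sketch.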

In re Claim 6, Kal in view of Wang in view of Whee teaches or suggests wherein the speaker is identified based on face recognition (Kal: col. 3: ll. 46-51); (Wang: ¶ 111-113, 202,  213-223, 239-247: a speaker recognition engine includes facial recognition engine 1322). The claim is considered obvious over Kal as modified by Wang and Whee as addressed in the base claim as it would have been obvious to apply the further teaching of Kal, Wang, and/or Whee to the modified device of Kal, Wang and Whee.

In re Claim 7, Kal in view of Wang in view of Whee teaches or suggests wherein the speaker is identified based on posture analysis (Kal: col. 3: ll. 46-51: recognition of body position and motion); (Wang: ¶ 308: system includes a pose recognizer). The claim is considered obvious over Kal as modified by Wang and Whee as addressed in the base claim as it would have been obvious to apply the further teaching of Kal, Wang, and/or Whee to the modified device of Kal, Wang and Whee.

In re Claim 8, Kal in view of Wang in view of Whee teaches or suggests wherein the speaker is identified based on semantic analysis of the linguistic representation of the recorded speech (Kal: see col. 4: ll. 15-32: linguistic and semantic context analyzed); (Wang: ¶ 111-113, 202, 213-223, 239-247, 302: system identifies a user based on voice biometrics, etc., the biometrics operable in concert with understanding and interpretation components and operable to determine a particular user as well as the constraints on said user in operation of the system). It would have been obvious to one of ordinary skill in the art before the effective filing date of the instant application to include aspects of the Kal and Wang in view of Whee taught semantic analysis within the speaker recognition method. The average skilled practitioner would have been motivated to do so for at least the purpose of user verification, identification, preference and constraint processing, low compute determination of intent, etc. and would have expected only predictable results therefrom.

In re Claim 9, Kal in view of Wang in view of Whee teaches or suggests wherein converting the audio data includes filtering candidate linguistic representations of the recorded speech based on a corpus associated with the identified speaker (Kal: cols. 3-4: ll. 10-56 and cols. 9-10: ll. 62-48; see also col. 13: ll. 53-57: system maintains a dictionary of recognized utterances); (Wang: ¶ 111-113, 137-142, 202, 213-223, 239-247, 302, 348: speech filtered in concert with a corpus of commands operable in concert with user preferences to allow only particular users to issue commands within the corpus, that is, a list of commands is persisted with respect to a user to whom those commands are relevant; the system also persists a user dialog history, domain ontology and corpus by which the system curates the user intent upon the ontology using the corpus). The claim is considered obvious over Kal as modified by Wang and Whee as addressed in the base claim as it would have been obvious to apply the further teaching of Kal, Wang, and/or Whee to the modified device of Kal, Wang and Whee.

In re Claim 10, Kal in view of Wang in view of Whee teaches or suggests wherein the audio data is converted to a natural language linguistic representation via a previously-trained natural language machine (Kal: see cols. 3-4: ll. 10-56 and cols. 9-10: ll. 62-48; see also col. 13: ll. 53-57); (Wang: ¶ 73-79, 91, 110-122, etc.: system operates and teaches the utility of well-known, trained natural language models for the processing of an input user speech). The claim is considered obvious over Kal as modified by Wang and Whee as addressed in the base claim as it would have been obvious to apply the further teaching of Kal, Wang, and/or Whee to the modified device of Kal, Wang and Whee.

In re Claim 11, Kal discloses a method, data structure, etc.  to store semantically resolved data (Kal: see FIGS. 1A and 1D; col. 2: ll. 9-39; and cols. 2-3: ll. 40-9) in a data structure in a computer system (Kal: see FIGS. 7-8 and cols. 10-13: ll. 49-57), the method comprising: 
receiving audio data recording speech from one or more speakers (Kal: see FIG. 1A: via sensors 102 and cols. 3-4: ll. 10-56); 
converting the audio data into a linguistic representation comprising words or phonemes corresponding to the recorded speech (Kal: see FIG. 1A: linguistic features 111 and cols. 3-4: ll. 10-56);  
receiving non-linguistic data (Kal: see FIG. 1A: context features 105, acoustic features 107, visual features 109, physical features 113) associated with at least one of the one or more speakers (see cols. 3-4: ll. 10-56); 
via a sensor-fusion (as per ¶ 154, etc. of the instant specification, sensor fusion is considered a cooperative application of plural sensory or contextual inputs; as such, any two of the inputs disclosed by Kal may comprise the recited “sensor-fusion”) machine-learning model trained previously for user emotional detection (Kal: cols. 2: ll. 32-47; 7: ll. 34-61; 9-10: ll. 62-48; 13: ll. 53-57: multimodal analysis system uses linguistic cues such as the meaning of words and other contextual topics such as game state, play, etc. to determine, classify, etc. an underlying user emotional state and the relation of the emotional state to the meaning of the utterance), 
concurrently processing the words or phonemes of the linguistic representation and the non-linguistic data (Kal: id. and see also col. 7: ll. 34-40; FIG. 1A, 1D: via 112’) to distinguish one of the one or more speaker emotions based at least in part on linguistic and semantic analysis, representation, etc. (id.); and 
committing language data based on the linguistic representation to a data structure (Kal: see col. 7, ll. 42-64; cols. 9-10, ll. 17-48; col. 13, ll. 15-58; FIGS. 1A-1D; Table I, II: determined emotional state parameters, etc. added to a machine learning model for classification thereof subsequent dimensional reduction and determination, modification, etc. of a user emotional state; additionally, determined state and other input data used to update a data structure).

Thus Kal teaches a data structure comprising the necessary structure and parameters to perform the claimed subject matter but does not explicitly teach the data structure operative in a system trained to detect topics and assign corresponding linguistic representations to detected topics as claimed.

In a related field of endeavor Wang teaches a system and method comprising: receiving audio data recording speech from one or more speakers (Wang: ¶ 61-65, 85, 86; Fig 1: a virtual personal assistant operable to determine user intent, user emotional state and voice characteristics based on analysis of input user speech and other user parameters); 
converting the audio data into a linguistic representation comprising words or phonemes corresponding to the recorded speech (Wang: ¶ 61-65, 85, 86, 213-216; Fig 1: system accepts audio input and determines words therein, vocal characteristics thereof); 
receiving non-linguistic data associated with at least one of the one or more speakers (id., e.g., the determination of voice biometrics, characteristics, etc. as well as facial expressions, etc.); 
via a sensor-fusion machine learning model trained previously for topic identification, detection, etc., concurrently processing the words or phonemes of the linguistic representation and the non-linguistic data to detect a topic corresponding to the linguistic representation (Wang: ¶ 85, 111-115, 175-183, 213-216, 222, 223, 239-242, 245-256; Fig 5, 10, 18: multi-modal interpretation component identifies a speaker and determines intent thereof, the determination of intent utilizing linguistic components of the speech such as speech audio, passphrase, etc. as well as non-linguistic components of the speech such as voice biometrics, statistics, and/or other voice characteristics, components, parameters based on pre trained voice recognition models, the determination of intent encompassing particular topics, learned rules based thereon and user relations thereto); and 
committing language data based on the linguistic representation, the language data identifying the topic detected (id. and ¶ 3, 4, 19-21, etc.: intent determined with respect to a user command or other intent upon structures such as that of Analyzer 1800, which provides, outputs, etc. speaker identification and input speech content intent to other devices and/or systems by output of a command and speaker determination at 1838, said command and speaker data stored or persisted in the computer system at least in the form of output to a buffer, memory, etc.).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the instant application to augment the Kal taught data structure to accommodate the various modalities (facial, voiceprint, body language, ontology, etc. recognition) of the Wang system and method including the determination, identification of a speaker, and/or an intent, topic, etc. thereof. The average skilled practitioner would have been motivated to do so for at least the purpose of: optimizing particular signal processing or system operations with respect to an identified speaker, performing commands based thereon, etc. and would have expected only predictable results therefrom.

Kal in view of Wang can be considered to strongly suggest the recited subject matter inasmuch as the speaker dependent and speaker adapted models of Wang necessarily bear identifying data of the particular speaker upon which the model depends (see at least Wang ¶ 220-223, etc.) and by which a particular model may be invoked based on input data including audio input analysis, video input analysis and/or tactile analysis of a user input, said input data operative to perform command and speaker determinations with regard to a specific identified user (Wang: ¶ 191-198, 237, 252-256, etc.; fig 11, 18, etc.), wherein said model determines a user intent which is considered to meet the broadest reasonable interpretation of the recited topic (that is, something about which a user may ask, request or otherwise instruct a computer to accomplish). As such, Kal in view of Wang implicitly associates a speaker dependent and/or speaker adaptive model with a particular identified speaker and intent, topic, etc. thereof but does not explicitly teach the committing, storing, saving, etc. to a data structure of language data identifying the topic detected, merely the reading and output of such data if extant. Thus Kal in view of Wang is not considered to explicitly recite recording topic data in a data structure; Kal in view of Wang at best shows output of such data (see Wang: figures 5, 10, etc.).

In a related field of endeavor Whee teaches a system and method for determining parameters of a user utterance the determined parameters including determining of a user identity based on recognition processing of the utterance (Whee: Col. 25, ll. 65 – Col. 26, ll. 33), wherein the system operates to store semantically resolved data in a data structure in a computer system, the method comprising: receiving audio data recording speech from one or more speakers (id.: system receives audio data corresponding to a spoken utterance of a user); 
converting the audio data into a linguistic representation comprising words or phonemes corresponding to the recorded speech (see Whee: Col. 25, ll. 65 – Col. 26, ll. 33; figs 12B, 13B, 14A, etc.: server performs ASR, speech processing, etc. on a received utterance, audio data thereof, to determine a user that spoke the utterance, a profile associated with the user, the identity of a device of the user and a user profile based thereon); 
receiving non-linguistic data associated with at least one of the one or more speakers (see Whee: Col. 25, ll. 65 – Col. 26, ll. 33; figs 12A, 13A, 14B, etc.: server receives input audio from a user speech device and additionally receives linguistic data in the form of content text as well as non-linguistic user metadata from an application server); 
and variously utilizing data structures for detecting, determining, communicating, etc. the user speech, text data; user metadata; topic identification data; target computer data, etc. upon, among, etc. the server, application server, user device, etc. (see Whee: Col. 6, ll. 46-56, Col. 13, ll. 32-46, Col. 20, ll. 22-53, Col. 21, ll. 38-Col. 22, ll. 34, etc.; figs 2, 3, 5, 6, 8, etc.).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the instant application to adopt well-known data structures such as those utilized by Whee within the Kal in view of Wang system and method. The average skilled practitioner would have been motivated to do so for at least the purpose of persisting, communicating, etc. determined data such as the Kal emotional state data, Wang speaker recognition and intent data, Whee speaker intent and target computer data etc. and would have expected only predictable results therefrom.

In re Claim 12, Kal in view of Wang in view of Whee teaches or suggests wherein converting the audio data includes filtering based on semantic comparison of the linguistic representation against the detected topic (Kal: cols. 3-4: ll. 10-56 and cols. 9-10: ll. 62-48; see also col. 13: ll. 53-57); (Wang: ¶ 111-115, 157, 213-216, 222, 223, 239-242, 247, 300, 325).  The claim is considered obvious over Kal as modified by Wang and Whee as addressed in the base claim as it would have been obvious to apply the further teaching of Kal, Wang, and/or Whee to the modified device of Kal, Wang and Whee.

In re Claim 13, Kal in view of Wang in view of Whee teaches or suggests wherein the topic is detected in a trained machine-learning module by semantic analysis of the linguistic representation (Kal: see col. 4: ll. 15-32: linguistic and semantic context analyzed); (Wang: ¶ 111-113, 202, 213-223, 239-247, 302: system identifies a user based on voice biometrics, etc., the biometrics operable in concert with understanding and interpretation components and operable to determine a particular user as well as the constraints on said user in operation of the system). The claim is considered obvious over Kal as modified by Wang and Whee as addressed in the base claim as it would have been obvious to apply the further teaching of Kal, Wang, and/or Whee to the modified device of Kal, Wang and Whee.

In re Claim 14, Kal in view of Wang in view of Whee teaches or suggests further comprising identifying a speech target corresponding to the linguistic representation (Kal: cols. 3-4: ll. 10-56 and cols. 9-10: ll. 62-48); (Wang: ¶ 63, 199, 201, etc.). It would have been obvious to one of ordinary skill in the art before the effective filing date of the instant application to augment the Kal in view of Wang in view of Whee system, method, etc. to include the Wang taught method to resolve a particular target such as that determined using the linguistic, image, etc. analysis of Wang. The average skilled practitioner would have been motivated to do so for the purpose of combining known elements to achieve known results and would have expected only predictable results therefrom.

In re Claim 15, Kal in view of Wang in view of Whee teaches or suggests wherein the speech target is identified based on posture analysis (Kal: col. 3: ll. 46-51: recognition of body position and motion); (Wang: ¶ 63, 199, 201, etc.: the system operates to perform pose analysis and image analysis to determine a potential target). It would have been obvious to one of ordinary skill in the art before the effective filing date of the instant application to utilize the Wang taught posture and pose analyses to resolve a particular target such as that determined using the image analysis of Wang in the Kal, Wang and Whee system, method, etc. The average skilled practitioner would have been motivated to do so for the purpose of combining known elements to achieve known results and would have expected only predictable results therefrom.

In re Claim 16, Kal in view of Wang in view of Whee teaches or suggests wherein the speech target is identified based on facial recognition (see col. 3: ll. 46-51); (Wang: ¶ 111-113, 202,  213-223, 239-247: a speaker recognition engine includes facial recognition engine 1322). The claim is considered obvious over Kal as modified by Wang and Whee as addressed in the base claim and claim 15 supra as it would have been obvious to apply the further teaching of Kal, Wang, and/or Whee to the modified device of Kal, Wang and Whee.

In re Claim 17, Kal in view of Wang in view of Whee teaches or suggests wherein the speech target includes the computer system (Kal: FIGS. 7-8 and cols. 10-13: ll. 49-57); (Wang: ¶ 111-115, 157, 213-216, 222, 223, 239-242, 247, 300, 325: the speech target determination includes the computer system as well as indicating the computer system of a target, such as the phone thereof). It would have been obvious to one of ordinary skill in the art before the effective filing date of the instant application to utilize the Wang taught computer system on a plurality of computer systems of a plurality of users of the Kal in view of Wang in view of Whee system and method. The average skilled practitioner would have been motivated to do so for the purpose of  distributing tools whereby users might communicate over distances, and/or across geographical, language, contextual, etc. diversities and would have expected only predictable results therefrom.

In re Claim 18, Kal in view of Wang in view of Whee teaches or suggests further comprising backfilling previously unresolved linguistic elements of the data structure based on the identified speech target (Kal: col. 3: ll. 10-33, col. 4: ll. 15-32, cols. 9-10: ll. 17-48, col. 13: ll. 53-57: the system maintains a dictionary based on recognized and/or disambiguated speech); (Wang: ¶ 403, 416, 426, etc.: system resolves ambiguities, requests disambiguation, etc.). It would have been obvious to one of ordinary skill in the art before the effective filing date of the instant application to utilize the Kal, Wang, or Whee taught methods of disambiguation to maintain the Kal in view of Wang in view of Whee dictionary, data structure, etc. The average skilled practitioner would have been motivated to do so for the purpose of backfilling a dictionary, data structure, etc. based on learned words, contexts, etc. and would have expected predictable results therefrom.

In re Claim 19, Kal in view of Wang in view of Whee teaches or suggests further comprising backfilling previously unresolved linguistic elements of the data structure based on the detected topic (Kal: col. 3: ll. 10-33, col. 4: ll. 15-32, cols. 9-10: ll. 17-48, col. 13: ll. 53-57); (Wang: ¶ 111-115, 157, 213-216, 222, 223, 239-242, 247, 300, 325, 403, 416, 426). The claim is considered obvious over Kal as modified by Wang and Whee as addressed in the base claim and claim 18 supra as it would have been obvious to apply the further teaching of Kal, Wang, and/or Whee to the modified device of Kal, Wang and Whee.

In re Claim 20, Kal discloses a method, data structure, etc. to store semantically resolved language data (Kal: see FIGS. 1A and 1D; col. 2: ll. 9-39; and cols. 2-3: ll. 40-9) in a data structure in a computer system (Kal: see FIGS. 7-8 and cols. 10-13: ll. 49-57: one or more context features 105, acoustic features 107, visual features 109, linguistic features 111, and physical features 113 of a user may be derived from signals obtained by one or more sensors 102 to determine an emotional state of the user(s) during issue of instructions to a computer and augment the response of the computer thereby), and thereby to execute computer-actionable directives in concert with information conveyed in human speech (Kal: FIGS. 1A, 1D; cols. 3-4: ll. 10-56; cols. 9-10: ll. 17-48, col. 13, ll. 15-58: system operates to maintain a user state and optionally execute signal processing, speech recognition, etc. instructions and update a data structure based on determined state and other input data), the method comprising: 
receiving audio data recording speech from one or more speakers (Kal: see FIG. 1A: via sensors 102 and cols. 3-4: ll. 10-56); 
converting the audio data into a linguistic representation comprising words or phonemes corresponding to the recorded speech (Kal: see FIG. 1A: linguistic features 111 and cols. 3-4: ll. 10-56);  
receiving non-linguistic data (Kal: see FIG. 1A: context features 105, acoustic features 107, visual features 109, physical features 113) associated with at least one of the one or more speakers (see cols. 3-4: ll. 10-56); 
detecting a targeted emotion class corresponding to the linguistic representation (Kal: cols. 3-4: ll. 10-56 and cols. 9-10: ll. 62-48); via a sensor-fusion machine-learning model trained previously for target emotion identification (Kal: cols. 2: ll. 32-47; 7: ll. 34-61; 9-10: ll. 62-48; 13: ll. 53-57: multimodal analysis system uses linguistic cues such as the meaning of words and other contextual topics such as game state, play, etc. to determine, classify, etc. an  underlying user emotional state and the relation of the emotional state to the meaning of the utterance), 
to identify the target state by concurrently processing the words or phonemes of the linguistic representation and the non-linguistic data (Kal: id. and see also col. 7: ll. 34-40; FIG. 1A, 1D: via 112’) to identify a target emotion corresponding to the linguistic representation (Kal: cols. 2: ll. 32-47; 7: ll. 34-61; 9-10: ll. 62-48; 13: ll. 53-57); 
committing to the data structure language data associated with the identified target emotion and based on the linguistic representation (Kal: id.: via determined emotional state 115 and/or change state 110 and in concert with linguistic and semantic data).

Kal does not explicitly teach detecting, identifying, etc. a target computer corresponding to the linguistic representation via a sensor-fusion machine-learning model trained previously for target identification; to thereby identify a target computer from among plural targets of the recorded speech corresponding to the linguistic representation; committing to the data structure language data associated with the identified target computer and based on the linguistic representation; parsing the data structure to identify in the language data one or more of the computer actionable directives actionable by a computer identified as the target; and submitting the one or more directives to the target computer for processing.

In a related field of endeavor Wang teaches
A method to execute computer-actionable directives conveyed in human speech (Wang: ¶ 61-65, 85, 86; Fig 1: a virtual personal assistant operable to determine user intent, user emotional state and voice characteristics based on analysis of input user speech and other user parameters and to execute commands therein based thereon);
the method comprising: receiving audio data recording speech from one or more speakers (Wang: id); 
converting the audio data into a linguistic representation comprising words or phonemes corresponding to the recorded speech (Wang: ¶ 61-65, 85, 86, 213-216; Fig 1: system accepts audio input and determines words therein, vocal characteristics thereof);
receiving non-linguistic data associated with at least one of the one or more speakers  (id.  e.g. the determination of voice biometrics, statistics, characteristics etc. as well as facial expressions etc.); via a sensor-fusion machine-learning model trained previously for target identification (id.  e.g. the determination of voice biometrics, characteristics etc. as well as facial expressions etc.),  concurrently processing the words or phonemes of the linguistic representation and the non-linguistic data (Wang: ¶ 85, 111-113, 213-216, 222, 223, 239-242, 245-256; Fig 18: multi-modal interpretation component identifies a speaker and determines intent thereof, the determination of intent utilizing linguistic components of the speech such as speech audio, passphrase, etc. as well as non-linguistic components of the speech such as voice biometrics, statistics and/or other voice characteristics, components, parameters, etc. based on pre trained voice recognition models, etc. the determination of intent encompassing particular topics, learned rules based thereon and user relations thereto and operative to instruct the computer to perform particular commands)
to identify a target computer corresponding to the linguistic representation (Wang: ¶ 63, 113, 199, 201, 424, 473, 528-534, etc.: system operates to determine an object, person, or other target of a command intention in a string of sounds such as “please call John”, executing a command with respect to the target computer, phone, etc. of John and/or determining the target of a term “him” in the command “call him” or to access operations of a service robot) from among plural targets of the recorded speech (id. the computer, phone, etc. target corresponding to the target John, him, etc. and/or access particular target operations of a target service robot); 
committing to a data structure language data associated with the identified target computer thereof, and based on the linguistic representation (Wang: ¶ 85, 111-113, 213-216, 222, 223, 239-242, 245-256, 528-534; Fig 18: system accesses previously trained recognition models with respect to an identified, determined, etc. intent, topics, etc., processes user input language based thereon and updates a structure in memory therewith; in this way the system issues a directive to the target computer of John or him to ring, and/or issues a directive to a service robot to perform one or more directives borne in the user language data); parsing the data structure to identify in the language data one or more directives actionable by the target computer (Wang: ¶ 199, 424, 473, 528-534: system performs an action with respect to a recognized target and computer associated therewith such as receiving a phone call, activating functionality of a service robot, etc.); 
and submitting the one or more directives to the target computer for processing (Wang: ¶ 528-534: spoken commands operate particular functionality on the service robot).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the instant application to augment the Kal taught data structure to accommodate the various modalities (facial, voiceprint, body language, ontology, target, etc. recognition, resolution, etc.) of the Wang system and method including the determination of intent, topic, etc., the determination of a command object or target and the execution of an instruction based thereon. The average skilled practitioner would have been motivated to do so for at least the purpose of: optimizing particular signal processing with respect to executing an intent of an identified speaker, with respect to an object, target, etc. thereof and would have expected only predictable results therefrom.

Kal in view of Wang does not explicitly teach committing language data based on the linguistic representation to a data structure, the language data being associated with the target computer.

In a related field of endeavor Whee teaches a system and method for determining parameters of a user utterance the determined parameters including determining of a user identity based on recognition processing of the utterance (Whee: Col. 25, ll. 65 – Col. 26, ll. 33), wherein the system operates to store semantically resolved data in a data structure in a computer system, the method comprising: receiving audio data recording speech from one or more speakers (id.: system receives audio data corresponding to a spoken utterance of a user); 
converting the audio data into a linguistic representation comprising words or phonemes corresponding to the recorded speech (see Whee: Col. 25, ll. 65 – Col. 26, ll. 33; figs 12B, 13B, 14A, etc.: server performs ASR, speech processing, etc. on a received utterance, audio data thereof, to determine a user that spoke the utterance, a profile associated with the user, the identity of a device of the user and a user profile based thereon); 
receiving non-linguistic data associated with at least one of the one or more speakers (see Whee: Col. 25, ll. 65 – Col. 26, ll. 33; figs 12A, 13A, 14B, etc.: server receives input audio from a user speech device and additionally receives linguistic data in the form of content text as well as non-linguistic user metadata from an application server); 
and variously utilizing data structures for detecting, determining, communicating, etc. the user speech, text data; user metadata; topic identification data; target computer data, etc. upon, among, etc. the server, application server, user device, etc. (see Whee: Col. 6, ll. 46-56, Col. 13, ll. 32-46, Col. 20, ll. 22-53, Col. 21, ll. 38-Col. 22, ll. 34, etc.; figs 2, 3, 5, 6, 8, etc.).
It would have been obvious to one of ordinary skill in the art before the effective filing date of the instant application to adopt well-known data structures such as those utilized by Whee within the Kal in view of Wang system and method. The average skilled practitioner would have been motivated to do so for at least the purpose of persisting, communicating, etc. determined data such as the Kal emotional state data, Wang speaker recognition and intent data, Whee speaker intent and target computer data etc. and would have expected only predictable results therefrom.

Response to Arguments
Applicant's arguments filed 6/15/22 have been fully considered but are not persuasive in part; nevertheless, in the interest of compact prosecution, they are addressed below.
Applicant’s arguments, see Claims and Remarks, filed 6/15/22, with respect to the rejection(s) of claim(s) 1, 11, 20 under  35 USC 103 over Kalini-Akbacak in view of Wang have been fully considered and are persuasive.  Therefore, the rejection has been withdrawn.  However, upon further consideration, a new ground(s) of rejection is made in view of Kalini-Akbacak, Wang and Wheeler.

Applicant argues, with regard to Kalini-Akbacak (Kal), that Kal “uses the ML model only to determine the emotional state.” Applicant holds that acoustical features of voice may not be salient with respect to voice recognition as well as speaker identification but provides no evidence therefor. In this regard Applicant holds that the parameters, features, etc. of Kal are insufficient for speaker recognition. Further Applicant objects to Examiner’s assertion that acoustical features of voice are salient in voice recognition as well as speaker identification but provides no evidence therefor. In this regard Applicant holds that the data of Kal is insufficient for speaker recognition. Applicant presents arguments with respect to “labels.” Applicant argues that Kal in view of Wang does not teach the amended claimed subject matter particularly “committing language data to a data structure, the language data identifying the at least one of the one or more speakers.”
Applicant then points out perceived deficiencies in Wang. Particularly that “by separately illustrating and separately describing the functions of components 414 and 416, Wang reveals that sensor fusion is not the contemplated mechanism for speaker recognition,” and holds that the effect of the combination is to recognize speech from different speakers in a manner that is overall agnostic to the speaker. Applicant then proffers that Wang “does not disclose fusion of linguistic and non-linguistic inputs,” for speaker recognition nor for any purpose whatsoever. Applicant next convincingly argues that Wang “establishes that different speaker-recognition technologies were practiced in advance of Applicant's disclosure,” but alleges in conclusion that, “none of the technologies in Wang are sensor-fusion technologies…” and holds that a combination of Kal and Wang would be incorrect based on MPEP 2143.01 VI.

	Examiner respectfully disagrees. While Examiner appreciates Applicant’s construal of the claims and references, Applicant must appreciate that it represents one among myriad possible construals of the references and claimed subject matter, namely Applicant’s preferred construal. In Examiner’s broadly reasonable construal Kal discloses a plurality of ML models; indeed to ultimately assess a user emotional state, but also to classify features, models, etc. of the Kal system and method collected by a plurality of sensors (see Kal: Fig 1A-1D; Table 1). Kal teaches extraction of acoustical features sufficient for voice recognition, speaker identification and a variety of other learning tasks at least in the form of Mel Frequency Cepstral Coefficients (MFCC). Indeed the instant specification, Kal and Wang all disclose extraction of acoustic features, vectors thereof, from audio; said features comprising a well-known parameter set, particularly mel frequency cepstral coefficients (MFCC: compare the specification as filed 44-46; Kal: Col 3, ll. 33-45, etc.; Wang: ¶ 309, etc.: “…voice recognition algorithms may use Mel-frequency cepstral coefficients to identify the speaker of particular vocal features.”). Applicant next presents arguments with respect to “labels” which are unclaimed and unnecessary to any broadly reasonable interpretation of the instant claims.
In Examiner’s broadly reasonable interpretation, based on but not importing language from the specification as filed, sensor fusion is considered in light of ¶ 154 of the instant specification as “plural sensory or contextual inputs.” Examiner must consider sensor fusion as one or more sensory inputs and/or one or more contextual inputs. Further, absent clear definition or claiming, the plural inputs need not occupy a particular position in the signal processing chain. Kal and Wang each separately disclose a sensor fusion system which utilizes plural sensory or contextual inputs. In the case of the Kal emotion detection system, plural sensory and/or contextual inputs are shown in figures 1A-1D and drive a machine learning model, plural classifier, etc. In the Wang system and method for speaker determination, recognition, etc., plural audio, video, tactile, contextual, etc. inputs are shown in figures 12-15 and particularly in the speaker identification system of figure 18. Further, the Wang taught speaker determination method of figure 18 uses acoustic speech features such as MFCC in concert with neural network determined contextual features to analyze input audio samples and determine a speaker thereof. Thus the speaker recognizer of figure 18 accepts both sensory and contextual inputs in the form of a speech model and statistics respectively (Wang: ¶ 113, 245-256; Fig 18, etc.: system identifies speaker and subsequently determines speaker topic, intent, target, etc.). As such both Kal and Wang are considered to operate sensor fusion in as much as they each utilize plural sensory or contextual inputs.
MPEP 2143.01 VI states that the proposed modification cannot change the principles of operation of the reference. Kal in view of Wang does not alter any principles of operation. Kal teaches a recognizer which utilizes a plurality of audio, video, etc. inputs to derive parameters of the input audio including acoustic parameters of audio in the form of MFCC of input speech and uses said inputs to determine a user emotional state (please see claim 1 supra). Wang additionally teaches a learning system which utilizes a plurality of audio, video, etc. inputs to derive parameters of the input audio including acoustic parameters of audio in the form of MFCC of input speech and more broadly determines a plurality of user parameters, said parameters incorporating a result from the determining of a user emotion (Wang: Fig 17, 20: emotion classifier 1712n, 2012n among a plurality of classified results of sensor fusion in the form of feature combining and conditioning). The Wang system additionally teaches that the input audio parameters and/or an acoustic model based thereon operate for speaker determination to jointly determine the content of input audio, the speaker thereof, and an intent, topic, target, etc. of the speech/speaker (Wang: ¶ 113, 245-256; fig 18) and thereby assign a particular identity to a particular command. As such Examiner considers Kal and Wang to be in a related field of endeavor and to be operable in concert at least in the employ of speech and other inputs to generate MFCC and contextual data for the purpose of identifying a speaker and speech, etc. parameters thereof. As such Applicant’s arguments with regard to claim 1 are not persuasive.
Applicant holds that “the thrust” of claim 1 “is to assign recorded words or phonemes to the appropriate speaker.” Examiner appreciates this interpretation; however, the claimed language makes this interpretation entirely optional. The claim states “committing language data based on the linguistic representation to a data structure, the language data identifying the at least one of the one or more speakers.” Examiner has accepted Applicant’s construal in as much as the rejection supra addresses speaker identification; however, the mere input of speech audio could be held to meet a broadly reasonable interpretation of committing language data based on a linguistic representation to a data structure in as much as audio samples are conveyed to a buffer, and further the audio samples may be considered to inherently comprise language data identifying at least one of one or more speakers in the form of particular voice-prints, particular language content and patterns, distinguishing cadences and inflections, etc. Nevertheless, Kal in view of Wang is not considered to explicitly recite recording language data with respect to an identified user in a data structure, as Kal in view of Wang at best shows output of speaker recognition data and is silent regarding the receipt, encoding, etc. of the output (see Wang: figures 12-15, 18, etc.: particularly the recognizer of fig 18 outputs language data based on the linguistic representation identifying at least one speaker but does not specify the composition or receiver of the output, presumably at least a data structure in the form of a buffer, memory or other downstream computational structure).

Applicant argues with respect to claim 11 that the prior art rejection over Kal and Wang does not teach topic detection. The specification as filed does not explicitly define a topic nor does it define an intent. For the purpose of the rejection supra Examiner considers a user intent substantially similar to a topic. As such Applicant’s arguments with regard to claim 11 are not persuasive. Nevertheless, as stated with respect to the arguments regarding claim 1, Kal in view of Wang is not considered to teach recording language data with respect to a topic, intent, etc. in a data structure.

Applicant argues with respect to claim 20 that the prior art rejection over Kal and Wang does not teach target identification. The specification as filed does not explicitly define a topic nor does it define an intent. For the purpose of the rejection supra Examiner considers a user intent substantially similar to a topic, target of a communication, etc. As such applicant’s arguments with regard to claim 20 are not persuasive. Nevertheless, as stated with respect to the arguments regarding claim 1, Kal in view of Wang is not considered to teach identification of a target computer nor recording language data with respect to a target computer in a data structure.

Conclusion

Applicant's amendment necessitated the new ground(s) of rejection presented in this Office action.  Accordingly, THIS ACTION IS MADE FINAL.  See MPEP § 706.07(a).  Applicant is reminded of the extension of time policy as set forth in 37 CFR 1.136(a).  
A shortened statutory period for reply to this final action is set to expire THREE MONTHS from the mailing date of this action.  In the event a first reply is filed within TWO MONTHS of the mailing date of this final action and the advisory action is not mailed until after the end of the THREE-MONTH shortened statutory period, then the shortened statutory period will expire on the date the advisory action is mailed, and any extension fee pursuant to 37 CFR 1.136(a) will be calculated from the mailing date of the advisory action.  In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. 


Any inquiry concerning this communication or earlier communications from the examiner should be directed to PAUL C MCCORD whose telephone number is (571)270-3701. The examiner can normally be reached 730-630 M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, VIVIAN CHIN, can be reached at (571)272-7848. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/PAUL C MCCORD/Primary Examiner, Art Unit 2654