Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

DETAILED ACTION
Claim Rejections - 35 USC § 103

In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.

Claims 1-20 are rejected under 35 U.S.C. 103 as being unpatentable over Kalinli-Akbacak (US 9031293, hereinafter Kal), and further in view of Wang: 20180314689 and further in view of Kim: 20130304476.

In re Claim 1, Kal discloses a method, data structure, etc. to store speaker-resolved language data (Kal: see FIGS. 1A-1D; col. 2: ll. 9-39; and cols. 2-3: ll. 40-9) in a data structure in a computer system (Kal: see FIGS. 7-8 and cols. 10-13: ll. 49-57), the method comprising: 
receiving audio data recording speech from one or more speakers (Kal: see FIG. 1A-1D: via sensors 102 and cols. 3-4: ll. 10-56); 
converting the audio data into a linguistic representation comprising words or phonemes corresponding to the recorded speech (Kal: see FIG. 1A: linguistic features 111 and cols. 3-4: ll. 10-56); 
receiving non-linguistic data (Kal: see FIG. 1A: context features 105, acoustic features 107, visual features 109, physical features 113) associated with at least one of the one or more speakers (see cols. 3-4: ll. 10-56); 
via a sensor-fusion (as per ¶ 154, etc. of the instant specification, sensor fusion is considered a cooperative application of plural sensory or contextual inputs; as such, any two of the inputs disclosed by Kal may comprise the recited “sensor-fusion”) machine-learning model (Kal: see Abstract; FIGS. 1A-1D; col. 1: ll. 13-34; col. 3: l. 10 to col. 4: l. 48; col. 7: ll. 34-40; Tables I, II: sensor data such as acquired visual, acoustic, etc. data such as from cameras, microphones, etc., said acquired data operative in context with additional data and generative of attendant features used to identify one or more emotional categories of a particular user state) trained previously for speaker emotion identification (Kal: see cols. 9-10: ll. 17-48),
concurrently processing the words or phonemes of the linguistic representation and the non-linguistic data (id., and see FIG. 1D: via 112’; the processing in FIG. 1A: 108 and FIG. 1D: 108x and 112’ is performed in parallel and is thus considered to meet the broadest reasonable interpretation of the recited concurrence) to distinguish one of the one or more speaker emotions of the at least one of the one or more speakers (id.); and
recording language data based on the linguistic representation to a data structure (Kal: see col. 7: ll. 42-64; cols. 9-10: ll. 17-48; col. 13: ll. 15-58; FIGS. 1A-1D; Tables I, II: determined emotional state parameters, etc. added to a machine learning model for classification thereof subsequent dimensional reduction and determination, modification, etc. of a user emotional state; additionally, determined state and other input data used to update a data structure).

Thus, Kal teaches a data structure comprising the necessary structure and parameters to perform the claimed subject matter, but for the determination of speaker emotion rather than speaker identification. As such, Kal does not explicitly discuss recording language data with respect to the identified speaker in a data structure based upon the linguistic representation, the language data identifying the at least one of the one or more speakers.
 
In a related field of endeavor Wang teaches a system and method comprising: receiving audio data recording speech from one or more speakers (Wang: ¶ 61-65, 85, 86; Fig 1: a virtual personal assistant operable to determine user intent, user emotional state and voice characteristics based on analysis of input user speech and other user parameters); 
converting the audio data into a linguistic representation comprising words or phonemes corresponding to the recorded speech (Wang: ¶ 61-65, 85, 86, 213-216; Fig 1: system accepts audio input and determines words therein, vocal characteristics thereof); 
receiving non-linguistic data associated with at least one of the one or more speakers (id., e.g., the determination of voice biometrics, characteristics, etc. as well as facial expressions, etc.); 
via a sensor-fusion machine learning model trained previously for speaker identification, concurrently processing the words or phonemes of the linguistic representation and the non-linguistic data to distinguish as an identified speaker the at least one of the one or more speakers from among the one or more speakers (Wang: ¶ 85-88, 111-115, 213-216, 222, 223, 239-242, 245-256, 307, 385-390; Fig 18: multi-modal interpretation component identifies a speaker and determines intent thereof, the speaker identification utilizing linguistic components of the speech such as a passphrase as well as non-linguistic components of the speech such as voice biometrics, image input, and/or other characteristics based on pre-trained voice recognition models, the data acquired from a plurality of input devices such as a microphone, camera, or other inputs and operative to ascertain, verify, etc. a user identity by at least the use of stored joint speaker models; the determination of intent utilizing linguistic components of the speech such as speech audio, passphrase, etc. as well as non-linguistic components of the speech such as voice biometrics, statistics and/or other voice characteristics, components, parameters, etc. based on pre-trained voice recognition models, etc., the determination of intent encompassing particular topics, learned rules based thereon and user relations thereto and operative to instruct the computer to perform particular commands; further, the various modules of Wang are disclosed as operative in parallel and simultaneously across modalities); and
recording language data with respect to the identified speaker, wherein the language data is based upon the linguistic representations (id. and ¶ 3, 4, 19, 92, 113, 223: Analyzer 1800 provides, outputs, etc. speaker identification and input speech content and context data to other devices and/or systems by output of a command and speaker determination at 1838, said command and speaker data stored or persisted in the computer system at least in the form of output to at least a buffer, memory, etc.).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the instant application to augment, update, etc. the Kal taught data structure to additionally accommodate the various modalities and determinations (facial, voiceprint, body language, ontology, etc. recognition) of the Wang speaker recognition method. The average skilled practitioner would have been motivated to do so for at least the purpose of: optimizing particular signal processing with respect to an identified speaker, iteratively performing speaker diarization, etc. and would have expected only predictable results therefrom.

Kal in view of Wang can be considered to strongly suggest the recited subject matter inasmuch as the speaker dependent and speaker adapted models of Wang necessarily bear identifying data of the particular speaker upon which the model depends (see at least Wang ¶ 220-223, etc.) and by which a particular model may be invoked based on input data including audio input analysis, video input analysis and/or tactile analysis of a user input, said input data operative to perform command and speaker determinations with regard to a specific identified user (Wang: ¶ 191-198, 237, 252-256, etc.; Figs. 11, 18, etc.). As such, Kal in view of Wang implicitly associates a speaker dependent and/or speaker adaptive model with a particular identified speaker but does not explicitly teach the committing, storing, saving, etc. to a data structure of language data identifying at least one of the one or more speakers, merely the reading and output of such data if extant. However, ¶ 322 of the instant specification does not require that a particular speaker identity be saved in an object array, array list, database, etc., but merely states that “In implementations in which the speaker is detected, the language data may be associated with the detected speaker. Here, speech-recognition method 240 may embody a method to store speaker-resolved language data in a data structure of the computer system. In implementations in which the topic is detected, the language data may be associated with the detected topic.” Thus, Kal in view of Wang is not considered to explicitly recite recording language data with respect to an identified user in a data structure; Kal in view of Wang at best shows output of speaker recognition data (see Wang: Figs. 12-15, 18, etc.).

In a related field of endeavor Kim teaches a system and method operable for receiving audio data recording speech from one or more speakers (Kim: Abstract; ¶ 79-81, Fig 8: system receives speech audio of plural users and determines speech data thereof such as linguistic representations in the form of a speaker identity, topic, target, etc.); 
converting the audio data into a linguistic representation comprising words or phonemes corresponding to the recorded speech (Kim: ¶ 65-68, 70-73; Fig 3, 5: system receives speech of plural users, extracts voice data, user identification, etc. thereof, as well as user contextual data including gaze, activity, etc. and thereby encodes speaker identity, participant id, etc.); 
receiving non-linguistic data associated with at least one of the one or more speakers (Kim: ¶ 65-73; Fig 3-5: system determines plural parameters of contextual data for the plurality of users including gaze direction, participation dynamics, locations, etc.); and
recording language data with respect to the identified speaker in a data structure, wherein the language data is based upon the linguistic representation (Kim: ¶ 65-68, 70-73; Fig 3, 5: user profiles identifying a user and comprising language data based on speaker identification, speaking dynamics and context, meeting topic, etc. stored, updated, maintained, etc. in a data structure).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the instant application to adopt well-known data structures comprising linguistic, non-linguistic and other user identifying data such as those utilized by Kim within the Kal in view of Wang system and method. The average skilled practitioner would have been motivated to do so for at least the purpose of persisting, communicating, etc. determined data such as the Kal emotional state data; Wang speaker recognition and intent data; Kim speaker, topic and target data etc. and would have expected only predictable results therefrom. It would have been further obvious to one of ordinary skill in the art before the effective filing date of the instant application to adapt the Kal in view of Wang machine learning algorithms to accommodate the Kim data as an input. The average skilled practitioner would have been motivated to do so for the purpose of processing multi-modal data in an analysis system and would have expected only predictable results therefrom.

In re Claim 2, Kal in view of Wang in view of Kim teaches or suggests wherein the linguistic representation and the non-linguistic data are channels differing in assessed confidence levels, and wherein the sensor-fusion machine-learning model is configured to weight the channels in dependence on the assessed confidence levels (Kal: col. 9: ll. 16-45; Fig 1D: system determines confidence level for plurality of channels comprising linguistic and non-linguistic representations, weights individual classifiers to maximize performance). The claim is considered obvious over Kal as modified by Wang and Kim as addressed in the base claim as it would have been obvious to one of ordinary skill in the art before the effective filing date of the instant application to utilize the Kal taught weighting of outputs of the various modules of the Kal, Wang and Kim modified device, method, etc. for at least the purpose of optimizing performance of the system, method, etc. The average skilled practitioner would have expected only predictable results therefrom.
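
By way of illustration only, the confidence-dependent channel weighting recited in Claim 2 may be sketched as follows. This is a minimal sketch under stated assumptions and is not drawn from Kal, Wang, Kim, or the instant specification; the function name fuse_channels, the channel names, and all numeric values are hypothetical.

```python
from typing import Dict


def fuse_channels(scores: Dict[str, Dict[str, float]],
                  confidence: Dict[str, float]) -> Dict[str, float]:
    """Combine per-channel class scores, weighting each channel by its confidence.

    Hypothetical illustration: `scores` maps channel -> {label: score} and
    `confidence` maps channel -> non-negative assessed confidence.
    """
    total = sum(confidence[ch] for ch in scores)
    if total <= 0:
        raise ValueError("at least one channel must have positive confidence")
    fused: Dict[str, float] = {}
    for ch, class_scores in scores.items():
        weight = confidence[ch] / total  # normalized channel weight
        for label, score in class_scores.items():
            fused[label] = fused.get(label, 0.0) + weight * score
    return fused


# Hypothetical example: the linguistic channel is trusted three times as much
# as the visual channel.
scores = {
    "linguistic": {"speaker_a": 0.9, "speaker_b": 0.1},
    "visual": {"speaker_a": 0.2, "speaker_b": 0.8},
}
confidence = {"linguistic": 3.0, "visual": 1.0}
fused = fuse_channels(scores, confidence)
best = max(fused, key=fused.get)
```

In this sketch the higher-confidence channel dominates the fused result, which is the general behavior the claim language describes; any actual weighting scheme in the cited references may differ.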

In re Claim 3, Kal in view of Wang in view of Kim teaches or suggests wherein the speaker is identified based on directional microphony (Kal: col. 3: ll. 34-45 and col. 4: ll. 49-56); (Wang: ¶ 111-113, 213-223, 239-247). The claim is considered obvious over Kal as modified by Wang and Kim as addressed in the base claim as it would have been obvious to apply the further teaching of Kal, Wang, and/or Kim to the modified device of Kal, Wang and Kim.

In re Claim 4, Kal in view of Wang in view of Kim teaches or suggests wherein the speaker is identified based on a voiceprint (Kal: col. 3: ll. 34-45 and cols. 9-10: ll. 62-48); (Wang: ¶ 111-113, 213-223, 239-247). The claim is considered obvious over Kal as modified by Wang and Kim as addressed in the base claim as it would have been obvious to apply the further teaching of Kal, Wang, and/or Kim to the modified device of Kal, Wang and Kim.

In re Claim 5, Kal in view of Wang in view of Kim teaches or suggests wherein identifying the speaker includes storing the voiceprint of the speaker during a calibration phase and matching the stored voiceprint to a post-calibration voiceprint acquired from the audio data (Kal: cols. 9-10: ll. 17-48); (Wang: ¶ 111-113, 202,  213-223, 239-247). The claim is considered obvious over Kal as modified by Wang and Kim as addressed in the base claim as it would have been obvious to apply the further teaching of Kal, Wang, and/or Kim to the modified device of Kal, Wang and Kim.
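
By way of illustration only, the calibration-phase storage and post-calibration matching recited in Claim 5 may be sketched as follows. The sketch assumes, hypothetically, that a voiceprint is a fixed-length numeric vector compared by cosine similarity; the class VoiceprintStore, its methods, and the threshold value are illustrative names and values not found in the cited references.

```python
import math
from typing import Dict, List, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class VoiceprintStore:
    """Hypothetical store of voiceprints enrolled during a calibration phase."""

    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold  # assumed minimum similarity for a match
        self._enrolled: Dict[str, List[float]] = {}

    def calibrate(self, speaker_id: str, voiceprint: Sequence[float]) -> None:
        """Store the speaker's voiceprint (calibration phase)."""
        self._enrolled[speaker_id] = list(voiceprint)

    def identify(self, voiceprint: Sequence[float]) -> Optional[str]:
        """Return the best-matching enrolled speaker, or None if below threshold."""
        best_id, best_score = None, self.threshold
        for speaker_id, stored in self._enrolled.items():
            score = cosine(stored, voiceprint)
            if score >= best_score:
                best_id, best_score = speaker_id, score
        return best_id


store = VoiceprintStore()
store.calibrate("alice", [1.0, 0.0, 0.0])
store.calibrate("bob", [0.0, 1.0, 0.0])
matched = store.identify([0.9, 0.1, 0.0])    # close to alice's stored voiceprint
unmatched = store.identify([0.0, 0.0, 1.0])  # resembles no enrolled speaker
```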

In re Claim 6, Kal in view of Wang in view of Kim teaches or suggests wherein the speaker is identified based on face recognition (Kal: col. 3: ll. 46-51); (Wang: ¶ 111-113, 202,  213-223, 239-247: a speaker recognition engine includes facial recognition engine 1322). The claim is considered obvious over Kal as modified by Wang and Kim as addressed in the base claim as it would have been obvious to apply the further teaching of Kal, Wang, and/or Kim to the modified device of Kal, Wang and Kim.

In re Claim 7, Kal in view of Wang in view of Kim teaches or suggests wherein the speaker is identified based on posture analysis (Kal: col. 3: ll. 46-51: recognition of body position and motion); (Wang: ¶ 308: system includes a pose recognizer). The claim is considered obvious over Kal as modified by Wang and Kim as addressed in the base claim as it would have been obvious to apply the further teaching of Kal, Wang, and/or Kim to the modified device of Kal, Wang and Kim.

In re Claim 8, Kal in view of Wang in view of Kim teaches or suggests wherein the speaker is identified based on semantic analysis of the linguistic representation of the recorded speech (Kal: see col. 4: ll. 15-32: linguistic and semantic context analyzed); (Wang: ¶ 111-113, 202, 213-223, 239-247, 302: system identifies a user based on voice biometrics, etc., the biometrics operable in concert with understanding and interpretation components and operable to determine a particular user as well as the constraints on said user in operation of the system). It would have been obvious to one of ordinary skill in the art before the effective filing date of the instant application to include aspects of the Kal and Wang in view of Kim taught semantic analysis within the speaker recognition method. The average skilled practitioner would have been motivated to do so for at least the purpose of user verification, identification, preference and constraint processing, low compute determination of intent, etc. and would have expected only predictable results therefrom.

In re Claim 9, Kal in view of Wang in view of Kim teaches or suggests wherein converting the audio data includes filtering candidate linguistic representations of the recorded speech based on a corpus associated with the identified speaker (Kal: cols. 3-4: ll. 10-56 and cols. 9-10: ll. 62-48; see also col. 13: ll. 53-57: system maintains a dictionary of recognized utterances); (Wang: ¶ 111-113, 137-142, 202, 213-223, 239-247, 302, 348: speech filtered in concert with a corpus of commands operable in concert with user preferences to allow only particular users to issue commands within the corpus, that is, a list of commands is persisted with respect to a user to whom those commands are relevant; the system also persists a user dialog history, domain ontology and corpus by which the system curates the user intent upon the ontology using the corpus). The claim is considered obvious over Kal as modified by Wang and Kim as addressed in the base claim as it would have been obvious to apply the further teaching of Kal, Wang, and/or Kim to the modified device of Kal, Wang and Kim.
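
By way of illustration only, filtering candidate linguistic representations against a corpus associated with an identified speaker, as recited in Claim 9, may be sketched as follows. The sketch assumes, hypothetically, that the corpus is a set of words and that a candidate is kept when a minimum fraction of its words appears in that set; the function filter_candidates and the min_overlap parameter are illustrative and are not taken from the cited references.

```python
from typing import Iterable, List, Set


def filter_candidates(candidates: Iterable[str],
                      corpus: Set[str],
                      min_overlap: float = 0.5) -> List[str]:
    """Keep candidate transcriptions whose words sufficiently overlap the corpus.

    Hypothetical heuristic: overlap is the fraction of a candidate's words
    found in the speaker's corpus.
    """
    kept: List[str] = []
    for text in candidates:
        words = text.lower().split()
        if not words:
            continue  # skip empty candidates
        overlap = sum(w in corpus for w in words) / len(words)
        if overlap >= min_overlap:
            kept.append(text)
    return kept


# Hypothetical corpus persisted for one identified speaker.
speaker_corpus = {"recognize", "speech", "model"}
candidates = ["recognize speech", "wreck a nice beach"]
kept = filter_candidates(candidates, speaker_corpus)
```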

In re Claim 10, Kal in view of Wang in view of Kim teaches or suggests wherein the audio data is converted to a natural language linguistic representation via a previously-trained natural language machine (Kal: see cols. 3-4: ll. 10-56 and cols. 9-10: ll. 62-48; see also col. 13: ll. 53-57); (Wang: ¶ 73-79, 91, 110-122, etc.: system operates and teaches the utility of well-known, trained natural language models for the processing of an input user speech). The claim is considered obvious over Kal as modified by Wang and Kim as addressed in the base claim as it would have been obvious to apply the further teaching of Kal, Wang, and/or Kim to the modified device of Kal, Wang and Kim.

In re Claim 11, Kal discloses a method, data structure, etc.  to store semantically resolved data (Kal: see FIGS. 1A and 1D; col. 2: ll. 9-39; and cols. 2-3: ll. 40-9) in a data structure in a computer system (Kal: see FIGS. 7-8 and cols. 10-13: ll. 49-57), the method comprising: 
receiving audio data recording speech from one or more speakers (Kal: see FIG. 1A: via sensors 102 and cols. 3-4: ll. 10-56); 
converting the audio data into a linguistic representation comprising words or phonemes corresponding to the recorded speech (Kal: see FIG. 1A: linguistic features 111 and cols. 3-4: ll. 10-56);  
receiving non-linguistic data (Kal: see FIG. 1A: context features 105, acoustic features 107, visual features 109, physical features 113) associated with at least one of the one or more speakers (see cols. 3-4: ll. 10-56); 
via a sensor-fusion (as per ¶ 154, etc. of the instant specification, sensor fusion is considered a cooperative application of plural sensory or contextual inputs; as such, any two of the inputs disclosed by Kal may comprise the recited “sensor-fusion”) machine-learning model trained previously for user emotional detection (Kal: cols. 2: ll. 32-47; 7: ll. 34-61; 9-10: ll. 62-48; 13: ll. 53-57: multimodal analysis system uses linguistic cues such as the meaning of words and other contextual topics such as game state, play, etc. to determine, classify, etc. an underlying user emotional state and the relation of the emotional state to the meaning of the utterance), 
concurrently processing the words or phonemes of the linguistic representation and the non-linguistic data (Kal: id. and see also col. 7: ll. 34-40; FIG. 1A, 1D: via 112’) to distinguish one of the one or more speaker emotions based at least in part on linguistic and semantic analysis, representation, etc. (id.); and 
committing language data based on the linguistic representation to a data structure (Kal: see col. 7: ll. 42-64; cols. 9-10: ll. 17-48; col. 13: ll. 15-58; FIGS. 1A-1D; Tables I, II: determined emotional state parameters, etc. added to a machine learning model for classification thereof subsequent dimensional reduction and determination, modification, etc. of a user emotional state; additionally, determined state and other input data used to update a data structure).

Thus, Kal teaches a data structure comprising the necessary structure and parameters to perform the claimed subject matter but does not explicitly teach the data structure operative in a system trained to detect topics and assign corresponding linguistic representations to detected topics as claimed. As such, Kal does not explicitly discuss recording language data with respect to the detected topic in a data structure based upon the linguistic representation.

In a related field of endeavor Wang teaches a system and method comprising: receiving audio data recording speech from one or more speakers (Wang: ¶ 61-65, 85, 86; Fig 1: a virtual personal assistant operable to determine user intent, user emotional state and voice characteristics based on analysis of input user speech and other user parameters); 
converting the audio data into a linguistic representation comprising words or phonemes corresponding to the recorded speech (Wang: ¶ 61-65, 85, 86, 213-216; Fig 1: system accepts audio input and determines words therein, vocal characteristics thereof); 
receiving non-linguistic data associated with at least one of the one or more speakers (id., e.g., the determination of voice biometrics, characteristics, etc. as well as facial expressions, etc.); 
via a sensor-fusion machine learning model trained previously for topic identification, detection, etc., concurrently processing the words or phonemes of the linguistic representation and the non-linguistic data to detect a topic corresponding to the linguistic representation (Wang: ¶ 85, 111-115, 175-183, 213-216, 222, 223, 239-242, 245-256; Fig 5, 10, 18: multi-modal interpretation component identifies a speaker and determines intent thereof, the determination of intent utilizing linguistic components of the speech such as speech audio, passphrase, etc. as well as non-linguistic components of the speech such as voice biometrics, statistics, and/or other voice characteristics, components, parameters based on pre trained voice recognition models, the determination of intent encompassing particular topics, learned rules based thereon and user relations thereto); and 
committing language data based on the linguistic representation, the language data identifying the topic detected (id. and ¶ 3, 4, 19-21, etc.: intent determined with respect to a user command or other intent upon structures such as that of Analyzer 1800 which provides, outputs, etc. speaker identification and input speech content intent to other devices and/or systems by output of a command and speaker determination at 1838, said command and speaker data stored or persisted in the computer system at least in the form of output to a buffer, memory, etc.).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the instant application to augment the Kal taught data structure to accommodate the various modalities (facial, voiceprint, body language, ontology, etc. recognition) of the Wang system and method including the determination, identification of a speaker, and/or an intent, topic, etc. thereof. The average skilled practitioner would have been motivated to do so for at least the purpose of: optimizing particular signal processing or system operations with respect to an identified speaker, performing commands based thereon, etc. and would have expected only predictable results therefrom.

Kal in view of Wang can be considered to strongly suggest the recited subject matter inasmuch as the speaker dependent and speaker adapted models of Wang necessarily bear identifying data of the particular speaker upon which the model depends (see at least Wang ¶ 220-223, etc.) and by which a particular model may be invoked based on input data including audio input analysis, video input analysis and/or tactile analysis of a user input, said input data operative to perform command and speaker determinations with regard to a specific identified user (Wang: ¶ 191-198, 237, 252-256, etc.; Figs. 11, 18, etc.), wherein said model determines a user intent which is considered to meet the broadest reasonable interpretation of the recited topic (that is, something about which a user may ask, request or otherwise instruct a computer to accomplish). As such, Kal in view of Wang implicitly associates a speaker dependent and/or speaker adaptive model with a particular identified speaker and intent, topic, etc. thereof but does not explicitly teach the committing, storing, saving, etc. to a data structure of language data identifying the topic detected, merely the reading and output of such data if extant. Thus, Kal in view of Wang is not considered to explicitly recite recording topic data in a data structure; Kal in view of Wang at best shows output of such data (see Wang: Figs. 5, 10, etc.).

In a related field of endeavor Kim teaches a system and method operable for receiving audio data recording speech from one or more speakers (Kim: Abstract; ¶ 79-81, Fig 8: system receives speech audio of plural users and determines speech data thereof such as linguistic representations in the form of a speaker identity, topic, target, etc.); 
converting the audio data into a linguistic representation comprising words or phonemes corresponding to the recorded speech (Kim: ¶ 65-68, 70-73; Fig 3, 5: system receives speech of plural users, extracts voice data, user identification, etc. thereof, as well as user contextual data including gaze, activity, etc. and thereby encodes speaker identity, participant id, etc.); 
receiving non-linguistic data associated with at least one of the one or more speakers (Kim: ¶ 65-73; Fig 3-5: system determines plural parameters of contextual data for the plurality of users including gaze direction, participation dynamics, locations, meeting topic, etc.); and
recording language data with respect to the topic in the data structure, wherein the language data is based upon the linguistic representation (Kim: ¶ 65-68, 70-73; Fig 3, 5: user profiles identifying a user and comprising language data based on speaker identification, speaking dynamics and context, meeting topic, etc. stored, updated, maintained, etc. in a data structure).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the instant application to adopt well-known data structures comprising linguistic, non-linguistic and other user identifying data such as those utilized by Kim within the Kal in view of Wang system and method. The average skilled practitioner would have been motivated to do so for at least the purpose of persisting, communicating, etc. determined data such as the Kal emotional state data; Wang speaker recognition and intent data; Kim speaker, topic and target data etc. and would have expected only predictable results therefrom. It would have been further obvious to one of ordinary skill in the art before the effective filing date of the instant application to adapt the Kal in view of Wang machine learning algorithms to accommodate the Kim data as an input. The average skilled practitioner would have been motivated to do so for the purpose of processing multi-modal data in an analysis system and would have expected only predictable results therefrom.


In re Claim 12, Kal in view of Wang in view of Kim teaches or suggests wherein converting the audio data includes filtering based on semantic comparison of the linguistic representation against the detected topic (Kal: cols. 3-4: ll. 10-56 and cols. 9-10: ll. 62-48; see also col. 13: ll. 53-57); (Wang: ¶ 111-115, 157, 213-216, 222, 223, 239-242, 247, 300, 325).  The claim is considered obvious over Kal as modified by Wang and Kim as addressed in the base claim as it would have been obvious to apply the further teaching of Kal, Wang, and/or Kim to the modified device of Kal, Wang and Kim.

In re Claim 13, Kal in view of Wang in view of Kim teaches or suggests wherein the topic is detected in a trained machine-learning module by semantic analysis of the linguistic representation (Kal: see col. 4: ll. 15-32: linguistic and semantic context analyzed); (Wang: ¶ 111-113, 202, 213-223, 239-247, 302: system identifies a user based on voice biometrics, etc., the biometrics operable in concert with understanding and interpretation components and operable to determine a particular user as well as the constraints on said user in operation of the system); (Kim: ¶ 68-71, 75-81; Fig 4, 8: system determines and encodes topical data). The claim is considered obvious over Kal as modified by Wang and Kim as addressed in the base claim as it would have been obvious to apply the further teaching of Kal, Wang, and/or Kim to the modified device of Kal, Wang and Kim.

In re Claim 14, Kal in view of Wang in view of Kim teaches or suggests further comprising identifying a speech target corresponding to the linguistic representation (Kal: cols. 3-4: ll. 10-56 and cols. 9-10: ll. 62-48); (Wang: ¶ 63, 199, 201, etc.); (Kim: ¶ 68-71, 75-81; Fig 4, 8: system determines topical data and a conversational target as well as a device of each/any participant including a target). It would have been obvious to one of ordinary skill in the art before the effective filing date of the instant application to augment the Kal in view of Wang in view of Kim system, method, etc. to include the Wang taught method to resolve a particular target such as that determined using the linguistic, image, etc. analysis of Wang. The average skilled practitioner would have been motivated to do so for the purpose of combining known elements to achieve known results and would have expected only predictable results therefrom.

In re Claim 15, Kal in view of Wang in view of Kim teaches or suggests wherein the speech target is identified based on posture analysis (Kal: col. 3: ll. 46-51: recognition of body position and motion); (Wang: ¶ 63, 199, 201, etc.: the system operates to perform pose analysis and image analysis to determine a potential target). It would have been obvious to one of ordinary skill in the art before the effective filing date of the instant application to utilize the Wang taught posture and pose analyses to resolve a particular target such as that determined using the image analysis of Wang in the Kal, Wang and Kim system, method, etc. The average skilled practitioner would have been motivated to do so for the purpose of combining known elements to achieve known results and would have expected only predictable results therefrom.

In re Claim 16, Kal in view of Wang in view of Kim teaches or suggests wherein the speech target is identified based on facial recognition (Kal: see col. 3: ll. 46-51); (Wang: ¶ 111-113, 202, 213-223, 239-247: a speaker recognition engine includes facial recognition engine 1322). The claim is considered obvious over Kal as modified by Wang and Kim as addressed in the base claim and claim 15 supra as it would have been obvious to apply the further teaching of Kal, Wang, and/or Kim to the modified device of Kal, Wang and Kim.

In re Claim 17, Kal in view of Wang in view of Kim teaches or suggests wherein the speech target includes the computer system (Kal: FIGS. 7-8 and cols. 10-13: ll. 49-57); (Wang: ¶ 111-115, 157, 213-216, 222, 223, 239-242, 247, 300, 325: the speech target determination includes the computer system as well as indicating the computer system of a target, such as the phone thereof); (Kim: ¶ 68-71, 75-81; Fig 4, 8: system determines a topic and a conversational target as well as a device of each/any participant including a target). It would have been obvious to one of ordinary skill in the art before the effective filing date of the instant application to utilize the Wang taught computer system on a plurality of computer systems of a plurality of users of the Kal in view of Wang in view of Kim system and method. The average skilled practitioner would have been motivated to do so for the purpose of distributing tools whereby users might communicate over distances, and/or across geographical, language, contextual, etc. diversities and would have expected only predictable results therefrom.

In re Claim 18, Kal in view of Wang in view of Kim teaches or suggests further comprising backfilling previously unresolved linguistic elements of the data structure based on the identified speech target (Kal: col. 3: ll. 10-33, col. 4: ll. 15-32, cols. 9-10: ll. 17-48, col. 13: ll. 53-57: the system maintains a dictionary based on recognized and/or disambiguated speech); (Wang: ¶ 403, 416, 426, etc.: system resolves ambiguities, requests disambiguation, etc.). It would have been obvious to one of ordinary skill in the art before the effective filing date of the instant application to utilize the Kal, Wang, or Kim taught methods of disambiguation to maintain the Kal in view of Wang in view of Kim dictionary, data structure, etc. The average skilled practitioner would have been motivated to do so for the purpose of backfilling a dictionary, data structure, etc. based on learned words, contexts, etc. and would have expected predictable results therefrom.

In re Claim 19, Kal in view of Wang in view of Kim teaches or suggests further comprising backfilling previously unresolved linguistic elements of the data structure based on the detected topic (Kal: see col. 3: ll. 10-33, col. 4: ll. 15-32, cols. 9-10: ll. 17-48, col. 13: ll. 53-57); (Wang: ¶ 111-115, 157, 213-216, 222, 223, 239-242, 247, 300, 325, 403, 416, 426). The claim is considered obvious over Kal as modified by Wang and Kim as addressed in the base claim and claim 18 supra as it would have been obvious to apply the further teaching of Kal, Wang, and/or Kim to the modified device of Kal, Wang and Kim.

In re Claim 20, Kal discloses a method, data structure, etc. to store semantically resolved language data (Kal: see FIGS. 1A and 1D; col. 2: ll. 9-39; and cols. 2-3: ll. 40-9) in a data structure in a computer system (Kal: see FIGS. 7-8 and cols. 10-13: ll. 49-57: one or more context features 105, acoustic features 107, visual features 109, linguistic features 111, and physical features 113 of a user may be derived from signals obtained by one or more sensors 102 to determine the emotional state of a user(s) during issue of instructions to a computer and augment the response of the computer thereby), and thereby to execute computer-actionable directives in concert with information conveyed in human speech (Kal: FIGS. 1A, 1D; cols. 3-4: ll. 10-56; cols. 9-10: ll. 17-48, col. 13, ll. 15-58: system operates to maintain a user state and optionally execute signal processing, speech recognition, etc. instructions and update a data structure based on determined state and other input data), the method comprising: 
receiving audio data recording speech from one or more speakers (Kal: see FIG. 1A: via sensors 102 and cols. 3-4: ll. 10-56); 
converting the audio data into a linguistic representation comprising words or phonemes corresponding to the recorded speech (Kal: see FIG. 1A: linguistic features 111 and cols. 3-4: ll. 10-56);  
receiving non-linguistic data (Kal: see FIG. 1A: context features 105, acoustic features 107, visual features 109, physical features 113) associated with at least one of the one or more speakers (see cols. 3-4: ll. 10-56); 
detecting a targeted emotion class corresponding to the linguistic representation (Kal: cols. 3-4: ll. 10-56 and cols. 9-10: ll. 62-48); via a sensor-fusion machine-learning model trained previously for target emotion identification (Kal: cols. 2: ll. 32-47; 7: ll. 34-61; 9-10: ll. 62-48; 13: ll. 53-57: multimodal analysis system uses linguistic cues such as the meaning of words and other contextual topics such as game state, play, etc. to determine, classify, etc. an  underlying user emotional state and the relation of the emotional state to the meaning of the utterance), 
to identify the target state by concurrently processing the words or phonemes of the linguistic representation and the non-linguistic data (Kal: id. and see also col. 7: ll. 34-40; FIG. 1A, 1D: via 112’) to identify a target emotion corresponding to the linguistic representation (Kal: cols. 2: ll. 32-47; 7: ll. 34-61; 9-10: ll. 62-48; 13: ll. 53-57); 
recording language data based on the linguistic representation to a data structure (Kal: id.: via determined emotional state 115 and/or change state 110 and in concert with linguistic and semantic data).

Kal does not explicitly teach detecting, identifying, etc. a target computer corresponding to the linguistic representation via a sensor-fusion machine-learning model trained previously for target identification; to thereby identify a target computer from among plural targets of the recorded speech corresponding to the linguistic representation; committing to the data structure language data associated with the identified target computer and based on the linguistic representation; parsing the data structure to identify in the language data one or more of the computer actionable directives actionable by an identified target; and submitting the one or more directives to the target computer for processing. As such, Kal does not explicitly discuss recording language data with respect to the identified target in a data structure based upon the linguistic representation, the language data identifying the at least one of the one or more speakers.

In a related field of endeavor, Wang teaches a method to execute computer-actionable directives conveyed in human speech (Wang: ¶ 61-65, 85, 86; Fig 1: a virtual personal assistant operable to determine user intent, user emotional state and voice characteristics based on analysis of input user speech and other user parameters and to execute commands therein based thereon);
the method comprising: receiving audio data recording speech from one or more speakers (Wang: id.); 
converting the audio data into a linguistic representation comprising words or phonemes corresponding to the recorded speech (Wang: ¶ 61-65, 85, 86, 213-216; Fig 1: system accepts audio input and determines words therein, vocal characteristics thereof);
receiving non-linguistic data associated with at least one of the one or more speakers  (id.  e.g. the determination of voice biometrics, statistics, characteristics etc. as well as facial expressions etc.); via a sensor-fusion machine-learning model trained previously for target identification (id.  e.g. the determination of voice biometrics, characteristics etc. as well as facial expressions etc.),  concurrently processing the words or phonemes of the linguistic representation and the non-linguistic data (Wang: ¶ 85, 111-113, 213-216, 222, 223, 239-242, 245-256; Fig 18: multi-modal interpretation component identifies a speaker and determines intent thereof, the determination of intent utilizing linguistic components of the speech such as speech audio, passphrase, etc. as well as non-linguistic components of the speech such as voice biometrics, statistics and/or other voice characteristics, components, parameters, etc. based on pre trained voice recognition models, etc. the determination of intent encompassing particular topics, learned rules based thereon and user relations thereto and operative to instruct the computer to perform particular commands)
to identify a target computer corresponding to the linguistic representation (Wang: ¶ 63, 113, 199, 201, 424, 473, 528-534, etc.: system operates to determine an object, person, or other target of a command intention in a string of sounds such as “please call John”, executes a command with respect to the target computer, phone, etc. of John and/or determining the target of a term “him” in the command “call him” or to access operations of a service robot) from among plural targets of the recorded speech (id. the computer, phone, etc. target corresponding to the target John, him, etc. and/or access particular target operations of a target service robot); 
committing to a data structure language data associated with the identified target computer thereof, and based on the linguistic representation (Wang: ¶ 85, 111-113, 213-216, 222, 223, 239-242, 245-256, 528-534; Fig 18: system accesses previously trained recognition models with respect to an identified, determined, intent, topics, etc. processes user input language based thereon and updates a structure in memory therewith in this way the system issues a directive to the target computer of John or him to ring, and/or issues a directive to a service robot to perform one or more directives borne in the user language data); parsing the data structure to identify in the language data one or more directives actionable by the target computer (Wang: ¶ 199, 424, 473, 528-534: system performs an action with respect to a recognized target and computer associated therewith such as receiving a phone call; activating functionality of a service robot, etc.); 
and submitting the one or more directives to the target computer for processing (Wang: ¶ 528-534: spoken commands operate particular functionality on the service robot).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the instant application to augment the Kal taught data structure to accommodate the various modalities (facial, voiceprint, body language, ontology, target, etc. recognition, resolution, etc.) of the Wang system and method including the determination of intent, topic, etc., the determination of a command object or target and the execution of an instruction based thereon. The average skilled practitioner would have been motivated to do so for at least the purpose of: optimizing particular signal processing with respect to executing an intent of an identified speaker, with respect to an object, target, etc. thereof and would have expected only predictable results therefrom.

Kal in view of Wang does not explicitly teach committing language data based on the linguistic representation to a data structure, the language data being associated with the target computer.

In a related field of endeavor Kim teaches a system and method operable for receiving audio data recording speech from one or more speakers (Kim: Abstract; ¶ 79-81, Fig 8: system receives speech audio of plural users and determines speech data thereof such as linguistic representations in the form of a speaker identity, topic, target, etc.); 
converting the audio data into a linguistic representation comprising words or phonemes corresponding to the recorded speech (Kim: ¶ 65-68, 70-73; Fig 3, 5: system receives speech of plural users, extracts voice data, user identification etc. thereof, as well as a user contextual data including gaze, activity, etc. and thereby encodes speaker identity, participant id, etc.); 
receiving non-linguistic data associated with at least one of the one or more speakers (Kim: ¶ 65-73; Fig 3-5: system determines plural parameters of contextual data for the plurality of users including gaze direction, participation dynamics, locations, meeting topic, etc.) including data of a target of user attention in concert with data identifying user attention thereto (Kim: ¶ 68-71, 75-81; Fig 4, 8: system determines topical data and a conversational target as well as a device of each/any participant including a target);
and recording language data with respect to the identified target in the data structure, wherein the language data is based upon the linguistic representation (Kim: ¶ 65-68, 70-73; Fig 3, 5: user profiles identifying a user and comprising language data based on speaker identification, speaking dynamics and context, meeting topic, etc. stored, updated, maintained, etc. in a data structure).

It would have been obvious to one of ordinary skill in the art before the effective filing date of the instant application to adopt well-known data structures comprising linguistic, non-linguistic and other user identifying data such as attentional target identifiers such as those utilized by Kim within the Kal in view of Wang system and method. The average skilled practitioner would have been motivated to do so for at least the purpose of persisting, communicating, etc. determined data such as the Kal emotional state data, Wang speaker recognition, intent and target computer data, Kim speaker, topic and target user data etc. and would have expected only predictable results therefrom. It would have been further obvious to one of ordinary skill in the art before the effective filing date of the instant application to adapt the Kal in view of Wang machine learning algorithms to accommodate the Kim data as an input. The average skilled practitioner would have been motivated to do so for the purpose of processing multi-modal data in an analysis system and would have expected only predictable results therefrom.
 
Response to Arguments
Applicant’s arguments, see Claims and Remarks, filed 9/19/22, with respect to the rejection(s) of claim(s) 1, 11, 20 under 35 USC 103 over Kalini-Akbacak in view of Wang in view of Wheeler have been fully considered and are persuasive. Therefore, the rejection has been withdrawn. However, upon further consideration, a new ground(s) of rejection is made in view of Kalini-Akbacak, Wang and Kim. Further, the Examiner has addressed any considerations raised by Applicant’s arguments in the art rejection supra. No claims currently stand allowable.

Conclusion

The prior art made of record and not relied upon is considered pertinent to applicant's disclosure. 
	20140249817 voice recognition system and data structure for establishing, encoding, etc. a user identity, intent, topic, etc.

Any inquiry concerning this communication or earlier communications from the examiner should be directed to PAUL C MCCORD whose telephone number is (571)270-3701. The examiner can normally be reached 730-630 M-F.
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, VIVIAN CHIN, can be reached on (571)272-7848. The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of published or unpublished applications may be obtained from Patent Center. Unpublished application information in Patent Center is available to registered users. To file and manage patent submissions in Patent Center, visit: https://patentcenter.uspto.gov. Visit https://www.uspto.gov/patents/apply/patent-center for more information about Patent Center and https://www.uspto.gov/patents/docx for information about filing in DOCX format. For additional questions, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.





/PAUL C MCCORD/Primary Examiner, Art Unit 2654