DETAILED ACTION
Status of Claims
 	Claims 1-20 are pending in this application, with claims 1, 9 and 13 being independent.
Notice of AIA Status
 	The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.
 	In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
Obligation Under 37 CFR 1.56 – Joint Inventors
 	This application currently names joint inventors. In considering patentability of the claims the examiner presumes that the subject matter of the various claims was commonly owned as of the effective filing date of the claimed invention(s) absent any evidence to the contrary.  Applicant is advised of the obligation under 37 CFR 1.56 to point out the inventor and effective filing dates of each claim that was not commonly owned as of the effective filing date of the later invention in order for the examiner to consider the applicability of 35 U.S.C. 102(b)(2)(C) for any potential 35 U.S.C. 102(a)(2) prior art against the later invention.
Drawings
	 The drawings were received on December 6, 2021.  These drawings are acceptable.
 Claim Objections
 	Claim 6 is objected to because of the following informalities: line 2 of claim 6 recites, “sharing an embedded link to a plurality of users via a network.”  However, this language is vague and indefinite because it can be read in two different ways.  It can mean either: (1) an embedded link to a plurality of users is shared via a network (i.e., an embedded link to a plurality of users), or (2) an embedded link is shared, via a network, with a plurality of users (i.e., sharing, with a plurality of users, an embedded link).   Appropriate correction is required.
  	For the purpose of further examining claim 6 at this time, the examiner will interpret line 2 of claim 6 as meaning: “sharing, with a plurality of users, an embedded link, wherein the embedded link is shared via a network.”
	Claim 13 is objected to because of the following informalities: 
line 4 of claim 13 recites, “embedding a link to the web browser of the user device.”  Appropriate correction is required.
lines 7-8 of claim 13 recite, “transmitting a stream of data from the application representing information to the web browser to generate the virtual character;” however, as recited, the meaning is unclear.  For the purpose of further examining claim 13 at this time, the examiner has interpreted lines 7-8 as meaning: “transmitting, from the application to the web browser, a stream of data representing information to generate the virtual character.”   Appropriate correction is required.
“the threshold number of similar features” (lines 21-22 of claim 13) lacks proper antecedent basis.   As per lines 16-17 of claim 13 (which recite, “a threshold similarity”), there is proper antecedent basis only for “the threshold similarity.”   Appropriate correction is required.
Claim Rejections - 35 USC § 102
 	The following is a quotation of the appropriate paragraphs of 35 U.S.C. 102 that form the basis for the rejections under this section made in this Office action:
A person shall be entitled to a patent unless –

(a)(1) the claimed invention was patented, described in a printed publication, or in public use, on sale or otherwise available to the public before the effective filing date of the claimed invention.

(a)(2) the claimed invention was described in a patent issued under section 151, or in an application for patent published or deemed published under section 122(b), in which the patent or application, as the case may be, names another inventor and was effectively filed before the effective filing date of the claimed invention.

 	Claims 9, 11 and 12 are rejected under 35 U.S.C. 102(a)(2) as being anticipated by SHUKLA et al. (US 2019/0279642, hereinafter “SHUKLA”).
	Regarding claim 9, SHUKLA discloses a device (¶ [0010]: “a method, implemented on a machine having at least one processor, storage,”  ¶ [0011]: “a system”;  ¶ [0136]: “computing device architecture that may be used to realize a specialized system”;  ¶ [0137]: “Computer 2400”) configured to provide a response to a multi-modal input relating to a user captured by the device (¶ [0011]: “a system for user machine dialogue. The system includes a sensor data collection unit configured for receiving an audio signal representing a speech of a user engaged in a dialogue and a visual signal capturing the user uttering the speech, an audio based speech recognition unit configured for obtaining a first speech recognition result by performing audio based speech recognition based on the audio signal, a lip reading based speech recognizer configured for detecting lip movement of the user based on the visual signal and obtaining a second speech recognition result by performing lip reading based speech recognition, and an audio-visual speech recognition integrator configured for generating an integrated speech recognition result based on the first and second speech recognition results.”), the device comprising: 
 	at least one memory (¶ [0010]: “a method, implemented on a machine having at least one processor, storage,”  ¶ [0137]: “program storage and data storage of different forms (e.g., disk 2470, read only memory (ROM) 2430, or random access memory (RAM) 2440), for various data files to be processed and/or communicated by computer 2400, as well as possibly program instructions to be executed by CPU 2420.”) including: 
 	at least two internal models (e.g., ¶ [0042]: “detecting a spoken language based on multiple model based speech recognition”; ¶ [0094]: “sound models 715”; ¶ [0094]: “speech lip movement models 725”; ¶ [0106]: “a lip detection model 930.” ¶ [0110]: “based on, e.g., models 1120 that characterize human speech sound.”  ¶ [0110]: “models that can be used to detect different types of sound in the dialogue scene.”  ¶ [0126]: “the lip shape/sound model(s)”) configured to identify characteristics from multi-modal input information (e.g., ¶ [0110]: “to detect, from the input audio data, sounds that likely correspond to human speech activities based on, e.g., models 1120 that characterize human speech sound.” ¶ [0110]: “models that can be used to detect different types of sound in the dialogue scene.”   ¶ [0095]: “audio cues that reveal human speech activities and video cues related to lip movement that evidences human speech”) (¶ [0094]: “The audio based sound source estimator 710 processes audio data collected from a dialogue scene and estimates one or more sound sources (for speech) based on sound models 715 (e.g., acoustic models for human speech). The visual based sound source estimator 720 is provided for estimating one or more candidate sources (directions in a dialogue scene) of speech activities in a dialogue scene based on visual cues. The visual based sound source estimator 720 processes image data collected from the dialogue scene, analyzes the visual information based on speech lip movement models 725 (e.g., visual models for lip movement in speech in certain languages), and estimates candidate sound source(s) where the human speech is occurring.
The audio based sound source candidates estimated by the audio based sound source estimator 710 and the visual based sound source estimates from 720 are sent, respectively, to the sound source disambiguation unit 730 so that the estimated sound candidates determined based on different cues may be disambiguated to generate estimated source(s) of sound in a dialogue environment.” ¶ [0095]: “an integrated approach by combining audio and video cues, including audio cues that reveal human speech activities and video cues related to lip movement that evidences human speech. In operation, the visual based sound source estimator 720 receives, at 702 of FIG. 7B, image (video) data acquired from the dialogue scene and processes the video data to detect, at 712, lip movement based on speech lip movement models 725 for recognizing speech activities. In some embodiments, the speech lip movement models to be used for the detection may be selected with respect to a certain language.”  ¶ [0120]: “When the audio based speech recognition unit 1530 receives the audio signals from acoustic sensor(s), it performs, at 1630, speech recognition based on speech recognition models 1540”;  ¶ [0120]: “Similarly, when the lip reading based speech recognizer 1550 receives the visual data (video), it performs, at 1650, speech recognition based on lip reading in accordance with lip reading models 1560.”  ¶ [0120]: “Thus, the lip reading based speech recognition unit 1550 performs speech recognition, at 1650, by comparing tracked lip movements (observed in the visual input data) against some lip reading model(s) appropriate for the underlying language for the speech recognition. The appropriate lip reading model may be selected (from the lip reading models 1560) based on, e.g., an input related to language choice.”  ¶ [0126]: “Mapping lip shape and/or lip movement to a sound may involve viseme analysis, where a viseme may correspond to a generic image that is used to describe a particular sound. 
As commonly known, a viseme may be a visual equivalent of a phoneme or acoustic speech sound in a spoken language and can be used by hearing-impaired person to view sounds visually. To derive a viseme, the analysis needed may depend on the underlying spoken language. In the present teaching, the lip shape/sound model(s) from 1960 may be used for determining sounds corresponding to lip shapes. In recognizing visemes associated with a spoken language, an appropriate lip shape/sound model may be selected according to a known current language.”    ¶ [0095]: “automatic speech recognition (ASR)”;  ¶ [0130]: “ In some embodiments, the integration may also be performed at an even lower level. For instance, the integration may be performed based on phonemes estimated based on sound (audio based) or visemes recognized based on lip reading (visual based).  FIG. 21 illustrates an exemplary scheme for integrating audio based speech recognition (ASR) and the lip reading based speech recognition, according to a different embodiment of the present teaching. As shown, speech signal is processed respectively via ASR and video data are processed via lip recognition. In some embodiments, the ASR generates phonemes and the lip reading generates visemes. To integrate the recognition results, comparison is performed between the recognition results. For example, the phonemes from the ASR are converted into visemes and similarity between the visemes converted from phonemes from the ASR and that from the lip reading are assessed. In some embodiments, if they are similar, e.g., the similarity exceeds a certain level, the recognition result from ASR is accepted because it is supported by the lip reading result. If the similarity level of the visemes from ASR and lip reading is below a set level, the visemes may be accepted but the recognition result may be associated with a low confidence score. 
In some embodiments, the automated dialogue companion or the agent may request the user engaged in the dialogue to speak louder so that the next round of recognition may be based on better signals. In some situations, if the similarity is low according to some criterion, the visemes may not be accepted and the automated dialogue companion may react to the situation by letting the user know that what is spoken cannot be discerned and ask the user to say it again.”); 
 	a virtual character knowledge model (e.g., ¶ [0069]: “a dialogue tree of an on-going dialogue”)  including information specific to a virtual character (e.g., ¶ [0069]: “paths which may be taken depending on a response detected from a user”;  NOTE: In other words, the recognized speech (which includes the particular phonemes and visemes) of each response from a user is compared against the paths in the dialogue tree.) (¶ [0069]: “FIG. 4B illustrates a part of a dialogue tree of an on-going dialogue with paths taken based on interactions between the automated companion and a user, according to an embodiment of the present teaching. In this illustrated example, the dialogue management at layer 3 (of the automated companion) may predict multiple paths with which a dialogue, or more generally an interaction, with a user may proceed. In this example, each branch from a node may represent possible responses from a user. As shown in this example, at node 1, the automated companion may face with three separate paths which may be taken depending on a response detected from a user. If the user responds with an affirmative response, dialogue tree 400 may proceed from node 1 to node 2. At node 2, a response may be generated for the automated companion in response to the affirmative response from the user and may then be rendered to the user, which may include audio, visual, textual, haptic, or any combination thereof.”  ¶ [0070]: “If, at node 1, the user responses negatively, the path is for this stage is from node 1 to node 10. If the user responds, at node 1, with a “so-so” response (e.g., not negative but also not positive), dialogue tree 400 may proceed to node 3, at which a response from the automated companion may be rendered and there may be three separate possible responses from the user, “No response,” “Positive Response,” and “Negative response,” corresponding to nodes 5, 6, and 7, respectively.
Depending on the user's actual response with respect to the automated companion's response rendered at node 3, the dialogue management at layer 3 may then follow the dialogue accordingly. For instance, if the user responds at node 3 with a positive response, the automated companion moves to respond to the user at node 6. Similarly, depending on the user's reaction to the automated companion's response at node 6, the user may further respond with an answer that is correct. In this case, the dialogue state moves from node 6 to node 8, etc. In this illustrated example, the dialogue state during this period moved from node 1, to node 3, to node 6, and to node 8. The traverse through nodes 1, 3, 6, and 8 forms a path consistent with the underlying conversation between the automated companion and a user. As seen in FIG. 4B, the path representing the dialogue is represented by the solid lines connecting nodes 1, 3, 6, and 8, whereas the paths skipped during a dialogue is represented by the dashed lines.”      ¶ [0084]: “In each dialogue of a certain topic, the dialogue manager 510 bases its control of the dialogue on relevant dialogue tree(s) that may or may not be associated with the topic (e.g., may inject small talks to enhance engagement). 
To generate a response to a user in a dialogue, the dialogue manager 510 may also consider additional information such as a state of the user, the surrounding of the dialogue scene, the emotion of the user, the estimated mindsets of the user and the agent, and the known preferences of the user (utilities).”  ¶ [0056]: “In some embodiments, such inputs from the user's site and processing results thereof may also be transmitted to the user interaction engine 140 for facilitating the user interaction engine 140 to better understand the specific situation associated with the dialogue so that the user interaction engine 140 may determine the state of the dialogue, emotion/mindset of the user, and to generate a response that is based on the specific situation of the dialogue and the intended purpose of the dialogue (e.g., for teaching a child the English vocabulary). For example, if information received from the user device indicates that the user appears to be bored and become impatient, the user interaction engine 140 may determine to change the state of dialogue to a topic that is of interest of the user (e.g., based on the information from the user information database 130) in order to continue to engage the user in the conversation.”); and 
 	a library of potential actions associated with the virtual character (e.g., in FIG. 4A, the “Database” storing “Character config,” “Voice config”;   ¶ [0068]: “In some embodiments, some information that may be shared via the internal API may be accessed from an external database. For example, certain configurations related to a desired character for the agent device (a duck) may be accessed from, e.g., an open source database, that provide parameters (e.g., parameters to visually render the duck and/or parameters needed to render the speech from the duck).”  and/or ¶ [0087]: “verbal response generation and/or behavior response generation, as depicted in FIG. 5.”  NOTE:  A database may reasonably be interpreted as being “a library.”) to determine an action that matches the selected characteristic (e.g., ¶ [0087]: “To deliver a response, a deliverable form of the response may be generated via, e.g., verbal response generation and/or behavior response generation, as depicted in FIG. 5. Such a response in its determined deliverable form(s) may then be used by a renderer to actual render the response in its intended form(s). For a deliverable form in a natural language, the text of the response may be used to synthesize a speech signal via, e.g., text to speech techniques, in accordance with the delivery parameters (e.g., volume, accent, style, etc.). For any response or part thereof, that is to be delivered in a non-verbal form(s), e.g., with a certain expression, the intended non-verbal expression may be translated into, e.g., via animation, control signals that can be used to control certain parts of the agent device (physical representation of the automated companion) to perform certain mechanical movement to deliver the non-verbal expression of the response, e.g., nodding head, shrug shoulders, or whistle. In some embodiments, to deliver a response, certain software components may be invoked to render a different facial expression of the agent device.
Such rendition(s) of the response may also be simultaneously carried out by the agent (e.g., speak a response with a joking voice and with a big smile on the face of the agent).”   ¶ [0090]: “On the output side of the processing level, when a certain response strategy is determined, such strategy may be translated into specific actions to take by the automated companion to respond to the other party. Such action may be carried out by either deliver some audio response or express certain emotion or attitude via certain gesture. When the response is to be delivered in audio, text with words that need to be spoken are processed by a text to speech module to produce audio signals and such audio signals are then sent to the speakers to render the speech as a response. In some embodiments, the speech generated based on text may be performed in accordance with other parameters, e.g., that may be used to control to generate the speech with certain tones or voices. If the response is to be delivered as a physical action, such as a body movement realized on the automated companion, the actions to be taken may also be instructions to be used to generate such body movement.”)   (¶ [0068]: “In some embodiments, through the internal API, various layers (2-5) may access information created by or stored at other layers to support the processing. Such information may include common configuration to be applied to a dialogue (e.g., character of the agent device is an avatar, voice preferred, or a virtual environment to be created for the dialogue, etc.), a current state of the dialogue, a current dialogue history, known user preferences, estimated user intent/emotion/mindset, etc. In some embodiments, some information that may be shared via the internal API may be accessed from an external database. 
For example, certain configurations related to a desired character for the agent device (a duck) may be accessed from, e.g., an open source database, that provide parameters (e.g., parameters to visually render the duck and/or parameters needed to render the speech from the duck).”  ¶ [0069]: “At node 2, a response may be generated for the automated companion in response to the affirmative response from the user and may then be rendered to the user, which may include audio, visual, textual, haptic, or any combination thereof.”  ¶ [0081]: “Layer 1 also handles information rendering of a response from the automated dialogue companion to a user. In some embodiments, the rendering is performed by an agent device, e.g., 160-a and examples of such rendering include speech, expression which may be facial or physical acts performed. For instance, an agent device may render a text string received from the user interaction engine 140 (as a response to the user) to speech so that the agent device may utter the response to the user. In some embodiments, the text string may be sent to the agent device with additional rendering instructions such as volume, tone, pitch, etc. which may be used to convert the text string into a sound wave corresponding to an utterance of the content in a certain manner. In some embodiments, a response to be delivered to a user may also include animation, e.g., utter a response with an attitude which may be delivered via, e.g., a facial expression or a physical act such as raising one arm, etc. In some embodiments, the agent may be implemented as an application on a user device. In this situation, rendering of a response from the automated dialogue companion is implemented via the user device, e.g., 110-a (not shown in FIG. 5).”   ¶ [0083]: “The multimodal data understanding generated at layer 2 may be used by DM 510 to determine how to respond. 
To enhance engagement and user experience, the DM 510 may also determine a response based on the estimated mindsets of the user and of the agent from layer 4 as well as the utilities of the user engaged in the dialogue from layer 5. The mindsets of the parties involved in a dialogue may be estimated based on information from Layer 2 (e.g., estimated emotion of a user) and the progress of the dialogue. In some embodiments, the mindsets of a user and of an agent may be estimated dynamically during the course of a dialogue and such estimated mindsets may then be used to learn, together with other data, utilities of users. The learned utilities represent preferences of users in different dialogue scenarios and are estimated based on historic dialogues and the outcomes thereof.”   ¶ [0084]: “In each dialogue of a certain topic, the dialogue manager 510 bases its control of the dialogue on relevant dialogue tree(s) that may or may not be associated with the topic (e.g., may inject small talks to enhance engagement). To generate a response to a user in a dialogue, the dialogue manager 510 may also consider additional information such as a state of the user, the surrounding of the dialogue scene, the emotion of the user, the estimated mindsets of the user and the agent, and the known preferences of the user (utilities).”  ¶ [0085]: “An output of DM 510 corresponds to an accordingly determined response to the user. To deliver a response to the user, the DM 510 may also formulate a way that the response is to be delivered. 
The form in which the response is to be delivered may be determined based on information from multiple sources, e.g., the user's emotion (e.g., if the user is a child who is not happy, the response may be rendered in a gentle voice), the user's utility (e.g., the user may prefer speech in certain accent similar to his parents'), or the surrounding environment that the user is in (e.g., noisy place so that the response needs to be delivered in a high volume). DM 510 may output the response determined together with such delivery parameters.”  ¶ [0086]: “In some embodiments, the delivery of such determined response is achieved by generating the deliverable form(s) of each response in accordance with various parameters associated with the response. In a general case, a response is delivered in the form of speech in some natural language. A response may also be delivered in speech coupled with a particular nonverbal expression as a part of the delivered response, such as a nod, a shake of the head, a blink of the eyes, or a shrug. There may be other forms of deliverable form of a response that is acoustic but not verbal, e.g., a whistle.”  ¶ [0087]: “To deliver a response, a deliverable form of the response may be generated via, e.g., verbal response generation and/or behavior response generation, as depicted in FIG. 5. Such a response in its determined deliverable form(s) may then be used by a renderer to actual render the response in its intended form(s). For a deliverable form in a natural language, the text of the response may be used to synthesize a speech signal via, e.g., text to speech techniques, in accordance with the delivery parameters (e.g., volume, accent, style, etc.). 
For any response or part thereof, that is to be delivered in a non-verbal form(s), e.g., with a certain expression, the intended non-verbal expression may be translated into, e.g., via animation, control signals that can be used to control certain parts of the agent device (physical representation of the automated companion) to perform certain mechanical movement to deliver the non-verbal expression of the response, e.g., nodding head, shrug shoulders, or whistle. In some embodiments, to deliver a response, certain software components may be invoked to render a different facial expression of the agent device. Such rendition(s) of the response may also be simultaneously carried out by the agent (e.g., speak a response with a joking voice and with a big smile on the face of the agent).”),
 each action is associated with an animation to be performed by the virtual character and associated audio (¶ [0081]: “Layer 1 also handles information rendering of a response from the automated dialogue companion to a user. In some embodiments, the rendering is performed by an agent device, e.g., 160-a and examples of such rendering include speech, expression which may be facial or physical acts performed. For instance, an agent device may render a text string received from the user interaction engine 140 (as a response to the user) to speech so that the agent device may utter the response to the user. In some embodiments, the text string may be sent to the agent device with additional rendering instructions such as volume, tone, pitch, etc. which may be used to convert the text string into a sound wave corresponding to an utterance of the content in a certain manner. In some embodiments, a response to be delivered to a user may also include animation, e.g., utter a response with an attitude which may be delivered via, e.g., a facial expression or a physical act such as raising one arm, etc.”   ¶ [0086]: “In some embodiments, the delivery of such determined response is achieved by generating the deliverable form(s) of each response in accordance with various parameters associated with the response. In a general case, a response is delivered in the form of speech in some natural language. A response may also be delivered in speech coupled with a particular nonverbal expression as a part of the delivered response, such as a nod, a shake of the head, a blink of the eyes, or a shrug.”    ¶ [0087]: “To deliver a response, a deliverable form of the response may be generated via, e.g., verbal response generation and/or behavior response generation, as depicted in FIG. 5. Such a response in its determined deliverable form(s) may then be used by a renderer to actual render the response in its intended form(s). 
For a deliverable form in a natural language, the text of the response may be used to synthesize a speech signal via, e.g., text to speech techniques, in accordance with the delivery parameters (e.g., volume, accent, style, etc.). For any response or part thereof, that is to be delivered in a non-verbal form(s), e.g., with a certain expression, the intended non-verbal expression may be translated into, e.g., via animation, control signals that can be used to control certain parts of the agent device (physical representation of the automated companion) to perform certain mechanical movement to deliver the non-verbal expression of the response, e.g., nodding head, shrug shoulders, or whistle. In some embodiments, to deliver a response, certain software components may be invoked to render a different facial expression of the agent device. Such rendition(s) of the response may also be simultaneously carried out by the agent (e.g., speak a response with a joking voice and with a big smile on the face of the agent).”); and 
 	at least one processor (¶ [0137]: “Computer 2400 also includes a central processing unit (CPU) 2420, in the form of one or more processors, for executing program instructions.”) configured to: 
 	receive multi-modal input information (¶ [0046]: “continuously monitored multimodal data indicative of the surrounding of the dialogue, adaptively estimating the mindset/emotion/intent of the participants of the dialogue, and adaptively adjust the conversation strategy based on the dynamically changing information/estimates/contextual information.”) from a device (¶ [0054]: “communication may involve data in multiple modalities such as audio, video, text, etc. Via a user device, a user can send data (e.g., a request, audio signal representing an utterance of the user, or a video of the scene surrounding the user) and/or receive data (e.g., text or audio response from an agent device). In some embodiments, user data in multiple modalities, upon being received by an agent device or the user interaction engine 140, may be analyzed to understand the human user's speech or gesture so that the user's emotion or intent may be estimated and used to determine a response to the user.”  ¶ [0058]: “In some embodiment, the multi-modal sensor data may first be processed on the user device and important features in different modalities may be extracted and sent to the user interaction engine 140 so that dialogue may be controlled with an understanding of the context. In some embodiments, the raw multi-modal sensor data may be sent directly to the user interaction engine 140 for processing.”  ¶ [0066]: “multi-modal data may be acquired via sensors such as camera, microphone, keyboard, display, speakers, chat bubble, renderer, or other user interface elements.”   ¶ [0119]: “the audio based speech recognition unit 930 and the lip reading based speech recognizer 950 may respectively receive audio and visual signals as input”)
 including at least one of
 speech information (Abstract: “An audio signal is received that represents a speech of a user engaged in a dialogue.  A visual signal is received that captures the user uttering the speech.  A first speech recognition result is obtained by performing audio based speech recognition based on the audio signal. Based on the visual signal, lip movement of the user is detected and a second speech recognition result is obtained by performing lip reading based speech recognition.”    ¶ [0058]: “the user device 110-a may continually collect multi-modal sensor data related to the user and his/her surroundings, which may be analyzed to detect any information related to the dialogue and used to intelligently control the dialogue in an adaptive manner.”  ¶ [0071]: “To manage the dialogue, the automated companion may activate different sensors during the dialogue to make observation of the user and the surrounding environment. For example, the agent device may acquire multi-modal data about the surrounding environment where the user is in. Such multi-modal data may include audio, visual, or text data. For example, visual data may capture the facial expression of the user. The visual data may also reveal contextual information surrounding the scene of the conversation. For instance, a picture of the scene may reveal that there is a basketball, a table, and a chair, which provides information about the environment and may be leveraged in dialogue management to enhance engagement of the user. 
Audio data may capture not only the speech response of the user but also other peripheral information such as the tone of the response, the manner by which the user utters the response, or the accent of the user.”  ¶ [0110]: “The speech sound detector 1110 is provided to detect, from the input audio data, sounds that likely correspond to human speech activities based on, e.g., models 1120 that characterize human speech sound.”  ¶ [0120]: “Acoustic input data acquired by selected acoustic sensor(s) may then be sent to the audio based speech recognition unit 1530 for speech recognition based on audio data.”),
 facial expression information (Abstract: “A visual signal is received that captures the user uttering the speech.”  Abstract: “Based on the visual signal, lip movement of the user is detected and a second speech recognition result is obtained by performing lip reading based speech recognition.”    ¶ [0062]: “The automated companion may use a camera (320) to observe the user's presence, facial expressions, direction of gaze, surroundings, etc.”  ¶ [0071]: “To manage the dialogue, the automated companion may activate different sensors during the dialogue to make observation of the user and the surrounding environment. For example, the agent device may acquire multi-modal data about the surrounding environment where the user is in. Such multi-modal data may include audio, visual, or text data. For example, visual data may capture the facial expression of the user. The visual data may also reveal contextual information surrounding the scene of the conversation. For instance, a picture of the scene may reveal that there is a basketball, a table, and a chair, which provides information about the environment and may be leveraged in dialogue management to enhance engagement of the user. Audio data may capture not only the speech response of the user but also other peripheral information such as the tone of the response, the manner by which the user utters the response, or the accent of the user.”  ¶ [0120]: “Visual input data acquired by selected visual sensor may then be sent to the lip reading based speech recognition unit 1550 for speech recognition based on visual data.”  ¶ [0120]: “the lip reading based speech recognition unit 1550 performs speech recognition, at 1650, by comparing tracked lip movements (observed in the visual input data) against some lip reading model(s) appropriate for the underlying language for the speech recognition.”),
 and environmental information representing an environment (¶ [0051]: “During a conversation, an agent device may include one or more input mechanisms (e.g., cameras, microphones, touch screens, buttons, etc.) that allow the agent device to capture inputs related to the user or the local environment associated with the conversation. Such inputs may assist the automated companion to develop an understanding of the atmosphere surrounding the conversation (e.g., movements of the user, sound of the environment) and the mindset of the human conversant (e.g., user picks up a ball which may indicates that the user is bored) in order to enable the automated companion to react accordingly and conduct the conversation in a manner that will keep the user interested and engaging.”   ¶ [0055]: “Knowing that the surrounding environment of the user (based on visual information from the user device) shows green trees and lawns,”  ¶ [0058]: “the user device 110-a may continually collect multi-modal sensor data related to the user and his/her surroundings, which may be analyzed to detect any information related to the dialogue and used to intelligently control the dialogue in an adaptive manner.”   ¶ [0071]: “To manage the dialogue, the automated companion may activate different sensors during the dialogue to make observation of the user and the surrounding environment. For example, the agent device may acquire multi-modal data about the surrounding environment where the user is in. Such multi-modal data may include audio, visual, or text data. For example, visual data may capture the facial expression of the user. The visual data may also reveal contextual information surrounding the scene of the conversation. For instance, a picture of the scene may reveal that there is a basketball, a table, and a chair, which provides information about the environment and may be leveraged in dialogue management to enhance engagement of the user. 
Audio data may capture not only the speech response of the user but also other peripheral information such as the tone of the response, the manner by which the user utters the response, or the accent of the user.”   ¶ [0110]: “In some embodiments, depending on application needs, it is possible to also detect other types of sounds such as environmental sounds (beach, street, sports center, etc.), special event sounds (explosion, fire alarm, alerts, etc.). In this case, the 1120 may also include models that can be used to detect different types of sound in the dialogue scene.”); 
 	inspect the characteristics identified by the at least two internal models to determine (¶ [0130]: “similarity between the visemes converted from phonemes from the ASR and that from the lip reading are assessed.”) whether a first identified characteristic (e.g., ¶ [0130]: “phonemes estimated based on sound (audio based)”; ¶ [0130]: “ASR generates phonemes”) is within a threshold similarity (¶ [0130]: “the phonemes from the ASR are converted into visemes and similarity between the visemes converted from phonemes from the ASR and that from the lip reading are assessed.”  ¶ [0130]: “if they are similar, e.g., the similarity exceeds a certain level,”  ¶ [0130]: “If the similarity level of the visemes from ASR and lip reading is below a set level”) to a second identified characteristic (¶ [0130]: “visemes recognized based on lip reading (visual based)”;  ¶ [0130]: “the lip reading generates visemes”) (¶ [0130]: “In some embodiments, the integration may also be performed at an even lower level. For instance, the integration may be performed based on phonemes estimated based on sound (audio based) or visemes recognized based on lip reading (visual based).  FIG. 21 illustrates an exemplary scheme for integrating audio based speech recognition (ASR) and the lip reading based speech recognition, according to a different embodiment of the present teaching. As shown, speech signal is processed respectively via ASR and video data are processed via lip recognition. In some embodiments, the ASR generates phonemes and the lip reading generates visemes. To integrate the recognition results, comparison is performed between the recognition results. For example, the phonemes from the ASR are converted into visemes and similarity between the visemes converted from phonemes from the ASR and that from the lip reading are assessed. 
In some embodiments, if they are similar, e.g., the similarity exceeds a certain level, the recognition result from ASR is accepted because it is supported by the lip reading result. If the similarity level of the visemes from ASR and lip reading is below a set level, the visemes may be accepted but the recognition result may be associated with a low confidence score. In some embodiments, the automated dialogue companion or the agent may request the user engaged in the dialogue to speak louder so that the next round of recognition may be based on better signals. In some situations, if the similarity is low according to some criterion, the visemes may not be accepted and the automated dialogue companion may react to the situation by letting the user know that what is spoken cannot be discerned and ask the user to say it again.”); 
 	compare the first identified characteristic (claim 2: “the first speech recognition result includes a plurality of phonemes”) and the second identified characteristic (claim 2: “the second speech recognition result includes a plurality of visemes.”) (¶ [0130]: “To integrate the recognition results, comparison is performed between the recognition results. For example, the phonemes from the ASR are converted into visemes and similarity between the visemes converted from phonemes from the ASR and that from the lip reading are assessed. In some embodiments, if they are similar, e.g., the similarity exceeds a certain level, the recognition result from ASR is accepted because it is supported by the lip reading result.” ¶ [0069]: “responses from a user”  NOTE: In other words, a determined response from a user comprises the integrated speech recognition results having the identified similar phonemes and visemes.) against the virtual character knowledge model (e.g., ¶ [0069]: “a dialogue tree of an on-going dialogue”; ¶ [0069]: “paths which may be taken depending on a response detected from a user”;  NOTE: In other words, the recognized speech (which includes the particular phonemes and visemes) of each response from a user is compared against the paths in the dialogue tree.) (¶ [0069]: “FIG. 4B illustrates a part of a dialogue tree of an on-going dialogue with paths taken based on interactions between the automated companion and a user, according to an embodiment of the present teaching. In this illustrated example, the dialogue management at layer 3 (of the automated companion) may predict multiple paths with which a dialogue, or more generally an interaction, with a user may proceed. In this example, each branch from a node may represent possible responses from a user. As shown in this example, at node 1, the automated companion may face with three separate paths which may be taken depending on a response detected from a user. 
If the user responds with an affirmative response, dialogue tree 400 may proceed from node 1 to node 2. At node 2, a response may be generated for the automated companion in response to the affirmative response from the user and may then be rendered to the user, which may include audio, visual, textual, haptic, or any combination thereof.”  ¶ [0070]: “If, at node 1, the user responses negatively, the path is for this stage is from node 1 to node 10. If the user responds, at node 1, with a “so-so” response (e.g., not negative but also not positive), dialogue tree 400 may proceed to node 3, at which a response from the automated companion may be rendered and there may be three separate possible responses from the user, “No response,” “Positive Response,” and “Negative response,” corresponding to nodes 5, 6, and 7, respectively. Depending on the user's actual response with respect to the automated companion's response rendered at node 3, the dialogue management at layer 3 may then follow the dialogue accordingly. For instance, if the user responds at node 3 with a positive response, the automated companion moves to respond to the user at node 6. Similarly, depending on the user's reaction to the automated companion's response at node 6, the user may further respond with an answer that is correct. In this case, the dialogue state moves from node 6 to node 8, etc. In this illustrated example, the dialogue state during this period moved from node 1, to node 3, to node 6, and to node 8. The traverse through nodes 1, 3, 6, and 8 forms a path consistent with the underlying conversation between the automated companion and a user. As seen in FIG. 
4B, the path representing the dialogue is represented by the solid lines connecting nodes 1, 3, 6, and 8, whereas the paths skipped during a dialogue is represented by the dashed lines.”      ¶ [0084]: “In each dialogue of a certain topic, the dialogue manager 510 bases its control of the dialogue on relevant dialogue tree(s) that may or may not be associated with the topic (e.g., may inject small talks to enhance engagement). To generate a response to a user in a dialogue, the dialogue manager 510 may also consider additional information such as a state of the user, the surrounding of the dialogue scene, the emotion of the user, the estimated mindsets of the user and the agent, and the known preferences of the user (utilities).”  ¶ [0056]: “In some embodiments, such inputs from the user's site and processing results thereof may also be transmitted to the user interaction engine 140 for facilitating the user interaction engine 140 to better understand the specific situation associated with the dialogue so that the user interaction engine 140 may determine the state of the dialogue, emotion/mindset of the user, and to generate a response that is based on the specific situation of the dialogue and the intended purpose of the dialogue (e.g., for teaching a child the English vocabulary). For example, if information received from the user device indicates that the user appears to be bored and become impatient, the user interaction engine 140 may determine to change the state of dialogue to a topic that is of interest of the user (e.g., based on the information from the user information database 130) in order to continue to engage the user in the conversation.”)
 to identify a selected characteristic (¶ [0054]: “to determine a response to the user.”  ¶ [0056]: “determine the state of the dialogue, emotion/mindset of the user, and to generate a response that is based on the specific situation of the dialogue and the intended purpose of the dialogue”) (¶ [0056]: “In some embodiments, such inputs from the user's site and processing results thereof may also be transmitted to the user interaction engine 140 for facilitating the user interaction engine 140 to better understand the specific situation associated with the dialogue so that the user interaction engine 140 may determine the state of the dialogue, emotion/mindset of the user, and to generate a response that is based on the specific situation of the dialogue and the intended purpose of the dialogue (e.g., for teaching a child the English vocabulary). For example, if information received from the user device indicates that the user appears to be bored and become impatient, the user interaction engine 140 may determine to change the state of dialogue to a topic that is of interest of the user (e.g., based on the information from the user information database 130) in order to continue to engage the user in the conversation.”  ¶ [0067]: “The estimated mindsets of parties, whether related to humans or the automated companion (machine), may be relied on by the dialogue management at layer 3, to determine, e.g., how to carry on a conversation with a human conversant. How each dialogue progresses often represent a human user's preferences. Such preferences may be captured dynamically during the dialogue at utilities (layer 5). As shown in FIG. 4A, utilities at layer 5 represent evolving states that are indicative of parties' evolving preferences, which can also be used by the dialogue management at layer 3 to decide the appropriate or intelligent way to carry on the interaction.”   ¶ [0069]: “FIG. 
4B illustrates a part of a dialogue tree of an on-going dialogue with paths taken based on interactions between the automated companion and a user, according to an embodiment of the present teaching. In this illustrated example, the dialogue management at layer 3 (of the automated companion) may predict multiple paths with which a dialogue, or more generally an interaction, with a user may proceed. In this example, each node may represent a point of the current state of the dialogue and each branch from a node may represent possible responses from a user. As shown in this example, at node 1, the automated companion may face with three separate paths which may be taken depending on a response detected from a user. If the user responds with an affirmative response, dialogue tree 400 may proceed from node 1 to node 2. At node 2, a response may be generated for the automated companion in response to the affirmative response from the user and may then be rendered to the user, which may include audio, visual, textual, haptic, or any combination thereof.”   ¶ [0084]: “In each dialogue of a certain topic, the dialogue manager 510 bases its control of the dialogue on relevant dialogue tree(s) that may or may not be associated with the topic (e.g., may inject small talks to enhance engagement). To generate a response to a user in a dialogue, the dialogue manager 510 may also consider additional information such as a state of the user, the surrounding of the dialogue scene, the emotion of the user, the estimated mindsets of the user and the agent, and the known preferences of the user (utilities).”  ¶ [0085]: “An output of DM 510 corresponds to an accordingly determined response to the user. To deliver a response to the user, the DM 510 may also formulate a way that the response is to be delivered. 
The form in which the response is to be delivered may be determined based on information from multiple sources, e.g., the user's emotion (e.g., if the user is a child who is not happy, the response may be rendered in a gentle voice), the user's utility (e.g., the user may prefer speech in certain accent similar to his parents'), or the surrounding environment that the user is in (e.g., noisy place so that the response needs to be delivered in a high volume). DM 510 may output the response determined together with such delivery parameters.”  ¶ [0130]: “In some embodiments, the integration may also be performed at an even lower level. For instance, the integration may be performed based on phonemes estimated based on sound (audio based) or visemes recognized based on lip reading (visual based).  FIG. 21 illustrates an exemplary scheme for integrating audio based speech recognition (ASR) and the lip reading based speech recognition, according to a different embodiment of the present teaching. As shown, speech signal is processed respectively via ASR and video data are processed via lip recognition. In some embodiments, the ASR generates phonemes and the lip reading generates visemes. To integrate the recognition results, comparison is performed between the recognition results. For example, the phonemes from the ASR are converted into visemes and similarity between the visemes converted from phonemes from the ASR and that from the lip reading are assessed. In some embodiments, if they are similar, e.g., the similarity exceeds a certain level, the recognition result from ASR is accepted because it is supported by the lip reading result. If the similarity level of the visemes from ASR and lip reading is below a set level, the visemes may be accepted but the recognition result may be associated with a low confidence score. 
In some embodiments, the automated dialogue companion or the agent may request the user engaged in the dialogue to speak louder so that the next round of recognition may be based on better signals. In some situations, if the similarity is low according to some criterion, the visemes may not be accepted and the automated dialogue companion may react to the situation by letting the user know that what is spoken cannot be discerned and ask the user to say it again.”); 
 	determine an action that matches the selected characteristic (e.g., ¶ [0087]: “To deliver a response, a deliverable form of the response may be generated via, e.g., verbal response generation and/or behavior response generation, as depicted in FIG. 5. Such a response in its determined deliverable form(s) may then be used by a renderer to actual render the response in its intended form(s). For a deliverable form in a natural language, the text of the response may be used to synthesize a speech signal via, e.g., text to speech techniques, in accordance with the delivery parameters (e.g., volume, accent, style, etc.). For any response or part thereof, that is to be delivered in a non-verbal form(s), e.g., with a certain expression, the intended non-verbal expression may be translated into, e.g., via animation, control signals that can be used to control certain parts of the agent device (physical representation of the automated companion) to perform certain mechanical movement to deliver the non-verbal expression of the response, e.g., nodding head, shrug shoulders, or whistle. In some embodiments, to deliver a response, certain software components may be invoked to render a different facial expression of the agent device. Such rendition(s) of the response may also be simultaneously carried out by the agent (e.g., speak a response with a joking voice and with a big smile on the face of the agent).”   ¶ [0090]: “On the output side of the processing level, when a certain response strategy is determined, such strategy may be translated into specific actions to take by the automated companion to respond to the other party. Such action may be carried out by either deliver some audio response or express certain emotion or attitude via certain gesture. 
When the response is to be delivered in audio, text with words that need to be spoken are processed by a text to speech module to produce audio signals and such audio signals are then sent to the speakers to render the speech as a response. In some embodiments, the speech generated based on text may be performed in accordance with other parameters, e.g., that may be used to control to generate the speech with certain tones or voices. If the response is to be delivered as a physical action, such as a body movement realized on the automated companion, the actions to be taken may also be instructions to be used to generate such body movement.”) by inspecting the library of potential actions associated with the virtual character (e.g., in FIG. 4A, the “Database” storing “Character config,” “Voice config”;   ¶ [0068]: “In some embodiments, some information that may be shared via the internal API may be accessed from an external database. For example, certain configurations related to a desired character for the agent device (a duck) may be accessed from, e.g., an open source database, that provide parameters (e.g., parameters to visually render the duck and/or parameters needed to render the speech from the duck).”  and/or ¶ [0087]: “verbal response generation and/or behavior response generation, as depicted in FIG. 5.”  NOTE: A database may reasonably be interpreted as “a library.”)  (¶ [0068]: “In some embodiments, through the internal API, various layers (2-5) may access information created by or stored at other layers to support the processing. Such information may include common configuration to be applied to a dialogue (e.g., character of the agent device is an avatar, voice preferred, or a virtual environment to be created for the dialogue, etc.), a current state of the dialogue, a current dialogue history, known user preferences, estimated user intent/emotion/mindset, etc. 
In some embodiments, some information that may be shared via the internal API may be accessed from an external database. For example, certain configurations related to a desired character for the agent device (a duck) may be accessed from, e.g., an open source database, that provide parameters (e.g., parameters to visually render the duck and/or parameters needed to render the speech from the duck).”  ¶ [0069]: “At node 2, a response may be generated for the automated companion in response to the affirmative response from the user and may then be rendered to the user, which may include audio, visual, textual, haptic, or any combination thereof.”  ¶ [0081]: “Layer 1 also handles information rendering of a response from the automated dialogue companion to a user. In some embodiments, the rendering is performed by an agent device, e.g., 160-a and examples of such rendering include speech, expression which may be facial or physical acts performed. For instance, an agent device may render a text string received from the user interaction engine 140 (as a response to the user) to speech so that the agent device may utter the response to the user. In some embodiments, the text string may be sent to the agent device with additional rendering instructions such as volume, tone, pitch, etc. which may be used to convert the text string into a sound wave corresponding to an utterance of the content in a certain manner. In some embodiments, a response to be delivered to a user may also include animation, e.g., utter a response with an attitude which may be delivered via, e.g., a facial expression or a physical act such as raising one arm, etc. In some embodiments, the agent may be implemented as an application on a user device. In this situation, rendering of a response from the automated dialogue companion is implemented via the user device, e.g., 110-a (not shown in FIG. 
5).”   ¶ [0083]: “The multimodal data understanding generated at layer 2 may be used by DM 510 to determine how to respond. To enhance engagement and user experience, the DM 510 may also determine a response based on the estimated mindsets of the user and of the agent from layer 4 as well as the utilities of the user engaged in the dialogue from layer 5. The mindsets of the parties involved in a dialogue may be estimated based on information from Layer 2 (e.g., estimated emotion of a user) and the progress of the dialogue. In some embodiments, the mindsets of a user and of an agent may be estimated dynamically during the course of a dialogue and such estimated mindsets may then be used to learn, together with other data, utilities of users. The learned utilities represent preferences of users in different dialogue scenarios and are estimated based on historic dialogues and the outcomes thereof.”   ¶ [0084]: “In each dialogue of a certain topic, the dialogue manager 510 bases its control of the dialogue on relevant dialogue tree(s) that may or may not be associated with the topic (e.g., may inject small talks to enhance engagement). To generate a response to a user in a dialogue, the dialogue manager 510 may also consider additional information such as a state of the user, the surrounding of the dialogue scene, the emotion of the user, the estimated mindsets of the user and the agent, and the known preferences of the user (utilities).”  ¶ [0085]: “An output of DM 510 corresponds to an accordingly determined response to the user. To deliver a response to the user, the DM 510 may also formulate a way that the response is to be delivered. 
The form in which the response is to be delivered may be determined based on information from multiple sources, e.g., the user's emotion (e.g., if the user is a child who is not happy, the response may be rendered in a gentle voice), the user's utility (e.g., the user may prefer speech in certain accent similar to his parents'), or the surrounding environment that the user is in (e.g., noisy place so that the response needs to be delivered in a high volume). DM 510 may output the response determined together with such delivery parameters.”  ¶ [0086]: “In some embodiments, the delivery of such determined response is achieved by generating the deliverable form(s) of each response in accordance with various parameters associated with the response. In a general case, a response is delivered in the form of speech in some natural language. A response may also be delivered in speech coupled with a particular nonverbal expression as a part of the delivered response, such as a nod, a shake of the head, a blink of the eyes, or a shrug. There may be other forms of deliverable form of a response that is acoustic but not verbal, e.g., a whistle.”  ¶ [0087]: “To deliver a response, a deliverable form of the response may be generated via, e.g., verbal response generation and/or behavior response generation, as depicted in FIG. 5. Such a response in its determined deliverable form(s) may then be used by a renderer to actual render the response in its intended form(s). For a deliverable form in a natural language, the text of the response may be used to synthesize a speech signal via, e.g., text to speech techniques, in accordance with the delivery parameters (e.g., volume, accent, style, etc.). 
For any response or part thereof, that is to be delivered in a non-verbal form(s), e.g., with a certain expression, the intended non-verbal expression may be translated into, e.g., via animation, control signals that can be used to control certain parts of the agent device (physical representation of the automated companion) to perform certain mechanical movement to deliver the non-verbal expression of the response, e.g., nodding head, shrug shoulders, or whistle. In some embodiments, to deliver a response, certain software components may be invoked to render a different facial expression of the agent device. Such rendition(s) of the response may also be simultaneously carried out by the agent (e.g., speak a response with a joking voice and with a big smile on the face of the agent).”),
 the action including audio to be outputted on the device (¶ [0081]: “Layer 1 also handles information rendering of a response from the automated dialogue companion to a user. In some embodiments, the rendering is performed by an agent device, e.g., 160-a and examples of such rendering include speech, expression which may be facial or physical acts performed. For instance, an agent device may render a text string received from the user interaction engine 140 (as a response to the user) to speech so that the agent device may utter the response to the user. In some embodiments, the text string may be sent to the agent device with additional rendering instructions such as volume, tone, pitch, etc. which may be used to convert the text string into a sound wave corresponding to an utterance of the content in a certain manner. In some embodiments, a response to be delivered to a user may also include animation, e.g., utter a response with an attitude which may be delivered via, e.g., a facial expression or a physical act such as raising one arm, etc.”   ¶ [0086]: “In some embodiments, the delivery of such determined response is achieved by generating the deliverable form(s) of each response in accordance with various parameters associated with the response. In a general case, a response is delivered in the form of speech in some natural language. A response may also be delivered in speech coupled with a particular nonverbal expression as a part of the delivered response, such as a nod, a shake of the head, a blink of the eyes, or a shrug.”    ¶ [0087]: “To deliver a response, a deliverable form of the response may be generated via, e.g., verbal response generation and/or behavior response generation, as depicted in FIG. 5. Such a response in its determined deliverable form(s) may then be used by a renderer to actual render the response in its intended form(s). 
For a deliverable form in a natural language, the text of the response may be used to synthesize a speech signal via, e.g., text to speech techniques, in accordance with the delivery parameters (e.g., volume, accent, style, etc.). For any response or part thereof, that is to be delivered in a non-verbal form(s), e.g., with a certain expression, the intended non-verbal expression may be translated into, e.g., via animation, control signals that can be used to control certain parts of the agent device (physical representation of the automated companion) to perform certain mechanical movement to deliver the non-verbal expression of the response, e.g., nodding head, shrug shoulders, or whistle. In some embodiments, to deliver a response, certain software components may be invoked to render a different facial expression of the agent device. Such rendition(s) of the response may also be simultaneously carried out by the agent (e.g., speak a response with a joking voice and with a big smile on the face of the agent).”); and 
 	output the audio on the device (¶ [0062]: “The exemplary automated companion 160-a as shown in FIG. 3B may also be controlled to “speak” via a speaker (330).”  ¶ [0081]: “Layer 1 also handles information rendering of a response from the automated dialogue companion to a user. In some embodiments, the rendering is performed by an agent device, e.g., 160-a and examples of such rendering include speech, expression which may be facial or physical acts performed. For instance, an agent device may render a text string received from the user interaction engine 140 (as a response to the user) to speech so that the agent device may utter the response to the user. In some embodiments, the text string may be sent to the agent device with additional rendering instructions such as volume, tone, pitch, etc. which may be used to convert the text string into a sound wave corresponding to an utterance of the content in a certain manner. In some embodiments, a response to be delivered to a user may also include animation, e.g., utter a response with an attitude which may be delivered via, e.g., a facial expression or a physical act such as raising one arm, etc. In some embodiments, the agent may be implemented as an application on a user device. In this situation, rendering of a response from the automated dialogue companion is implemented via the user device, e.g., 110-a (not shown in FIG. 5).”  ¶ [0086]: “In a general case, a response is delivered in the form of speech in some natural language. A response may also be delivered in speech coupled with a particular nonverbal expression as a part of the delivered response, such as a nod, a shake of the head, a blink of the eyes, or a shrug.”    ¶ [0087]: “To deliver a response, a deliverable form of the response may be generated via, e.g., verbal response generation and/or behavior response generation, as depicted in FIG. 5. 
Such a response in its determined deliverable form(s) may then be used by a renderer to actual render the response in its intended form(s). For a deliverable form in a natural language, the text of the response may be used to synthesize a speech signal via, e.g., text to speech techniques, in accordance with the delivery parameters (e.g., volume, accent, style, etc.). For any response or part thereof, that is to be delivered in a non-verbal form(s), e.g., with a certain expression, the intended non-verbal expression may be translated into, e.g., via animation, control signals that can be used to control certain parts of the agent device (physical representation of the automated companion) to perform certain mechanical movement to deliver the non-verbal expression of the response, e.g., nodding head, shrug shoulders, or whistle. In some embodiments, to deliver a response, certain software components may be invoked to render a different facial expression of the agent device. Such rendition(s) of the response may also be simultaneously carried out by the agent (e.g., speak a response with a joking voice and with a big smile on the face of the agent).”).
	Regarding claim 11 (depends on claim 9), SHUKLA discloses:
 	the at least two internal models include a prior knowledge model (e.g., ¶ [0064]: “a hierarchy of preferences”) capable of retrieving prior knowledge information comprising information relating to previous engagement with a user (e.g., ¶ [0064]: “such a hierarchy of preferences may evolve over time based on the party's choices made and likings exhibited during conversations.”) (¶ [0064]: “The term “utility” is hereby defined as preferences of a party identified based on states detected associated with dialogue histories. Utility may be associated with a party in a dialogue, whether the party is a human, the automated companion, or other intelligent devices. A utility for a particular party may represent different states of a world, whether physical, virtual, or even mental. For example, a state may be represented as a particular path along which a dialog walks through in a complex map of the world. At different instances, a current state evolves into a next state based on the interaction between multiple parties. States may also be party dependent, i.e., when different parties participate in an interaction, the states arising from such interaction may vary. A utility associated with a party may be organized as a hierarchy of preferences and such a hierarchy of preferences may evolve over time based on the party's choices made and likings exhibited during conversations. Such preferences, which may be represented as an ordered sequence of choices made out of different options, is what is referred to as utility. The present teaching discloses method and system by which an intelligent automated companion is capable of learning, through a dialogue with a human conversant, the user's utility.”  ¶ [0083]: “The multimodal data understanding generated at layer 2 may be used by DM 510 to determine how to respond. 
To enhance engagement and user experience, the DM 510 may also determine a response based on the estimated mindsets of the user and of the agent from layer 4 as well as the utilities of the user engaged in the dialogue from layer 5. The mindsets of the parties involved in a dialogue may be estimated based on information from Layer 2 (e.g., estimated emotion of a user) and the progress of the dialogue. In some embodiments, the mindsets of a user and of an agent may be estimated dynamically during the course of a dialogue and such estimated mindsets may then be used to learn, together with other data, utilities of users. The learned utilities represent preferences of users in different dialogue scenarios and are estimated based on historic dialogues and the outcomes thereof.”),
 wherein the selected characteristic (¶ [0083]: “how to respond”) is selected based on the prior knowledge information processed using the prior knowledge model (¶ [0083]: “The multimodal data understanding generated at layer 2 may be used by DM 510 to determine how to respond. To enhance engagement and user experience, the DM 510 may also determine a response based on the estimated mindsets of the user and of the agent from layer 4 as well as the utilities of the user engaged in the dialogue from layer 5. The mindsets of the parties involved in a dialogue may be estimated based on information from Layer 2 (e.g., estimated emotion of a user) and the progress of the dialogue. In some embodiments, the mindsets of a user and of an agent may be estimated dynamically during the course of a dialogue and such estimated mindsets may then be used to learn, together with other data, utilities of users. The learned utilities represent preferences of users in different dialogue scenarios and are estimated based on historic dialogues and the outcomes thereof.”  ¶ [0084]: “In each dialogue of a certain topic, the dialogue manager 510 bases its control of the dialogue on relevant dialogue tree(s) that may or may not be associated with the topic (e.g., may inject small talks to enhance engagement). To generate a response to a user in a dialogue, the dialogue manager 510 may also consider additional information such as a state of the user, the surrounding of the dialogue scene, the emotion of the user, the estimated mindsets of the user and the agent, and the known preferences of the user (utilities).”).
	Regarding claim 12 (depends on claim 9), SHUKLA discloses: 
	the at least two internal models includes a speech recognition model capable of parsing a speech sentiment from the speech information (¶ [0082]: “mental (e.g., certain emotion such as stress of the user estimated based on, e.g., the tune of the speech, a facial expression, or a gesture of the user).”  ¶ [0089]: “recognize various emotions of a party based on both visual information from a camera and the synchronized audio information. For instance, a happy emotion may often be accompanied with a smile face and a certain acoustic cue.”) and a facial feature recognition model capable of detecting a facial feature sentiment based on the facial expression information (¶ [0082]: “mental (e.g., certain emotion such as stress of the user estimated based on, e.g., the tune of the speech, a facial expression, or a gesture of the user).”  ¶ [0089]: “recognize various emotions of a party based on both visual information from a camera and the synchronized audio information. For instance, a happy emotion may often be accompanied with a smile face and a certain acoustic cue.”) (¶ [0082]: “Processed features of the multi-modal data may be further processed at layer 2 to achieve language understanding and/or multi-modal data understanding including visual, textual, and any combination thereof. Some of such understanding may be directed to a single modality, such as speech understanding, and some may be directed to an understanding of the surrounding of the user engaging in a dialogue based on integrated information. 
Such understanding may be physical (e.g., recognize certain objects in the scene), perceivable (e.g., recognize what the user said, or certain significant sound, etc.), or mental (e.g., certain emotion such as stress of the user estimated based on, e.g., the tune of the speech, a facial expression, or a gesture of the user).”  ¶ [0089]: “On the input side, the processing level may include speech processing module for performing, e.g., speech recognition based on audio signal obtained from an audio sensor (microphone) to understand what is being uttered in order to determine how to respond. The audio signal may also be recognized to generate text information for further analysis. The audio signal from the audio sensor may also be used by an emotion recognition processing module. The emotion recognition module may be designed to recognize various emotions of a party based on both visual information from a camera and the synchronized audio information. For instance, a happy emotion may often be accompanied with a smile face and a certain acoustic cue. The text information obtained via speech recognition may also be used by the emotion recognition module, as a part of the indication of the emotion, to estimate the emotion involved.”  ¶ [0066]: “In layer 1, multi-modal data may be acquired via sensors such as camera, microphone, keyboard, display, speakers, chat bubble, renderer, or other user interface elements. Such multi-modal data may be analyzed to estimated or infer various features that may be used to infer higher level characteristics such as expression, characters, gesture, emotion, action, attention, intent, etc.”),
 	wherein the selected characteristic is a sentiment common among the speech sentiment and the facial feature sentiment (e.g., ¶ [0056]: “emotion/mindset of the user”;  ¶ [0056]: “the user appears to be bored and become impatient”; ¶ [0072]: “the user appears sad, not smiling, the user's speech is slow with a low voice”;  ¶ [0089]: “recognize various emotions of a party based on both visual information from a camera and the synchronized audio information.”  ¶ [0089]: “a happy emotion may often be accompanied with a smile face and a certain acoustic cue.”) (¶ [0054]: “As illustrated, during a human machine dialogue, a user, as the human conversant in the dialogue, may communicate across the network 120 with an agent device or the user interaction engine 140. Such communication may involve data in multiple modalities such as audio, video, text, etc. Via a user device, a user can send data (e.g., a request, audio signal representing an utterance of the user, or a video of the scene surrounding the user) and/or receive data (e.g., text or audio response from an agent device). In some embodiments, user data in multiple modalities, upon being received by an agent device or the user interaction engine 140, may be analyzed to understand the human user's speech or gesture so that the user's emotion or intent may be estimated and used to determine a response to the user.”   ¶ [0056]: “In some embodiments, such inputs from the user's site and processing results thereof may also be transmitted to the user interaction engine 140 for facilitating the user interaction engine 140 to better understand the specific situation associated with the dialogue so that the user interaction engine 140 may determine the state of the dialogue, emotion/mindset of the user, and to generate a response that is based on the specific situation of the dialogue and the intended purpose of the dialogue (e.g., for teaching a child the English vocabulary). 
For example, if information received from the user device indicates that the user appears to be bored and become impatient, the user interaction engine 140 may determine to change the state of dialogue to a topic that is of interest of the user (e.g., based on the information from the user information database 130) in order to continue to engage the user in the conversation.”   ¶ [0046]: “The present teaching aims to address the deficiencies of the traditional human machine dialogue systems and to provide methods and systems that enables a more effective and realistic human to machine dialogue. The present teaching incorporates artificial intelligence in an automated companion with an agent device in conjunction with the backbone support from a user interaction engine so that the automated companion can conduct a dialogue based on continuously monitored multimodal data indicative of the surrounding of the dialogue, adaptively estimating the mindset/emotion/intent of the participants of the dialogue, and adaptively adjust the conversation strategy based on the dynamically changing information/estimates/contextual information.”  ¶ [0066]: “In layer 1, multi-modal data may be acquired via sensors such as camera, microphone, keyboard, display, speakers, chat bubble, renderer, or other user interface elements. Such multi-modal data may be analyzed to estimated or infer various features that may be used to infer higher level characteristics such as expression, characters, gesture, emotion, action, attention, intent, etc. Such higher level characteristics may be obtained by processing units at layer 2 and the used by components of higher layers, via the internal API as shown in FIG. 4A, to e.g., intelligently infer or estimate additional information related to the dialogue at higher conceptual levels. 
For example, the estimated emotion, attention, or other characteristics of a participant of a dialogue obtained at layer 2 may be used to estimate the mindset of the participant. In some embodiments, such mindset may also be estimated at layer 4 based on additional information, e.g., recorded surrounding environment or other auxiliary information in such surrounding environment such as sound.”  ¶ [0072]: “Based on acquired multi-modal data, analysis may be performed by the automated companion (e.g., by the front end user device or by the backend user interaction engine 140) to assess the attitude, emotion, mindset, and utility of the users. For example, based on visual data analysis, the automated companion may detect that the user appears sad, not smiling, the user's speech is slow with a low voice. The characterization of the user's states in the dialogue may be performed at layer 2 based on multi-model data acquired at layer 1. Based on such detected observations, the automated companion may infer (at 406) that the user is not that interested in the current topic and not that engaged. Such inference of emotion or mental state of the user may, for instance, be performed at layer 4 based on characterization of the multi-modal data associated with the user.”  ¶ [0085]: “The form in which the response is to be delivered may be determined based on information from multiple sources, e.g., the user's emotion (e.g., if the user is a child who is not happy, the response may be rendered in a gentle voice)”  ¶ [0089]: “On the input side, the processing level may include speech processing module for performing, e.g., speech recognition based on audio signal obtained from an audio sensor (microphone) to understand what is being uttered in order to determine how to respond. The audio signal may also be recognized to generate text information for further analysis. The audio signal from the audio sensor may also be used by an emotion recognition processing module. 
The emotion recognition module may be designed to recognize various emotions of a party based on both visual information from a camera and the synchronized audio information. For instance, a happy emotion may often be accompanied with a smile face and a certain acoustic cue. The text information obtained via speech recognition may also be used by the emotion recognition module, as a part of the indication of the emotion, to estimate the emotion involved.”), and 
  	wherein the determined action is determined based on the sentiment (¶ [0054]: “the user's emotion or intent may be estimated and used to determine a response to the user.”;  ¶ [0056]: “determine the state of the dialogue, emotion/mindset of the user, and to generate a response that is based on the specific situation of the dialogue”) (¶ [0054]: “As illustrated, during a human machine dialogue, a user, as the human conversant in the dialogue, may communicate across the network 120 with an agent device or the user interaction engine 140. Such communication may involve data in multiple modalities such as audio, video, text, etc. Via a user device, a user can send data (e.g., a request, audio signal representing an utterance of the user, or a video of the scene surrounding the user) and/or receive data (e.g., text or audio response from an agent device). In some embodiments, user data in multiple modalities, upon being received by an agent device or the user interaction engine 140, may be analyzed to understand the human user's speech or gesture so that the user's emotion or intent may be estimated and used to determine a response to the user.”   ¶ [0056]: “In some embodiments, such inputs from the user's site and processing results thereof may also be transmitted to the user interaction engine 140 for facilitating the user interaction engine 140 to better understand the specific situation associated with the dialogue so that the user interaction engine 140 may determine the state of the dialogue, emotion/mindset of the user, and to generate a response that is based on the specific situation of the dialogue and the intended purpose of the dialogue (e.g., for teaching a child the English vocabulary). 
For example, if information received from the user device indicates that the user appears to be bored and become impatient, the user interaction engine 140 may determine to change the state of dialogue to a topic that is of interest of the user (e.g., based on the information from the user information database 130) in order to continue to engage the user in the conversation.”  ¶ [0046]: “The present teaching aims to address the deficiencies of the traditional human machine dialogue systems and to provide methods and systems that enables a more effective and realistic human to machine dialogue. The present teaching incorporates artificial intelligence in an automated companion with an agent device in conjunction with the backbone support from a user interaction engine so that the automated companion can conduct a dialogue based on continuously monitored multimodal data indicative of the surrounding of the dialogue, adaptively estimating the mindset/emotion/intent of the participants of the dialogue, and adaptively adjust the conversation strategy based on the dynamically changing information/estimates/contextual information.”   ¶ [0067]: “The estimated mindsets of parties, whether related to humans or the automated companion (machine), may be relied on by the dialogue management at layer 3, to determine, e.g., how to carry on a conversation with a human conversant.”  ¶ [0073]: “To respond to the user's current state (not engaged), the automated companion may determine to perk up the user in order to better engage the user. In this illustrated example, the automated companion may leverage what is available in the conversation environment by uttering a question to the user at 408: “Would you like to play a game?” Such a question may be delivered in an audio form as speech by converting text to speech, e.g., using customized voices individualized for the user. 
In this case, the user may respond by uttering, at 410, “Ok.” Based on the continuously acquired multi-model data related to the user, it may be observed, e.g., via processing at layer 2, that in response to the invitation to play a game, the user's eyes appear to be wandering, and in particular that the user's eyes may gaze towards where the basketball is located. At the same time, the automated companion may also observe that, once hearing the suggestion to play a game, the user's facial expression changes from “sad” to “smiling.” Based on such observed characteristics of the user, the automated companion may infer, at 412, that the user is interested in basketball.”  ¶ [0074]: “Based on the acquired new information and the inference based on that, the automated companion may decide to leverage the basketball available in the environment to make the dialogue more engaging for the user yet still achieving the educational goal for the user. In this case, the dialogue management at layer 3 may adapt the conversion to talk about a game and leverage the observation that the user gazed at the basketball in the room to make the dialogue more interesting to the user yet still achieving the goal of, e.g., educating the user. In one example embodiment, the automated companion generates a response, suggesting the user to play a spelling game” (at 414) and asking the user to spell the word “basketball.””  ¶ [0075]: “Given the adaptive dialogue strategy of the automated companion in light of the observations of the user and the environment, the user may respond providing the spelling of word “basketball.” (at 416). Observations are continuously made as to how enthusiastic the user is in answering the spelling question. If the user appears to respond quickly with a brighter attitude, determined based on, e.g., multi-modal data acquired when the user is answering the spelling question, the automated companion may infer, at 418, that the user is now more engaged. 
To further encourage the user to actively participate in the dialogue, the automated companion may then generate a positive response “Great job!” with instruction to deliver this response in a bright, encouraging, and positive voice to the user.”  ¶ [0083]: “The multimodal data understanding generated at layer 2 may be used by DM 510 to determine how to respond. To enhance engagement and user experience, the DM 510 may also determine a response based on the estimated mindsets of the user and of the agent from layer 4 as well as the utilities of the user engaged in the dialogue from layer 5. The mindsets of the parties involved in a dialogue may be estimated based on information from Layer 2 (e.g., estimated emotion of a user) and the progress of the dialogue. In some embodiments, the mindsets of a user and of an agent may be estimated dynamically during the course of a dialogue and such estimated mindsets may then be used to learn, together with other data, utilities of users.”  ¶ [0085]: “An output of DM 510 corresponds to an accordingly determined response to the user. To deliver a response to the user, the DM 510 may also formulate a way that the response is to be delivered. The form in which the response is to be delivered may be determined based on information from multiple sources, e.g., the user's emotion (e.g., if the user is a child who is not happy, the response may be rendered in a gentle voice), the user's utility (e.g., the user may prefer speech in certain accent similar to his parents'), or the surrounding environment that the user is in (e.g., noisy place so that the response needs to be delivered in a high volume). DM 510 may output the response determined together with such delivery parameters.”  ¶ [0090]: “On the output side of the processing level, when a certain response strategy is determined, such strategy may be translated into specific actions to take by the automated companion to respond to the other party. 
Such action may be carried out by either deliver some audio response or express certain emotion or attitude via certain gesture. When the response is to be delivered in audio, text with words that need to be spoken are processed by a text to speech module to produce audio signals and such audio signals are then sent to the speakers to render the speech as a response. In some embodiments, the speech generated based on text may be performed in accordance with other parameters, e.g., that may be used to control to generate the speech with certain tones or voices. If the response is to be delivered as a physical action, such as a body movement realized on the automated companion, the actions to be taken may also be instructions to be used to generate such body movement.”).
Claim Rejections – 35 USC § 103
 	The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102 of this title, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains.  Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries set forth in Graham v. John Deere Co., 383 U.S. 1, 148 USPQ 459 (1966), that are applied for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows: 

1. Determining the scope and contents of the prior art;
2. Ascertaining the differences between the prior art and the claims at issue;
3. Resolving the level of ordinary skill in the pertinent art; and
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.

   	Claims 1-3 and 5 are rejected under 35 U.S.C. 103 as being unpatentable over SHUKLA et al. (US 2019/0279642) in view of PREVOST et al. (US 6,570,555, hereinafter “PREVOST”).
 	Regarding claim 1, SHUKLA discloses a method for controlling a virtual character (¶ [0009]: “methods, systems, and programming for a computerized intelligent agent.”  ¶ [0062]: “the automated companion 160-a may include reactive expressions in response to a user's response via, e.g., an interactive video cartoon character (e.g., avatar) displayed on, e.g., a screen as part of a face on the automated companion.”), the method comprising:
  	receiving multi-modal input information (¶ [0046]: “continuously monitored multimodal data indicative of the surrounding of the dialogue, adaptively estimating the mindset/emotion/intent of the participants of the dialogue, and adaptively adjust the conversation strategy based on the dynamically changing information/estimates/contextual information.”) from a device (¶ [0054]: “communication may involve data in multiple modalities such as audio, video, text, etc. Via a user device, a user can send data (e.g., a request, audio signal representing an utterance of the user, or a video of the scene surrounding the user) and/or receive data (e.g., text or audio response from an agent device). In some embodiments, user data in multiple modalities, upon being received by an agent device or the user interaction engine 140, may be analyzed to understand the human user's speech or gesture so that the user's emotion or intent may be estimated and used to determine a response to the user.”  ¶ [0058]: “In some embodiment, the multi-modal sensor data may first be processed on the user device and important features in different modalities may be extracted and sent to the user interaction engine 140 so that dialogue may be controlled with an understanding of the context. In some embodiments, the raw multi-modal sensor data may be sent directly to the user interaction engine 140 for processing.”  ¶ [0066]: “multi-modal data may be acquired via sensors such as camera, microphone, keyboard, display, speakers, chat bubble, renderer, or other user interface elements.”   ¶ [0119]: “the audio based speech recognition unit 930 and the lip reading based speech recognizer 950 may respectively receive audio and visual signals as input”), 
 	the multi-modal input information including any of 
 speech information (Abstract: “An audio signal is received that represents a speech of a user engaged in a dialogue.  A visual signal is received that captures the user uttering the speech.  A first speech recognition result is obtained by performing audio based speech recognition based on the audio signal. Based on the visual signal, lip movement of the user is detected and a second speech recognition result is obtained by performing lip reading based speech recognition.”    ¶ [0058]: “the user device 110-a may continually collect multi-modal sensor data related to the user and his/her surroundings, which may be analyzed to detect any information related to the dialogue and used to intelligently control the dialogue in an adaptive manner.”  ¶ [0071]: “To manage the dialogue, the automated companion may activate different sensors during the dialogue to make observation of the user and the surrounding environment. For example, the agent device may acquire multi-modal data about the surrounding environment where the user is in. Such multi-modal data may include audio, visual, or text data. For example, visual data may capture the facial expression of the user. The visual data may also reveal contextual information surrounding the scene of the conversation. For instance, a picture of the scene may reveal that there is a basketball, a table, and a chair, which provides information about the environment and may be leveraged in dialogue management to enhance engagement of the user. 
Audio data may capture not only the speech response of the user but also other peripheral information such as the tone of the response, the manner by which the user utters the response, or the accent of the user.”  ¶ [0110]: “The speech sound detector 1110 is provided to detect, from the input audio data, sounds that likely correspond to human speech activities based on, e.g., models 1120 that characterize human speech sound.”  ¶ [0120]: “Acoustic input data acquired by selected acoustic sensor(s) may then be sent to the audio based speech recognition unit 1530 for speech recognition based on audio data.”),
 facial expression information (Abstract: “A visual signal is received that captures the user uttering the speech.”  Abstract: “Based on the visual signal, lip movement of the user is detected and a second speech recognition result is obtained by performing lip reading based speech recognition.”    ¶ [0062]: “The automated companion may use a camera (320) to observe the user's presence, facial expressions, direction of gaze, surroundings, etc.”  ¶ [0071]: “To manage the dialogue, the automated companion may activate different sensors during the dialogue to make observation of the user and the surrounding environment. For example, the agent device may acquire multi-modal data about the surrounding environment where the user is in. Such multi-modal data may include audio, visual, or text data. For example, visual data may capture the facial expression of the user. The visual data may also reveal contextual information surrounding the scene of the conversation. For instance, a picture of the scene may reveal that there is a basketball, a table, and a chair, which provides information about the environment and may be leveraged in dialogue management to enhance engagement of the user. Audio data may capture not only the speech response of the user but also other peripheral information such as the tone of the response, the manner by which the user utters the response, or the accent of the user.”  ¶ [0120]: “Visual input data acquired by selected visual sensor may then be sent to the lip reading based speech recognition unit 1550 for speech recognition based on visual data.”  ¶ [0120]: “the lip reading based speech recognition unit 1550 performs speech recognition, at 1650, by comparing tracked lip movements (observed in the visual input data) against some lip reading model(s) appropriate for the underlying language for the speech recognition.”),
 and environmental information representing an environment surrounding the device (¶ [0051]: “During a conversation, an agent device may include one or more input mechanisms (e.g., cameras, microphones, touch screens, buttons, etc.) that allow the agent device to capture inputs related to the user or the local environment associated with the conversation. Such inputs may assist the automated companion to develop an understanding of the atmosphere surrounding the conversation (e.g., movements of the user, sound of the environment) and the mindset of the human conversant (e.g., user picks up a ball which may indicates that the user is bored) in order to enable the automated companion to react accordingly and conduct the conversation in a manner that will keep the user interested and engaging.”   ¶ [0055]: “Knowing that the surrounding environment of the user (based on visual information from the user device) shows green trees and lawns,”  ¶ [0058]: “the user device 110-a may continually collect multi-modal sensor data related to the user and his/her surroundings, which may be analyzed to detect any information related to the dialogue and used to intelligently control the dialogue in an adaptive manner.”   ¶ [0071]: “To manage the dialogue, the automated companion may activate different sensors during the dialogue to make observation of the user and the surrounding environment. For example, the agent device may acquire multi-modal data about the surrounding environment where the user is in. Such multi-modal data may include audio, visual, or text data. For example, visual data may capture the facial expression of the user. The visual data may also reveal contextual information surrounding the scene of the conversation. For instance, a picture of the scene may reveal that there is a basketball, a table, and a chair, which provides information about the environment and may be leveraged in dialogue management to enhance engagement of the user. 
Audio data may capture not only the speech response of the user but also other peripheral information such as the tone of the response, the manner by which the user utters the response, or the accent of the user.”   ¶ [0110]: “In some embodiments, depending on application needs, it is possible to also detect other types of sounds such as environmental sounds (beach, street, sports center, etc.), special event sounds (explosion, fire alarm, alerts, etc.). In this case, the 1120 may also include models that can be used to detect different types of sound in the dialogue scene.”); 
 	displaying the virtual character (e.g., ¶ [0062]: “an interactive video cartoon character (e.g., avatar) displayed”) in a position in a display environment (e.g., ¶ [0062]: “on, e.g., a screen as part of a face on the automated companion.”) (¶ [0062]: “Other exemplary functionalities of the automated companion 160-a may include reactive expressions in response to a user's response via, e.g., an interactive video cartoon character (e.g., avatar) displayed on, e.g., a screen as part of a face on the automated companion.”   ¶ [0068]: “Such information may include common configuration to be applied to a dialogue (e.g., character of the agent device is an avatar, voice preferred, or a virtual environment to be created for the dialogue, etc.), a current state of the dialogue, a current dialogue history, known user preferences, estimated user intent/emotion/mindset, etc.”);
 	implementing at least two internal models (e.g., ¶ [0042]: “detecting a spoken language based on multiple model based speech recognition”; ¶ [0094]: “sound models 715”; ¶ [0094]: “speech lip movement models 725”; ¶ [0106]: “a lip detection model 930.” ¶ [0110]: “based on, e.g., models 1120 that characterize human speech sound.”  ¶ [0110]: “models that can be used to detect different types of sound in the dialogue scene.”  ¶ [0126]: “the lip shape/sound model(s)”) to identify characteristics of the multi-modal input information (e.g., ¶ [0110]: “to detect, from the input audio data, sounds that likely correspond to human speech activities based on, e.g., models 1120 that characterize human speech sound.” ¶ [0110]: “models that can be used to detect different types of sound in the dialogue scene.”   ¶ [0095]: “audio cues that reveal human speech activities and video cues related to lip movement that evidences human speech”) (¶ [0094]: “The audio based sound source estimator 710 processes audio data collected from a dialogue scene and estimates one or more sound sources (for speech) based on sound models 715 (e.g., acoustic models for human speech). The visual based sound source estimator 720 is provided for estimating one or more candidate sources (directions in a dialogue scene) of speech activities in a dialogue scene based on visual cues. The visual based sound source estimator 720 processes image data collected from the dialogue scene, analyzes the visual information based on speech lip movement models 725 (e.g., visual models for lip movement in speech in certain languages), and estimates candidate sound source(s) where the human speech is occurring. 
The audio based sound source candidates estimated by the audio based sound source estimator 710 and the visual based sound source estimates from 720 are sent, respectively, to the sound source disambiguation unit 730 so that the estimated sound candidates determined based on different cues may be disambiguated to generate estimated source(s) of sound in a dialogue environment.” ¶ [0095]: “an integrated approach by combining audio and video cues, including audio cues that reveal human speech activities and video cues related to lip movement that evidences human speech. In operation, the visual based sound source estimator 720 receives, at 702 of FIG. 7B, image (video) data acquired from the dialogue scene and processes the video data to detect, at 712, lip movement based on speech lip movement models 725 for recognizing speech activities. In some embodiments, the speech lip movement models to be used for the detection may be selected with respect to a certain language.”  ¶ [0120]: “When the audio based speech recognition unit 1530 receives the audio signals from acoustic sensor(s), it performs, at 1630, speech recognition based on speech recognition models 1540”;  ¶ [0120]: “Similarly, when the lip reading based speech recognizer 1550 receives the visual data (video), it performs, at 1650, speech recognition based on lip reading in accordance with lip reading models 1560.”  ¶ [0120]: “Thus, the lip reading based speech recognition unit 1550 performs speech recognition, at 1650, by comparing tracked lip movements (observed in the visual input data) against some lip reading model(s) appropriate for the underlying language for the speech recognition. The appropriate lip reading model may be selected (from the lip reading models 1560) based on, e.g., an input related to language choice.”  ¶ [0126]: “Mapping lip shape and/or lip movement to a sound may involve viseme analysis, where a viseme may correspond to a generic image that is used to describe a particular sound. 
As commonly known, a viseme may be a visual equivalent of a phoneme or acoustic speech sound in a spoken language and can be used by hearing-impaired person to view sounds visually. To derive a viseme, the analysis needed may depend on the underlying spoken language. In the present teaching, the lip shape/sound model(s) from 1960 may be used for determining sounds corresponding to lip shapes. In recognizing visemes associated with a spoken language, an appropriate lip shape/sound model may be selected according to a known current language.”    ¶ [0095]: “automatic speech recognition (ASR)”;  ¶ [0130]: “ In some embodiments, the integration may also be performed at an even lower level. For instance, the integration may be performed based on phonemes estimated based on sound (audio based) or visemes recognized based on lip reading (visual based).  FIG. 21 illustrates an exemplary scheme for integrating audio based speech recognition (ASR) and the lip reading based speech recognition, according to a different embodiment of the present teaching. As shown, speech signal is processed respectively via ASR and video data are processed via lip recognition. In some embodiments, the ASR generates phonemes and the lip reading generates visemes. To integrate the recognition results, comparison is performed between the recognition results. For example, the phonemes from the ASR are converted into visemes and similarity between the visemes converted from phonemes from the ASR and that from the lip reading are assessed. In some embodiments, if they are similar, e.g., the similarity exceeds a certain level, the recognition result from ASR is accepted because it is supported by the lip reading result. If the similarity level of the visemes from ASR and lip reading is below a set level, the visemes may be accepted but the recognition result may be associated with a low confidence score. 
In some embodiments, the automated dialogue companion or the agent may request the user engaged in the dialogue to speak louder so that the next round of recognition may be based on better signals. In some situations, if the similarity is low according to some criterion, the visemes may not be accepted and the automated dialogue companion may react to the situation by letting the user know that what is spoken cannot be discerned and ask the user to say it again.”); 
 	inspecting the identified characteristics of the at least two internal models to determine (¶ [0130]: “similarity between the visemes converted from phonemes from the ASR and that from the lip reading are assessed.”) whether a first identified characteristic of the identified characteristics (e.g., ¶ [0130]: “phonemes estimated based on sound (audio based)”; ¶ [0130]: “ASR generates phonemes”) includes a threshold number of similar features (¶ [0130]: “the phonemes from the ASR are converted into visemes and similarity between the visemes converted from phonemes from the ASR and that from the lip reading are assessed.”  ¶ [0130]: “if they are similar, e.g., the similarity exceeds a certain level,”  ¶ [0130]: “If the similarity level of the visemes from ASR and lip reading is below a set level”) of a second identified characteristic of the identified characteristics (¶ [0130]: “visemes recognized based on lip reading (visual based)”;  ¶ [0130]: “the lip reading generates visemes”) (¶ [0130]: “In some embodiments, the integration may also be performed at an even lower level. For instance, the integration may be performed based on phonemes estimated based on sound (audio based) or visemes recognized based on lip reading (visual based).  FIG. 21 illustrates an exemplary scheme for integrating audio based speech recognition (ASR) and the lip reading based speech recognition, according to a different embodiment of the present teaching. As shown, speech signal is processed respectively via ASR and video data are processed via lip recognition. In some embodiments, the ASR generates phonemes and the lip reading generates visemes. To integrate the recognition results, comparison is performed between the recognition results. For example, the phonemes from the ASR are converted into visemes and similarity between the visemes converted from phonemes from the ASR and that from the lip reading are assessed. 
In some embodiments, if they are similar, e.g., the similarity exceeds a certain level, the recognition result from ASR is accepted because it is supported by the lip reading result. If the similarity level of the visemes from ASR and lip reading is below a set level, the visemes may be accepted but the recognition result may be associated with a low confidence score. In some embodiments, the automated dialogue companion or the agent may request the user engaged in the dialogue to speak louder so that the next round of recognition may be based on better signals. In some situations, if the similarity is low according to some criterion, the visemes may not be accepted and the automated dialogue companion may react to the situation by letting the user know that what is spoken cannot be discerned and ask the user to say it again.”); 
 	comparing the first identified characteristic (claim 2: “the first speech recognition result includes a plurality of phonemes”) and the second identified characteristic (claim 2: “the second speech recognition result includes a plurality of visemes.”) (¶ [0130]: “To integrate the recognition results, comparison is performed between the recognition results. For example, the phonemes from the ASR are converted into visemes and similarity between the visemes converted from phonemes from the ASR and that from the lip reading are assessed. In some embodiments, if they are similar, e.g., the similarity exceeds a certain level, the recognition result from ASR is accepted because it is supported by the lip reading result.” ¶ [0069]: “responses from a user”  NOTE: In other words, a determined response from a user comprises the integrated speech recognition results having the identified similar phonemes and visemes.) against information specific to the virtual character (e.g., ¶ [0069]: “paths which may be taken depending on a response detected from a user”;  NOTE: In other words, the recognized speech (which includes the particular phonemes and visemes) of each response from a user is compared against the paths in the dialogue tree.) included in a virtual character knowledge model (e.g., ¶ [0069]: “a dialogue tree of an on-going dialogue”) (¶ [0069]: “FIG. 4B illustrates a part of a dialogue tree of an on-going dialogue with paths taken based on interactions between the automated companion and a user, according to an embodiment of the present teaching. In this illustrated example, the dialogue management at layer 3 (of the automated companion) may predict multiple paths with which a dialogue, or more generally an interaction, with a user may proceed. In this example, each node may represent a point of the current state of the dialogue and each branch from a node may represent possible responses from a user. 
As shown in this example, at node 1, the automated companion may face with three separate paths which may be taken depending on a response detected from a user. If the user responds with an affirmative response, dialogue tree 400 may proceed from node 1 to node 2. At node 2, a response may be generated for the automated companion in response to the affirmative response from the user and may then be rendered to the user, which may include audio, visual, textual, haptic, or any combination thereof.”  ¶ [0070]: “If, at node 1, the user responses negatively, the path is for this stage is from node 1 to node 10. If the user responds, at node 1, with a “so-so” response (e.g., not negative but also not positive), dialogue tree 400 may proceed to node 3, at which a response from the automated companion may be rendered and there may be three separate possible responses from the user, “No response,” “Positive Response,” and “Negative response,” corresponding to nodes 5, 6, and 7, respectively. Depending on the user's actual response with respect to the automated companion's response rendered at node 3, the dialogue management at layer 3 may then follow the dialogue accordingly. For instance, if the user responds at node 3 with a positive response, the automated companion moves to respond to the user at node 6. Similarly, depending on the user's reaction to the automated companion's response at node 6, the user may further respond with an answer that is correct. In this case, the dialogue state moves from node 6 to node 8, etc. In this illustrated example, the dialogue state during this period moved from node 1, to node 3, to node 6, and to node 8. The traverse through nodes 1, 3, 6, and 8 forms a path consistent with the underlying conversation between the automated companion and a user. As seen in FIG. 
4B, the path representing the dialogue is represented by the solid lines connecting nodes 1, 3, 6, and 8, whereas the paths skipped during a dialogue is represented by the dashed lines.”   ¶ [0084]: “In each dialogue of a certain topic, the dialogue manager 510 bases its control of the dialogue on relevant dialogue tree(s) that may or may not be associated with the topic (e.g., may inject small talks to enhance engagement). To generate a response to a user in a dialogue, the dialogue manager 510 may also consider additional information such as a state of the user, the surrounding of the dialogue scene, the emotion of the user, the estimated mindsets of the user and the agent, and the known preferences of the user (utilities).”  ¶ [0056]: “In some embodiments, such inputs from the user's site and processing results thereof may also be transmitted to the user interaction engine 140 for facilitating the user interaction engine 140 to better understand the specific situation associated with the dialogue so that the user interaction engine 140 may determine the state of the dialogue, emotion/mindset of the user, and to generate a response that is based on the specific situation of the dialogue and the intended purpose of the dialogue (e.g., for teaching a child the English vocabulary). For example, if information received from the user device indicates that the user appears to be bored and become impatient, the user interaction engine 140 may determine to change the state of dialogue to a topic that is of interest of the user (e.g., based on the information from the user information database 130) in order to continue to engage the user in the conversation.”) 
 to select a selected characteristic (¶ [0054]: “to determine a response to the user.”  ¶ [0056]: “determine the state of the dialogue, emotion/mindset of the user, and to generate a response that is based on the specific situation of the dialogue and the intended purpose of the dialogue”) based on determining that the first identified characteristic includes the threshold number of similar features of the second identified characteristic of the identified characteristics (¶ [0130]: “if they are similar, e.g., the similarity exceeds a certain level, the recognition result from ASR is accepted because it is supported by the lip reading result”)  (¶ [0056]: “In some embodiments, such inputs from the user's site and processing results thereof may also be transmitted to the user interaction engine 140 for facilitating the user interaction engine 140 to better understand the specific situation associated with the dialogue so that the user interaction engine 140 may determine the state of the dialogue, emotion/mindset of the user, and to generate a response that is based on the specific situation of the dialogue and the intended purpose of the dialogue (e.g., for teaching a child the English vocabulary). For example, if information received from the user device indicates that the user appears to be bored and become impatient, the user interaction engine 140 may determine to change the state of dialogue to a topic that is of interest of the user (e.g., based on the information from the user information database 130) in order to continue to engage the user in the conversation.”  ¶ [0067]: “The estimated mindsets of parties, whether related to humans or the automated companion (machine), may be relied on by the dialogue management at layer 3, to determine, e.g., how to carry on a conversation with a human conversant. How each dialogue progresses often represent a human user's preferences. 
Such preferences may be captured dynamically during the dialogue at utilities (layer 5). As shown in FIG. 4A, utilities at layer 5 represent evolving states that are indicative of parties' evolving preferences, which can also be used by the dialogue management at layer 3 to decide the appropriate or intelligent way to carry on the interaction.”   ¶ [0069]: “FIG. 4B illustrates a part of a dialogue tree of an on-going dialogue with paths taken based on interactions between the automated companion and a user, according to an embodiment of the present teaching. In this illustrated example, the dialogue management at layer 3 (of the automated companion) may predict multiple paths with which a dialogue, or more generally an interaction, with a user may proceed. In this example, each node may represent a point of the current state of the dialogue and each branch from a node may represent possible responses from a user. As shown in this example, at node 1, the automated companion may face with three separate paths which may be taken depending on a response detected from a user. If the user responds with an affirmative response, dialogue tree 400 may proceed from node 1 to node 2. At node 2, a response may be generated for the automated companion in response to the affirmative response from the user and may then be rendered to the user, which may include audio, visual, textual, haptic, or any combination thereof.”   ¶ [0084]: “In each dialogue of a certain topic, the dialogue manager 510 bases its control of the dialogue on relevant dialogue tree(s) that may or may not be associated with the topic (e.g., may inject small talks to enhance engagement). 
To generate a response to a user in a dialogue, the dialogue manager 510 may also consider additional information such as a state of the user, the surrounding of the dialogue scene, the emotion of the user, the estimated mindsets of the user and the agent, and the known preferences of the user (utilities).”  ¶ [0085]: “An output of DM 510 corresponds to an accordingly determined response to the user. To deliver a response to the user, the DM 510 may also formulate a way that the response is to be delivered. The form in which the response is to be delivered may be determined based on information from multiple sources, e.g., the user's emotion (e.g., if the user is a child who is not happy, the response may be rendered in a gentle voice), the user's utility (e.g., the user may prefer speech in certain accent similar to his parents'), or the surrounding environment that the user is in (e.g., noisy place so that the response needs to be delivered in a high volume). DM 510 may output the response determined together with such delivery parameters.”  ¶ [0130]: “In some embodiments, the integration may also be performed at an even lower level. For instance, the integration may be performed based on phonemes estimated based on sound (audio based) or visemes recognized based on lip reading (visual based).  FIG. 21 illustrates an exemplary scheme for integrating audio based speech recognition (ASR) and the lip reading based speech recognition, according to a different embodiment of the present teaching. As shown, speech signal is processed respectively via ASR and video data are processed via lip recognition. In some embodiments, the ASR generates phonemes and the lip reading generates visemes. To integrate the recognition results, comparison is performed between the recognition results. For example, the phonemes from the ASR are converted into visemes and similarity between the visemes converted from phonemes from the ASR and that from the lip reading are assessed. 
In some embodiments, if they are similar, e.g., the similarity exceeds a certain level, the recognition result from ASR is accepted because it is supported by the lip reading result. If the similarity level of the visemes from ASR and lip reading is below a set level, the visemes may be accepted but the recognition result may be associated with a low confidence score. In some embodiments, the automated dialogue companion or the agent may request the user engaged in the dialogue to speak louder so that the next round of recognition may be based on better signals. In some situations, if the similarity is low according to some criterion, the visemes may not be accepted and the automated dialogue companion may react to the situation by letting the user know that what is spoken cannot be discerned and ask the user to say it again.”);
  	accessing a library of potential actions associated with the virtual character (e.g., in FIG. 4A, the “Database” storing “Character config,” “Voice config”;   ¶ [0068]: “In some embodiments, some information that may be shared via the internal API may be accessed from an external database. For example, certain configurations related to a desired character for the agent device (a duck) may be accessed from, e.g., an open source database, that provide parameters (e.g., parameters to visually render the duck and/or parameters needed to render the speech from the duck).”  and/or ¶ [0087]: “verbal response generation and/or behavior response generation, as depicted in FIG. 5.”  NOTE:  A database may reasonably be interpreted as being “a library.”) to determine an action that matches the selected characteristic (e.g., ¶ [0087]: “To deliver a response, a deliverable form of the response may be generated via, e.g., verbal response generation and/or behavior response generation, as depicted in FIG. 5. Such a response in its determined deliverable form(s) may then be used by a renderer to actual render the response in its intended form(s). For a deliverable form in a natural language, the text of the response may be used to synthesize a speech signal via, e.g., text to speech techniques, in accordance with the delivery parameters (e.g., volume, accent, style, etc.). For any response or part thereof, that is to be delivered in a non-verbal form(s), e.g., with a certain expression, the intended non-verbal expression may be translated into, e.g., via animation, control signals that can be used to control certain parts of the agent device (physical representation of the automated companion) to perform certain mechanical movement to deliver the non-verbal expression of the response, e.g., nodding head, shrug shoulders, or whistle. In some embodiments, to deliver a response, certain software components may be invoked to render a different facial expression of the agent device. 
Such rendition(s) of the response may also be simultaneously carried out by the agent (e.g., speak a response with a joking voice and with a big smile on the face of the agent).”   ¶ [0090]: “On the output side of the processing level, when a certain response strategy is determined, such strategy may be translated into specific actions to take by the automated companion to respond to the other party. Such action may be carried out by either deliver some audio response or express certain emotion or attitude via certain gesture. When the response is to be delivered in audio, text with words that need to be spoken are processed by a text to speech module to produce audio signals and such audio signals are then sent to the speakers to render the speech as a response. In some embodiments, the speech generated based on text may be performed in accordance with other parameters, e.g., that may be used to control to generate the speech with certain tones or voices. If the response is to be delivered as a physical action, such as a body movement realized on the automated companion, the actions to be taken may also be instructions to be used to generate such body movement.”)   (¶ [0068]: “In some embodiments, through the internal API, various layers (2-5) may access information created by or stored at other layers to support the processing. Such information may include common configuration to be applied to a dialogue (e.g., character of the agent device is an avatar, voice preferred, or a virtual environment to be created for the dialogue, etc.), a current state of the dialogue, a current dialogue history, known user preferences, estimated user intent/emotion/mindset, etc. In some embodiments, some information that may be shared via the internal API may be accessed from an external database. 
For example, certain configurations related to a desired character for the agent device (a duck) may be accessed from, e.g., an open source database, that provide parameters (e.g., parameters to visually render the duck and/or parameters needed to render the speech from the duck).”  ¶ [0069]: “At node 2, a response may be generated for the automated companion in response to the affirmative response from the user and may then be rendered to the user, which may include audio, visual, textual, haptic, or any combination thereof.”  ¶ [0081]: “Layer 1 also handles information rendering of a response from the automated dialogue companion to a user. In some embodiments, the rendering is performed by an agent device, e.g., 160-a and examples of such rendering include speech, expression which may be facial or physical acts performed. For instance, an agent device may render a text string received from the user interaction engine 140 (as a response to the user) to speech so that the agent device may utter the response to the user. In some embodiments, the text string may be sent to the agent device with additional rendering instructions such as volume, tone, pitch, etc. which may be used to convert the text string into a sound wave corresponding to an utterance of the content in a certain manner. In some embodiments, a response to be delivered to a user may also include animation, e.g., utter a response with an attitude which may be delivered via, e.g., a facial expression or a physical act such as raising one arm, etc. In some embodiments, the agent may be implemented as an application on a user device. In this situation, rendering of a response from the automated dialogue companion is implemented via the user device, e.g., 110-a (not shown in FIG. 5).”   ¶ [0083]: “The multimodal data understanding generated at layer 2 may be used by DM 510 to determine how to respond. 
To enhance engagement and user experience, the DM 510 may also determine a response based on the estimated mindsets of the user and of the agent from layer 4 as well as the utilities of the user engaged in the dialogue from layer 5. The mindsets of the parties involved in a dialogue may be estimated based on information from Layer 2 (e.g., estimated emotion of a user) and the progress of the dialogue. In some embodiments, the mindsets of a user and of an agent may be estimated dynamically during the course of a dialogue and such estimated mindsets may then be used to learn, together with other data, utilities of users. The learned utilities represent preferences of users in different dialogue scenarios and are estimated based on historic dialogues and the outcomes thereof.”   ¶ [0084]: “In each dialogue of a certain topic, the dialogue manager 510 bases its control of the dialogue on relevant dialogue tree(s) that may or may not be associated with the topic (e.g., may inject small talks to enhance engagement). To generate a response to a user in a dialogue, the dialogue manager 510 may also consider additional information such as a state of the user, the surrounding of the dialogue scene, the emotion of the user, the estimated mindsets of the user and the agent, and the known preferences of the user (utilities).”  ¶ [0085]: “An output of DM 510 corresponds to an accordingly determined response to the user. To deliver a response to the user, the DM 510 may also formulate a way that the response is to be delivered. 
The form in which the response is to be delivered may be determined based on information from multiple sources, e.g., the user's emotion (e.g., if the user is a child who is not happy, the response may be rendered in a gentle voice), the user's utility (e.g., the user may prefer speech in certain accent similar to his parents'), or the surrounding environment that the user is in (e.g., noisy place so that the response needs to be delivered in a high volume). DM 510 may output the response determined together with such delivery parameters.”  ¶ [0086]: “In some embodiments, the delivery of such determined response is achieved by generating the deliverable form(s) of each response in accordance with various parameters associated with the response. In a general case, a response is delivered in the form of speech in some natural language. A response may also be delivered in speech coupled with a particular nonverbal expression as a part of the delivered response, such as a nod, a shake of the head, a blink of the eyes, or a shrug. There may be other forms of deliverable form of a response that is acoustic but not verbal, e.g., a whistle.”  ¶ [0087]: “To deliver a response, a deliverable form of the response may be generated via, e.g., verbal response generation and/or behavior response generation, as depicted in FIG. 5. Such a response in its determined deliverable form(s) may then be used by a renderer to actual render the response in its intended form(s). For a deliverable form in a natural language, the text of the response may be used to synthesize a speech signal via, e.g., text to speech techniques, in accordance with the delivery parameters (e.g., volume, accent, style, etc.). 
For any response or part thereof, that is to be delivered in a non-verbal form(s), e.g., with a certain expression, the intended non-verbal expression may be translated into, e.g., via animation, control signals that can be used to control certain parts of the agent device (physical representation of the automated companion) to perform certain mechanical movement to deliver the non-verbal expression of the response, e.g., nodding head, shrug shoulders, or whistle. In some embodiments, to deliver a response, certain software components may be invoked to render a different facial expression of the agent device. Such rendition(s) of the response may also be simultaneously carried out by the agent (e.g., speak a response with a joking voice and with a big smile on the face of the agent).”),
 the action including both an animation to be performed by the virtual character and associated audio (¶ [0081]: “Layer 1 also handles information rendering of a response from the automated dialogue companion to a user. In some embodiments, the rendering is performed by an agent device, e.g., 160-a and examples of such rendering include speech, expression which may be facial or physical acts performed. For instance, an agent device may render a text string received from the user interaction engine 140 (as a response to the user) to speech so that the agent device may utter the response to the user. In some embodiments, the text string may be sent to the agent device with additional rendering instructions such as volume, tone, pitch, etc. which may be used to convert the text string into a sound wave corresponding to an utterance of the content in a certain manner. In some embodiments, a response to be delivered to a user may also include animation, e.g., utter a response with an attitude which may be delivered via, e.g., a facial expression or a physical act such as raising one arm, etc.”   ¶ [0086]: “In some embodiments, the delivery of such determined response is achieved by generating the deliverable form(s) of each response in accordance with various parameters associated with the response. In a general case, a response is delivered in the form of speech in some natural language. A response may also be delivered in speech coupled with a particular nonverbal expression as a part of the delivered response, such as a nod, a shake of the head, a blink of the eyes, or a shrug.”    ¶ [0087]: “To deliver a response, a deliverable form of the response may be generated via, e.g., verbal response generation and/or behavior response generation, as depicted in FIG. 5. Such a response in its determined deliverable form(s) may then be used by a renderer to actual render the response in its intended form(s). 
For a deliverable form in a natural language, the text of the response may be used to synthesize a speech signal via, e.g., text to speech techniques, in accordance with the delivery parameters (e.g., volume, accent, style, etc.). For any response or part thereof, that is to be delivered in a non-verbal form(s), e.g., with a certain expression, the intended non-verbal expression may be translated into, e.g., via animation, control signals that can be used to control certain parts of the agent device (physical representation of the automated companion) to perform certain mechanical movement to deliver the non-verbal expression of the response, e.g., nodding head, shrug shoulders, or whistle. In some embodiments, to deliver a response, certain software components may be invoked to render a different facial expression of the agent device. Such rendition(s) of the response may also be simultaneously carried out by the agent (e.g., speak a response with a joking voice and with a big smile on the face of the agent).”);  and
  	implementing the determined action by modifying the virtual character  (¶ [0062]: “The exemplary automated companion 160-a as shown in FIG. 3B may also be controlled to “speak” via a speaker (330).”  ¶ [0081]: “Layer 1 also handles information rendering of a response from the automated dialogue companion to a user. In some embodiments, the rendering is performed by an agent device, e.g., 160-a and examples of such rendering include speech, expression which may be facial or physical acts performed. For instance, an agent device may render a text string received from the user interaction engine 140 (as a response to the user) to speech so that the agent device may utter the response to the user. In some embodiments, the text string may be sent to the agent device with additional rendering instructions such as volume, tone, pitch, etc. which may be used to convert the text string into a sound wave corresponding to an utterance of the content in a certain manner. In some embodiments, a response to be delivered to a user may also include animation, e.g., utter a response with an attitude which may be delivered via, e.g., a facial expression or a physical act such as raising one arm, etc. In some embodiments, the agent may be implemented as an application on a user device. In this situation, rendering of a response from the automated dialogue companion is implemented via the user device, e.g., 110-a (not shown in FIG. 5).”  ¶ [0086]: “In a general case, a response is delivered in the form of speech in some natural language. A response may also be delivered in speech coupled with a particular nonverbal expression as a part of the delivered response, such as a nod, a shake of the head, a blink of the eyes, or a shrug.”    ¶ [0087]: “To deliver a response, a deliverable form of the response may be generated via, e.g., verbal response generation and/or behavior response generation, as depicted in FIG. 5. 
Such a response in its determined deliverable form(s) may then be used by a renderer to actual render the response in its intended form(s). For a deliverable form in a natural language, the text of the response may be used to synthesize a speech signal via, e.g., text to speech techniques, in accordance with the delivery parameters (e.g., volume, accent, style, etc.). For any response or part thereof, that is to be delivered in a non-verbal form(s), e.g., with a certain expression, the intended non-verbal expression may be translated into, e.g., via animation, control signals that can be used to control certain parts of the agent device (physical representation of the automated companion) to perform certain mechanical movement to deliver the non-verbal expression of the response, e.g., nodding head, shrug shoulders, or whistle. In some embodiments, to deliver a response, certain software components may be invoked to render a different facial expression of the agent device. Such rendition(s) of the response may also be simultaneously carried out by the agent (e.g., speak a response with a joking voice and with a big smile on the face of the agent).”).
   	SHUKLA fails to explicitly disclose:  implementing the determined action by modifying the virtual character “in the environment presented on the device”.
  	Whereas SHUKLA may not be entirely explicit as to this limitation, PREVOST teaches a method for controlling a virtual character (Abstract: “process multi-modal inputs and direct movements and speech of a synthetic character”), the method comprising: 
 	receiving multi-modal input information (col. 14, lines 49-50: “the understanding module 610 receives inputs from the input manager 520,”  col. 29, lines 21-24: “accepting, using a multi-modal input, data defining a physical space domain”) from a device (col. 11, lines 6-9: “An Input Manager 520 interfaces with all input devices 540, performs signal processing on input data, and routes the results to the reactive component 500 and the deliberative component 510”;  col. 11, lines 58-62: “The input manager 520 is primarily responsible for obtaining data from input devices, converting it into a form acceptable by the rest of the system (via algorithmic signal processing routines), and routing the results to the understanding module 610 and/or reactive component 500.”  col. 3, lines 59-60: “an interface for a system,”  col. 4, lines 3-4: “a method for operating a device,”) (col. 3, lines 65-66: “a multi-modal interface that captures user input,”  col. 6, line 38: “multimodal inputs”;  col. 6, lines 40-41: “for controlling a wide variety of systems and devices,” col. 28, lines 1-2: “obtaining said data from said input devices,”   col. 28, lines 36-38: “multiple input channels, each of said channels configured to transfer at least one of said user inputs to the interface.”  col. 28, lines 40-43: “wherein said multiple input channels are configured to capture at least one of speech, body position, gaze direction, gesture recognition, keyboard inputs, mouse inputs, user ID, and motion detection of said user.”  col. 
28, lines 46-57: “a microphone configured to capture speech of said user; a video camera configured to capture at least one of gestures, body position, and gaze direction of said user; a compass attached to said user and configured to capture a direction of said user; and, wherein said multiple input channels include, a microphone input channel connected to said microphone, a video input channel connected to said video camera, and a direction input channel connected to said compass.”  col. 29, lines 21-26: “accepting, using a multi-modal input, data defining a physical space domain distinct from said virtual environment, said physical space domain including the physical space occupied by the user and the visible representation of said virtual environment;”), 
 	the multi-modal input information including any of speech information (col. 7, lines 7-12: “Information about the world comes from raw input data representing non-verbal behaviors, such as coordinates that specify where the human user is looking, as well as processed data from deliberative modules, such as representations of the meanings of the user's utterances or gestures.” col. 28, lines 40-43: “wherein said multiple input channels are configured to capture at least one of speech, body position, gaze direction, gesture recognition, keyboard inputs, mouse inputs, user ID, and motion detection of said user.” col. 28, lines 46-57: “a microphone configured to capture speech of said user, a video camera configured to capture at least one of gestures, body position, and gaze direction of said user; a compass attached to said user and configured to capture a direction of said user; and, wherein said multiple input channels include, a microphone input channel connected to said microphone, a video input channel connected to said video camera, and a direction input channel connected to said compass.”),
 facial expression information (col. 7, line 12: “gestures”; col. 28, lines 40-43: “wherein said multiple input channels are configured to capture at least one of speech, body position, gaze direction, gesture recognition, keyboard inputs, mouse inputs, user ID, and motion detection of said user.”  col. 28, lines 46-57: “a microphone configured to capture speech of said user; a video camera configured to capture at least one of gestures, body position, and gaze direction of said user; a compass attached to said user and configured to capture a direction of said user; and, wherein said multiple input channels include, a microphone input channel connected to said microphone, a video input channel connected to said video camera, and a direction input channel connected to said compass.”  col. 19, line 48: “Gaze direction sensor ("eye tracker")”; col. 19, line 52: “User Identification: using face recognition software”;  col. 5, line 65 – col. 6, line 6: “There are at least two important roles played by gestures, facial expressions, and intonational patterns in face-to-face interactions. First, these "paraverbal" behaviors can convey semantic content that cannot be recovered from the speech alone, such as when a speaker identifies an object by making a pointing gesture. Second, they can signal subtle interactional cues that regulate the flow of information between interlocutors, such as when a speaker's intonation rises at the end of a phrase in attempting to hold the speaking floor.”  col. 6, lines 11-15: “The present invention provides an architecture for building animated, ally character interfaces with task-based, face-to-face conversational abilities (i.e., the ability to perceive and produce paraverbal behaviors to exchange both semantic and interactional”),
 and environmental information representing an environment surrounding the device (col. 6, line 67 – col. 7, line 2: “the various inputs allow the interface to track (via video or sensor based equipment), or recognize where equipment is located.”  col. 7, lines 7-12: “Information about the world comes from raw input data representing non-verbal behaviors, such as coordinates that specify where the human user is looking, as well as processed data from deliberative modules, such as representations of the meanings of the user's utterances or gestures.” ) (col. 27, lines 64-67: “a plurality of input devices, for use in accepting data defining a physical space domain, said physical space domain including the physical space occupied by the user;”   col. 28, lines 40-43: “wherein said multiple input channels are configured to capture at least one of speech, body position, gaze direction, gesture recognition, keyboard inputs, mouse inputs, user ID, and motion detection of said user.”  col. 28, lines 46-57: “a microphone configured to capture speech of said user; a video camera configured to capture at least one of gestures, body position, and gaze direction of said user; a compass attached to said user and configured to capture a direction of said user; and, wherein said multiple input channels include, a microphone input channel connected to said microphone, a video input channel connected to said video camera, and a direction input channel connected to said compass.”); 
 	displaying the virtual character (e.g., col. 3, line 57: “a conversational character”) in a position in a display environment (e.g., col. 3, line 57-58: “within a virtual space of the character”; Abstract: “a virtual space where the character is displayed”; col. 4, lines 4-5: “displaying a virtual space”) presented on the device    (e.g., Abstract: “the character is displayed”; col. 27, lines 61-63: “a display device, for use in displaying to the user a visible representation of a computer generated virtual space, including an animated virtual character therein;”) (See virtual character being displayed on the screen in the display environment in FIG. 3;  col. 3, lines 56-58: “provide a conversational character that interacts within a virtual space of the character and a physical space of the user.” col. 4, lines 3-4: “a method for operating a device, including the steps of displaying a virtual space;”   col. 5, lines 28-32: “building animated, virtual interface characters that serve as the user's ally in using a complex computational system, providing assistance in gathering information, executing commands or completing tasks.”  col. 5, line 39: “animated character interfaces”;  col. 6, lines 16-30: “Characters developed under the architecture of the present invention have the ability to perceive the user's speech, body position, certain classes of gestures, and general direction of gaze. In response, such characters can speak, make gestures and facial expressions, and direct their gaze appropriately. These characters provide an embodied interface, apparently separate from the underlying system, that the user can simply approach and interact with naturally to gather information or perform some task. An example interaction is shown in FIG. 3, in which the character is explaining how to control the room lighting from a panel display on the podium. 
Instead of having to read a help menu or other pre-scripted response, the user is interacting with the character which works with the user to solve the problem at hand.”  col. 6, lines 45-47: “modeling autonomous anthropomorphized characters who engage in face-to-face interactions.”  col. 6, lines 63-67: “provides a dynamic interaction between the character and the physical space occupied by the user. The character may point to objects in the physical space, or the user may point to objects in a virtual space occupied by the character.” col. 29, lines 17-20: “displaying to the user, using a display device, a visible representation of a computer generated virtual environment, including an animated virtual character therein;”); 
 	implementing at least two internal models (e.g., col. 14, lines 54-56: “applies a set of rules about how the inputs relate to each other in context with the current discourse and knowledge of the domain.”) to identify characteristics of the multi-modal input information (e.g., col. 14, lines 54-56: “applies a set of rules about how the inputs relate to each other in context with the current discourse and knowledge of the domain.”) (col. 4, lines 5-7: “retrieving user inputs from a user in a physical space; and combining both deliberative and reactive processing on said inputs to formulate a response.”  col. 6, lines 47-54: “A reactive architecture gives a character the ability to react immediately to verbal and non-verbal cues without performing any deep linguistic analysis. Such cues allow the character to convincingly signal turn-taking and other regulatory behaviors non-verbally. A deliberative architecture, on the other hand, gives the characters the ability to plan complex dialogue, without the need to respond reactively to non-verbal inputs in real time.”  col. 11, lines 16-21: “The reactive component 500 performs the "action selection" function in the system, determining what the character does at each time step. The deliberative component 510 performs functions such as uni-modal (speech only, for example) and multi-modal understanding of input data (perception), action/response, and action generation.”   col. 11, lines 24-26: “The understanding module 610 performs multimodal sensor unification to understand what the user is doing or communicating.”  col. 11, lines 58-62: “The input manager 520 is primarily responsible for obtaining data from input devices, converting it into a form acceptable by the rest of the system (via algorithmic signal processing routines), and routing the results to the understanding module 610 and/or reactive component 500.”  col. 
14, lines 45-56: “The understanding module 610 is responsible for fusing all input modalities into a coherent understanding of the world, including what the user is doing. To perform its task, the understanding module 610 receives inputs from the input manager 520, accesses knowledge about the domain (static knowledge base) and inferred from the current discourse (dynamic knowledge base), and also accesses the current discourse context (discourse model 720). The understanding module then applies a set of rules about how the inputs relate to each other in context with the current discourse and knowledge of the domain.”); 
 	accessing a library (e.g., col. 28, line 8: “a knowledge base,” e.g., col. 9, lines 45-67: TABLE 3) of potential actions associated with the virtual character (e.g., col. 28, line 11-12: “storing actions by the virtual character within the virtual space;” col. 29, lines 29-30: “mapping actions by the virtual character within the virtual environment”  e.g., the Behaviors in TABLE 3) to determine an action that matches the selected characteristic (e.g., such as in TABLE 3, for the Output Function “Open interaction”, the corresponding Behavior “Look at user. Smile. Headtoss.”) (col. 29, lines 29-35: “mapping actions by the virtual character within the virtual environment to said visible representation, such that when displayed on the display device the actions of the virtual character are perceived by the user as interacting with the physical space occupied by the user;” col. 3, lines 50-51: “determines what action the conversational character takes.” col. 3, lines 61-64: “a processing component that integrates deliberative and reactive processing performed on said user inputs, and an output mechanism for performing actions based on the deliberative and reactive processing.” col. 28, lines 15-25: “an understanding module for use in receiving inputs from the input manager, accessing knowledge about the domain inferred from the current discourse, and fusing all input modalities into a coherent understanding of the users environment; a reactive component for receiving updates from the input manager and understanding module, and using information about the domain and information inferred from the current discourse to determine a current action for said virtual character to perform;” col. 
29, lines 26-35: “mapping, in a knowledge base, physical space domain data, and actions by the user within the physical space domain, to an interaction with the virtual environment, and for mapping actions by the virtual character within the virtual environment to said visible representation, such that when displayed on the display device the actions of the virtual character are perceived by the user as interacting with the physical space occupied by the user;”) (col. 4, lines 14-22: “The reactive processing includes the steps of receiving asynchronous updates of selected of the user inputs and understanding frames concerning the user inputs from said deliberative processing; accessing data from a static knowledge base about a domain and a dynamic knowledge base having inferred information about a current discourse between the user, physical environment, and virtual space; and determining a current action for the virtual space based on the asynchronous updates and data.”  col. 28, lines 8-12: “a knowledge base, for storing physical space domain data, including action inputs by the user within and in relation to said physical space domain, and for further storing actions by the virtual character within the virtual space;”  col. 14, lines 22-31: “The reaction module sends the generation module frames describing complex actions it would like to have performed immediately, along with priorities and optional time deadlines for completion. The generation module will notify the reaction module when a requested action has either been completed or removed from consideration (e.g., due to timeout or conflict with a higher priority action). In one embodiment, personality, mood, and emotion biases or "hints" are encoded within the frames passed to the generation module and response planner.”),
 the action including both an animation to be performed by the virtual character and associated audio (col. 6, lines 16-30: “Characters developed under the architecture of the present invention have the ability to perceive the user's speech, body position, certain classes of gestures, and general direction of gaze. In response, such characters can speak, make gestures and facial expressions, and direct their gaze appropriately.”); and
  	implementing the determined action by modifying the virtual character (col. 6, lines 16-20: “Characters developed under the architecture of the present invention have the ability to perceive the user's speech, body position, certain classes of gestures, and general direction of gaze. In response, such characters can speak, make gestures and facial expressions, and direct their gaze appropriately.”) in the environment (e.g., col. 3, line 57-58: “within a virtual space of the character”; Abstract: “a virtual space where the character is displayed”; col. 4, lines 4-5: “displaying a virtual space”) presented on the device (e.g., Abstract: “the character is displayed”; col. 27, lines 61-63: “a display device, for use in displaying to the user a visible representation of a computer generated virtual space, including an animated virtual character therein;”) and outputting the associated audio  (col. 3, lines 66-67: “a synthetic character configured to respond to said inputs.”  col. 6, lines 16-30: “Characters developed under the architecture of the present invention have the ability to perceive the user's speech, body position, certain classes of gestures, and general direction of gaze. In response, such characters can speak, make gestures and facial expressions, and direct their gaze appropriately. These characters provide an embodied interface, apparently separate from the underlying system, that the user can simply approach and interact with naturally to gather information or perform some task. An example interaction is shown in FIG. 3, in which the character is explaining how to control the room lighting from a panel display on the podium. Instead of having to read a help menu or other pre-scripted response, the user is interacting with the character which works with the user to solve the problem at hand.”  col. 18, lines 3-5: “The generation module then sends the action scheduler low-level action requests for directly controlling the character animation”;    col. 
28, lines 28-34: “a generation module for use in realizing a complex action request from the reactive component by producing one or more coordinated primitive actions, and sending the actions to an action scheduler for performance; and, an action scheduler for taking multiple action requests from the reaction and generation modules and performing out said requests.”  col. 13, lines 7-21: “Actions can be of three different forms: immediate primitive, immediate complex and future. Immediate primitive actions, which are those that need to be executed immediately and consist only of low-level actions (e.g., animation commands) are sent directly to the action scheduler 530. Immediate complex actions, which are those that need to be executed immediately but require elaboration, such as communicative actions, are sent to the generation module 630 for realization. Future actions, which are those that involve the planning of a sequence of actions, some or all of which are to be executed during a later update cycle, are sent to the response planner 620 for elaboration. The response planner 620 can return a plan (sequence of actions) to the reaction module, which the reaction module then caches for future reference.” ). 
 	Thus, in order to obtain a more versatile system for controlling a virtual character having the cumulative features and/or functionalities taught by SHUKLA and PREVOST, it would have been obvious to one of ordinary skill in the art to have modified the method for controlling a virtual character taught by SHUKLA to also incorporate implementing the determined action by modifying the virtual character in the environment presented on the device, as is taught by PREVOST.
 	Regarding claim 2 (depends on claim 1), SHUKLA discloses: 
	the at least two internal models includes a speech recognition model capable of parsing a speech sentiment from the speech information (¶ [0082]: “mental (e.g., certain emotion such as stress of the user estimated based on, e.g., the tune of the speech, a facial expression, or a gesture of the user).”  ¶ [0089]: “recognize various emotions of a party based on both visual information from a camera and the synchronized audio information. For instance, a happy emotion may often be accompanied with a smile face and a certain acoustic cue.”) and a facial feature recognition model capable of detecting a facial feature sentiment based on the facial expression information (¶ [0082]: “mental (e.g., certain emotion such as stress of the user estimated based on, e.g., the tune of the speech, a facial expression, or a gesture of the user).”  ¶ [0089]: “recognize various emotions of a party based on both visual information from a camera and the synchronized audio information. For instance, a happy emotion may often be accompanied with a smile face and a certain acoustic cue.”) (¶ [0082]: “Processed features of the multi-modal data may be further processed at layer 2 to achieve language understanding and/or multi-modal data understanding including visual, textual, and any combination thereof. Some of such understanding may be directed to a single modality, such as speech understanding, and some may be directed to an understanding of the surrounding of the user engaging in a dialogue based on integrated information. 
Such understanding may be physical (e.g., recognize certain objects in the scene), perceivable (e.g., recognize what the user said, or certain significant sound, etc.), or mental (e.g., certain emotion such as stress of the user estimated based on, e.g., the tune of the speech, a facial expression, or a gesture of the user).”  ¶ [0089]: “On the input side, the processing level may include speech processing module for performing, e.g., speech recognition based on audio signal obtained from an audio sensor (microphone) to understand what is being uttered in order to determine how to respond. The audio signal may also be recognized to generate text information for further analysis. The audio signal from the audio sensor may also be used by an emotion recognition processing module. The emotion recognition module may be designed to recognize various emotions of a party based on both visual information from a camera and the synchronized audio information. For instance, a happy emotion may often be accompanied with a smile face and a certain acoustic cue. The text information obtained via speech recognition may also be used by the emotion recognition module, as a part of the indication of the emotion, to estimate the emotion involved.”  ¶ [0066]: “In layer 1, multi-modal data may be acquired via sensors such as camera, microphone, keyboard, display, speakers, chat bubble, renderer, or other user interface elements. Such multi-modal data may be analyzed to estimated or infer various features that may be used to infer higher level characteristics such as expression, characters, gesture, emotion, action, attention, intent, etc.”),
 	wherein the selected characteristic is a sentiment common among the speech sentiment and the facial feature sentiment (e.g., ¶ [0056]: “emotion/mindset of the user”;  ¶ [0056]: “the user appears to be bored and become impatient”; ¶ [0072]: “the user appears sad, not smiling, the user's speech is slow with a low voice”;  ¶ [0089]: “recognize various emotions of a party based on both visual information from a camera and the synchronized audio information.”  ¶ [0089]: “a happy emotion may often be accompanied with a smile face and a certain acoustic cue.”) (¶ [0054]: “As illustrated, during a human machine dialogue, a user, as the human conversant in the dialogue, may communicate across the network 120 with an agent device or the user interaction engine 140. Such communication may involve data in multiple modalities such as audio, video, text, etc. Via a user device, a user can send data (e.g., a request, audio signal representing an utterance of the user, or a video of the scene surrounding the user) and/or receive data (e.g., text or audio response from an agent device). In some embodiments, user data in multiple modalities, upon being received by an agent device or the user interaction engine 140, may be analyzed to understand the human user's speech or gesture so that the user's emotion or intent may be estimated and used to determine a response to the user.”   ¶ [0056]: “In some embodiments, such inputs from the user's site and processing results thereof may also be transmitted to the user interaction engine 140 for facilitating the user interaction engine 140 to better understand the specific situation associated with the dialogue so that the user interaction engine 140 may determine the state of the dialogue, emotion/mindset of the user, and to generate a response that is based on the specific situation of the dialogue and the intended purpose of the dialogue (e.g., for teaching a child the English vocabulary). 
For example, if information received from the user device indicates that the user appears to be bored and become impatient, the user interaction engine 140 may determine to change the state of dialogue to a topic that is of interest of the user (e.g., based on the information from the user information database 130) in order to continue to engage the user in the conversation.”   ¶ [0046]: “The present teaching aims to address the deficiencies of the traditional human machine dialogue systems and to provide methods and systems that enables a more effective and realistic human to machine dialogue. The present teaching incorporates artificial intelligence in an automated companion with an agent device in conjunction with the backbone support from a user interaction engine so that the automated companion can conduct a dialogue based on continuously monitored multimodal data indicative of the surrounding of the dialogue, adaptively estimating the mindset/emotion/intent of the participants of the dialogue, and adaptively adjust the conversation strategy based on the dynamically changing information/estimates/contextual information.”  ¶ [0066]: “In layer 1, multi-modal data may be acquired via sensors such as camera, microphone, keyboard, display, speakers, chat bubble, renderer, or other user interface elements. Such multi-modal data may be analyzed to estimated or infer various features that may be used to infer higher level characteristics such as expression, characters, gesture, emotion, action, attention, intent, etc. Such higher level characteristics may be obtained by processing units at layer 2 and the used by components of higher layers, via the internal API as shown in FIG. 4A, to e.g., intelligently infer or estimate additional information related to the dialogue at higher conceptual levels. 
For example, the estimated emotion, attention, or other characteristics of a participant of a dialogue obtained at layer 2 may be used to estimate the mindset of the participant. In some embodiments, such mindset may also be estimated at layer 4 based on additional information, e.g., recorded surrounding environment or other auxiliary information in such surrounding environment such as sound.”  ¶ [0072]: “Based on acquired multi-modal data, analysis may be performed by the automated companion (e.g., by the front end user device or by the backend user interaction engine 140) to assess the attitude, emotion, mindset, and utility of the users. For example, based on visual data analysis, the automated companion may detect that the user appears sad, not smiling, the user's speech is slow with a low voice. The characterization of the user's states in the dialogue may be performed at layer 2 based on multi-model data acquired at layer 1. Based on such detected observations, the automated companion may infer (at 406) that the user is not that interested in the current topic and not that engaged. Such inference of emotion or mental state of the user may, for instance, be performed at layer 4 based on characterization of the multi-modal data associated with the user.”  ¶ [0085]: “The form in which the response is to be delivered may be determined based on information from multiple sources, e.g., the user's emotion (e.g., if the user is a child who is not happy, the response may be rendered in a gentle voice)”  ¶ [0089]: “On the input side, the processing level may include speech processing module for performing, e.g., speech recognition based on audio signal obtained from an audio sensor (microphone) to understand what is being uttered in order to determine how to respond. The audio signal may also be recognized to generate text information for further analysis. The audio signal from the audio sensor may also be used by an emotion recognition processing module. 
The emotion recognition module may be designed to recognize various emotions of a party based on both visual information from a camera and the synchronized audio information. For instance, a happy emotion may often be accompanied with a smile face and a certain acoustic cue. The text information obtained via speech recognition may also be used by the emotion recognition module, as a part of the indication of the emotion, to estimate the emotion involved.”), and 
  	wherein the determined action is determined based on the sentiment (¶ [0054]: “the user's emotion or intent may be estimated and used to determine a response to the user.”;  ¶ [0056]: “determine the state of the dialogue, emotion/mindset of the user, and to generate a response that is based on the specific situation of the dialogue”) (¶ [0054]: “As illustrated, during a human machine dialogue, a user, as the human conversant in the dialogue, may communicate across the network 120 with an agent device or the user interaction engine 140. Such communication may involve data in multiple modalities such as audio, video, text, etc. Via a user device, a user can send data (e.g., a request, audio signal representing an utterance of the user, or a video of the scene surrounding the user) and/or receive data (e.g., text or audio response from an agent device). In some embodiments, user data in multiple modalities, upon being received by an agent device or the user interaction engine 140, may be analyzed to understand the human user's speech or gesture so that the user's emotion or intent may be estimated and used to determine a response to the user.”   ¶ [0056]: “In some embodiments, such inputs from the user's site and processing results thereof may also be transmitted to the user interaction engine 140 for facilitating the user interaction engine 140 to better understand the specific situation associated with the dialogue so that the user interaction engine 140 may determine the state of the dialogue, emotion/mindset of the user, and to generate a response that is based on the specific situation of the dialogue and the intended purpose of the dialogue (e.g., for teaching a child the English vocabulary). 
For example, if information received from the user device indicates that the user appears to be bored and become impatient, the user interaction engine 140 may determine to change the state of dialogue to a topic that is of interest of the user (e.g., based on the information from the user information database 130) in order to continue to engage the user in the conversation.”  ¶ [0046]: “The present teaching aims to address the deficiencies of the traditional human machine dialogue systems and to provide methods and systems that enables a more effective and realistic human to machine dialogue. The present teaching incorporates artificial intelligence in an automated companion with an agent device in conjunction with the backbone support from a user interaction engine so that the automated companion can conduct a dialogue based on continuously monitored multimodal data indicative of the surrounding of the dialogue, adaptively estimating the mindset/emotion/intent of the participants of the dialogue, and adaptively adjust the conversation strategy based on the dynamically changing information/estimates/contextual information.”   ¶ [0067]: “The estimated mindsets of parties, whether related to humans or the automated companion (machine), may be relied on by the dialogue management at layer 3, to determine, e.g., how to carry on a conversation with a human conversant.”  ¶ [0073]: “To respond to the user's current state (not engaged), the automated companion may determine to perk up the user in order to better engage the user. In this illustrated example, the automated companion may leverage what is available in the conversation environment by uttering a question to the user at 408: “Would you like to play a game?” Such a question may be delivered in an audio form as speech by converting text to speech, e.g., using customized voices individualized for the user. 
In this case, the user may respond by uttering, at 410, “Ok.” Based on the continuously acquired multi-model data related to the user, it may be observed, e.g., via processing at layer 2, that in response to the invitation to play a game, the user's eyes appear to be wandering, and in particular that the user's eyes may gaze towards where the basketball is located. At the same time, the automated companion may also observe that, once hearing the suggestion to play a game, the user's facial expression changes from “sad” to “smiling.” Based on such observed characteristics of the user, the automated companion may infer, at 412, that the user is interested in basketball.”  ¶ [0074]: “Based on the acquired new information and the inference based on that, the automated companion may decide to leverage the basketball available in the environment to make the dialogue more engaging for the user yet still achieving the educational goal for the user. In this case, the dialogue management at layer 3 may adapt the conversion to talk about a game and leverage the observation that the user gazed at the basketball in the room to make the dialogue more interesting to the user yet still achieving the goal of, e.g., educating the user. In one example embodiment, the automated companion generates a response, suggesting the user to play a spelling game” (at 414) and asking the user to spell the word “basketball.””  ¶ [0075]: “Given the adaptive dialogue strategy of the automated companion in light of the observations of the user and the environment, the user may respond providing the spelling of word “basketball.” (at 416). Observations are continuously made as to how enthusiastic the user is in answering the spelling question. If the user appears to respond quickly with a brighter attitude, determined based on, e.g., multi-modal data acquired when the user is answering the spelling question, the automated companion may infer, at 418, that the user is now more engaged. 
To further encourage the user to actively participate in the dialogue, the automated companion may then generate a positive response “Great job!” with instruction to deliver this response in a bright, encouraging, and positive voice to the user.”  ¶ [0083]: “The multimodal data understanding generated at layer 2 may be used by DM 510 to determine how to respond. To enhance engagement and user experience, the DM 510 may also determine a response based on the estimated mindsets of the user and of the agent from layer 4 as well as the utilities of the user engaged in the dialogue from layer 5. The mindsets of the parties involved in a dialogue may be estimated based on information from Layer 2 (e.g., estimated emotion of a user) and the progress of the dialogue. In some embodiments, the mindsets of a user and of an agent may be estimated dynamically during the course of a dialogue and such estimated mindsets may then be used to learn, together with other data, utilities of users.”  ¶ [0085]: “An output of DM 510 corresponds to an accordingly determined response to the user. To deliver a response to the user, the DM 510 may also formulate a way that the response is to be delivered. The form in which the response is to be delivered may be determined based on information from multiple sources, e.g., the user's emotion (e.g., if the user is a child who is not happy, the response may be rendered in a gentle voice), the user's utility (e.g., the user may prefer speech in certain accent similar to his parents'), or the surrounding environment that the user is in (e.g., noisy place so that the response needs to be delivered in a high volume). DM 510 may output the response determined together with such delivery parameters.”  ¶ [0090]: “On the output side of the processing level, when a certain response strategy is determined, such strategy may be translated into specific actions to take by the automated companion to respond to the other party. 
Such action may be carried out by either deliver some audio response or express certain emotion or attitude via certain gesture. When the response is to be delivered in audio, text with words that need to be spoken are processed by a text to speech module to produce audio signals and such audio signals are then sent to the speakers to render the speech as a response. In some embodiments, the speech generated based on text may be performed in accordance with other parameters, e.g., that may be used to control to generate the speech with certain tones or voices. If the response is to be delivered as a physical action, such as a body movement realized on the automated companion, the actions to be taken may also be instructions to be used to generate such body movement.”).
 	Regarding claim 3 (depends on claim 1), SHUKLA discloses:
 	the at least two internal models include a prior knowledge model (e.g., ¶ [0064]: “a hierarchy of preferences”) capable of retrieving prior knowledge information comprising information relating to previous engagement with a user (e.g., ¶ [0064]: “such a hierarchy of preferences may evolve over time based on the party's choices made and likings exhibited during conversations.”) (¶ [0064]: “The term “utility” is hereby defined as preferences of a party identified based on states detected associated with dialogue histories. Utility may be associated with a party in a dialogue, whether the party is a human, the automated companion, or other intelligent devices. A utility for a particular party may represent different states of a world, whether physical, virtual, or even mental. For example, a state may be represented as a particular path along which a dialog walks through in a complex map of the world. At different instances, a current state evolves into a next state based on the interaction between multiple parties. States may also be party dependent, i.e., when different parties participate in an interaction, the states arising from such interaction may vary. A utility associated with a party may be organized as a hierarchy of preferences and such a hierarchy of preferences may evolve over time based on the party's choices made and likings exhibited during conversations. Such preferences, which may be represented as an ordered sequence of choices made out of different options, is what is referred to as utility. The present teaching discloses method and system by which an intelligent automated companion is capable of learning, through a dialogue with a human conversant, the user's utility.”  ¶ [0083]: “The multimodal data understanding generated at layer 2 may be used by DM 510 to determine how to respond. 
To enhance engagement and user experience, the DM 510 may also determine a response based on the estimated mindsets of the user and of the agent from layer 4 as well as the utilities of the user engaged in the dialogue from layer 5. The mindsets of the parties involved in a dialogue may be estimated based on information from Layer 2 (e.g., estimated emotion of a user) and the progress of the dialogue. In some embodiments, the mindsets of a user and of an agent may be estimated dynamically during the course of a dialogue and such estimated mindsets may then be used to learn, together with other data, utilities of users. The learned utilities represent preferences of users in different dialogue scenarios and are estimated based on historic dialogues and the outcomes thereof.”  ),
 wherein the selected characteristic (¶ [0083]: “how to respond”) is selected based on the prior knowledge information processed using the prior knowledge model (¶ [0083]: “The multimodal data understanding generated at layer 2 may be used by DM 510 to determine how to respond. To enhance engagement and user experience, the DM 510 may also determine a response based on the estimated mindsets of the user and of the agent from layer 4 as well as the utilities of the user engaged in the dialogue from layer 5. The mindsets of the parties involved in a dialogue may be estimated based on information from Layer 2 (e.g., estimated emotion of a user) and the progress of the dialogue. In some embodiments, the mindsets of a user and of an agent may be estimated dynamically during the course of a dialogue and such estimated mindsets may then be used to learn, together with other data, utilities of users. The learned utilities represent preferences of users in different dialogue scenarios and are estimated based on historic dialogues and the outcomes thereof.”  ¶ [0084]: “In each dialogue of a certain topic, the dialogue manager 510 bases its control of the dialogue on relevant dialogue tree(s) that may or may not be associated with the topic (e.g., may inject small talks to enhance engagement). To generate a response to a user in a dialogue, the dialogue manager 510 may also consider additional information such as a state of the user, the surrounding of the dialogue scene, the emotion of the user, the estimated mindsets of the user and the agent, and the known preferences of the user (utilities).”).
	Regarding claim 5 (depends on claim 1), SHUKLA discloses that the method further comprises: 
 	instructing the virtual character to perform an initial action representing a query to a user on the device ( ) (¶ [0071]: “FIG. 4C illustrates exemplary a human-agent device interaction and exemplary processing performed by the automated companion, according to an embodiment of the present teaching. As seen from FIG. 4C, operations at different layers may be conducted and together they facilitate intelligent dialogue in a cooperated manner. In the illustrated example, an agent device may first ask a user “How are you doing today?” at 402 to initiate a conversation. In response to utterance at 402, the user may respond with utterance “Ok” at 404. To manage the dialogue, the automated companion may activate different sensors during the dialogue to make observation of the user and the surrounding environment. For example, the agent device may acquire multi-modal data about the surrounding environment where the user is in. Such multi-modal data may include audio, visual, or text data. For example, visual data may capture the facial expression of the user. The visual data may also reveal contextual information surrounding the scene of the conversation. For instance, a picture of the scene may reveal that there is a basketball, a table, and a chair, which provides information about the environment and may be leveraged in dialogue management to enhance engagement of the user. Audio data may capture not only the speech response of the user but also other peripheral information such as the tone of the response, the manner by which the user utters the response, or the accent of the user.” ¶ [0073]: “To respond to the user's current state (not engaged), the automated companion may determine to perk up the user in order to better engage the user. 
In this illustrated example, the automated companion may leverage what is available in the conversation environment by uttering a question to the user at 408: “Would you like to play a game?” Such a question may be delivered in an audio form as speech by converting text to speech, e.g., using customized voices individualized for the user. In this case, the user may respond by uttering, at 410, “Ok.” Based on the continuously acquired multi-model data related to the user, it may be observed, e.g., via processing at layer 2, that in response to the invitation to play a game, the user's eyes appear to be wandering, and in particular that the user's eyes may gaze towards where the basketball is located. At the same time, the automated companion may also observe that, once hearing the suggestion to play a game, the user's facial expression changes from “sad” to “smiling.” Based on such observed characteristics of the user, the automated companion may infer, at 412, that the user is interested in basketball.”),
 wherein the input information represents a response by the user to the query ( ) ( ¶ [0071]: “FIG. 4C illustrates exemplary a human-agent device interaction and exemplary processing performed by the automated companion, according to an embodiment of the present teaching. As seen from FIG. 4C, operations at different layers may be conducted and together they facilitate intelligent dialogue in a cooperated manner. In the illustrated example, an agent device may first ask a user “How are you doing today?” at 402 to initiate a conversation. In response to utterance at 402, the user may respond with utterance “Ok” at 404. To manage the dialogue, the automated companion may activate different sensors during the dialogue to make observation of the user and the surrounding environment. For example, the agent device may acquire multi-modal data about the surrounding environment where the user is in. Such multi-modal data may include audio, visual, or text data. For example, visual data may capture the facial expression of the user. The visual data may also reveal contextual information surrounding the scene of the conversation. For instance, a picture of the scene may reveal that there is a basketball, a table, and a chair, which provides information about the environment and may be leveraged in dialogue management to enhance engagement of the user. Audio data may capture not only the speech response of the user but also other peripheral information such as the tone of the response, the manner by which the user utters the response, or the accent of the user.”  ¶ [0073]: “To respond to the user's current state (not engaged), the automated companion may determine to perk up the user in order to better engage the user. 
In this illustrated example, the automated companion may leverage what is available in the conversation environment by uttering a question to the user at 408: “Would you like to play a game?” Such a question may be delivered in an audio form as speech by converting text to speech, e.g., using customized voices individualized for the user. In this case, the user may respond by uttering, at 410, “Ok.” Based on the continuously acquired multi-model data related to the user, it may be observed, e.g., via processing at layer 2, that in response to the invitation to play a game, the user's eyes appear to be wandering, and in particular that the user's eyes may gaze towards where the basketball is located. At the same time, the automated companion may also observe that, once hearing the suggestion to play a game, the user's facial expression changes from “sad” to “smiling.” Based on such observed characteristics of the user, the automated companion may infer, at 412, that the user is interested in basketball.”). 
  	Claim 4 is rejected under 35 U.S.C. 103 as being unpatentable over SHUKLA et al. (US 2019/0279642) in view of PREVOST et al. (US 6,570,555), further in view of JOHNSON et al. (US 2007/0015121, hereinafter “JOHNSON”).
 	Regarding claim 4 (depends on claim 1), SHUKLA discloses:
 	the internal models include a natural language understanding model configured to derive context and meaning from audio information (¶ [0082]: “achieve language understanding”;  ¶ [0082]: “speech understanding”;  ¶ [0082]: “recognize what the user said”  ¶ [0089]: “On the input side, the processing level may include speech processing module for performing, e.g., speech recognition based on audio signal obtained from an audio sensor (microphone) to understand what is being uttered in order to determine how to respond. The audio signal may also be recognized to generate text information for further analysis. The audio signal from the audio sensor may also be used by an emotion recognition processing module. The emotion recognition module may be designed to recognize various emotions of a party based on both visual information from a camera and the synchronized audio information. For instance, a happy emotion may often be accompanied with a smile face and a certain acoustic cue. The text information obtained via speech recognition may also be used by the emotion recognition module, as a part of the indication of the emotion, to estimate the emotion involved.”  ¶ [0093]: “Another practical challenge to speech recognition in user machine dialogues is to determine the language in which the user is speaking in order for the automated dialogue companion to determine a recognition strategy, e.g., which speech recognition model to be used, to understand the user's utterances and determine responses thereof.”),
 an awareness model configured to identify environmental information (¶ [0053]: “surround information of the conversations”;  ¶ [0082]: “an understanding of the surrounding of the user”;  ¶ [0082]: “understanding may be physical (e.g., recognize certain objects in the scene)”;  ¶ [0085]: “the surrounding environment that the user is in (e.g., noisy place so that the response needs to be delivered in a high volume).”) (¶ [0053]: “Generally speaking, the user interaction engine 140 may control the state and the flow of conversations between users and agent devices. The flow of each of the conversations may be controlled based on different types of information associated with the conversation, e.g., information about the user engaged in the conversation (e.g., from the user information database 130), the conversation history, surround information of the conversations, and/or the real time user feedbacks.”   ¶ [0071]: “The visual data may also reveal contextual information surrounding the scene of the conversation. For instance, a picture of the scene may reveal that there is a basketball, a table, and a chair, which provides information about the environment and may be leveraged in dialogue management to enhance engagement of the user.”   ¶ [0073]: “In this illustrated example, the automated companion may leverage what is available in the conversation environment by uttering a question to the user at 408: “Would you like to play a game?” Such a question may be delivered in an audio form as speech by converting text to speech, e.g., using customized voices individualized for the user. In this case, the user may respond by uttering, at 410, “Ok.” Based on the continuously acquired multi-model data related to the user, it may be observed, e.g., via processing at layer 2, that in response to the invitation to play a game, the user's eyes appear to be wandering, and in particular that the user's eyes may gaze towards where the basketball is located. 
At the same time, the automated companion may also observe that, once hearing the suggestion to play a game, the user's facial expression changes from “sad” to “smiling.” Based on such observed characteristics of the user, the automated companion may infer, at 412, that the user is interested in basketball.”  ¶ [0074]: “Based on the acquired new information and the inference based on that, the automated companion may decide to leverage the basketball available in the environment to make the dialogue more engaging for the user yet still achieving the educational goal for the user. In this case, the dialogue management at layer 3 may adapt the conversion to talk about a game and leverage the observation that the user gazed at the basketball in the room to make the dialogue more interesting to the user yet still achieving the goal of, e.g., educating the user. In one example embodiment, the automated companion generates a response, suggesting the user to play a spelling game” (at 414) and asking the user to spell the word “basketball.””   ¶ [0082]: “Processed features of the multi-modal data may be further processed at layer 2 to achieve language understanding and/or multi-modal data understanding including visual, textual, and any combination thereof. Some of such understanding may be directed to a single modality, such as speech understanding, and some may be directed to an understanding of the surrounding of the user engaging in a dialogue based on integrated information. Such understanding may be physical (e.g., recognize certain objects in the scene), perceivable (e.g., recognize what the user said, or certain significant sound, etc.), or mental (e.g., certain emotion such as stress of the user estimated based on, e.g., the tune of the speech, a facial expression, or a gesture of the user).”).
 	SHUKLA and PREVOST fail to disclose: “a social simulation model configured to identify data relating to a user and other virtual characters.”
  	However, whereas SHUKLA and PREVOST may not be entirely explicit as to this limitation, JOHNSON teaches: 
 	a social simulation model (e.g., ¶ [0067]: “social simulation engine 52”) configured to identify data (e.g., ¶ [0067]: “how each character in the game should respond to the learner's action.”; ¶ [0078]: “a summary of the current level of learner ability 76”; ¶ [0078]: “elements of the characters in the social simulation and their behavior”) relating to a user (e.g., ¶ [0067]: “the learner 41”) and other virtual characters (e.g., ¶ [0067]: “other characters”) (¶ [0067]: “A mission engine module 48 may control the characters in the game world, and determine their responses to the actions of the learner 41 and to other characters. An input manager 50 may interpret an utterance hypothesis 49 and nonverbal behavior 44 of the learner 41, and produce a learner communicative act description 51 that may describe the content of the utterance hypothesis 49 and the meaning of the nonverbal behaviors 44. Communicative acts may be similar to speech acts as commonly defined in linguistics and philosophy of language, but may allow for communication to occur through nonverbal means, as well as through speech. A social simulation engine 52 may then determine how each character in the game should respond to the learner's action.” ¶ [0078]: “FIG. 8 is a data flow diagram illustrating processing components used in a social simulation engine, within an interactive social simulation module, together with data exchanged between module components. The social simulation engine may be initialized with a summary of the current level of learner ability 76 and the current skills/mission 77. The learner ability 76 may be retrieved from the learner model 18, and the skills/missions 77 may be retrieved from social interaction content specifications 126 that may describe elements of the characters in the social simulation and their behavior. 
The learner ability 76 may include the learner's level of mastery of individual skills, and game parameters that determine the level of difficulty of game play, such as whether the learner is a beginner or an experienced player, and whether or not the player should be provided with assistance such as subtitles. The skills/missions 77 description may include a description of the initial state of the scene, the task objectives 89 to be completed in the scene, and/or the skills needed to complete mission objectives.”    ¶ [0084]: “The social simulation may be organized into a set of scenes or situations. For example, in one scene a group of agents might be sitting at a table in a cafe; in another situation an agent playing the role of policeman might be standing in a traffic police kiosk; in yet another scene an agent playing the role of sheikh might be sitting in his living room with his family. In each scene or situation each agent may have a repertoire of communicative acts available to it, appropriate to that scene.”   ¶ [0089]: “FIG. 9 is a screen displaying a virtual aide (a component of a social simulation module) advising learner on what action to perform. The social simulation game may include a special agent: a virtual aide 91, which may provide help and assistance to a learner 41 (FIG. 7) as he proceeds through the game. The virtual aide 91 may accompany the learner's character 92 as a companion or team member. The virtual aide 91 may provide the learner 41 with advice as to what to do, as in FIG. 9, where the virtual aide 91 is suggesting that the learner 41 introduce himself to one of the townspeople, as reflected in the statement 93 "Introduce yourself to the man" in the native language of the learner 41. The virtual aide 91 may also translate for the learner 41 if he or she is having difficulty understanding what a game character is saying. 
The virtual aide 91 may also play a role within the game, responding to actions of other characters 94 or 95 or of the learner 41.”  ¶ [0091]: “As shown in FIG. 8, the social puppet manager 81 may be responsible for coordinating the verbal and nonverbal conduct of agents in conversational groups according to a certain set of behavior rules. Each agent 54 (FIG. 6) may have a corresponding social puppet 82 in the social puppet manager 81. The social puppet manager 81 may choose a communicative function 83 for each agent character to perform, and the social puppet 82 may then determine what communicative behaviors 84 to perform to realize the communicative function 83. These communicative behaviors 84 may then be passed to the action scheduler 57 for execution, which may in turn cause the animated body of the character to perform a combination of body movements in synchronization. Communicative functions may be signaled by other display techniques, such as displaying an image of one character attending to and reacting to the communication of another character (a "reaction shot").”  ¶ [0092]: “FIGS. 10 and 11 are screens displaying characters in a social simulation engaged in communicative behaviors. In FIG. 10, the character 96 signals the communicative function of engaging in the conversation. He does this by performing the communicative behaviors of standing up and facing the player character 97. In FIG. 11, the character 98 performs the communicative function of taking the conversational turn, and characters 99 and 100 perform the communicative function of listening to the character 98. The communicative function of taking the turn is realized by speaking in coordination with gestures such as hand gestures. The communicative function of listening to the speaker is realized by facing and gazing at the speaker.”) 
 	Thus, in order to obtain a more versatile system for controlling a virtual character having the cumulative features and/or functionalities taught by SHUKLA, PREVOST and JOHNSON, it would have been obvious to one of ordinary skill in the art to have modified the method for controlling a virtual character taught by the combination of SHUKLA and PREVOST to also incorporate a social simulation model configured to identify data relating to a user and other virtual characters, as taught by JOHNSON.
 	Claim 6 is rejected under 35 U.S.C. 103 as being unpatentable over SHUKLA et al. (US 2019/0279642) in view of PREVOST et al. (US 6,570,555), further in view of ORVALHO et al. (US 2019/0279410, hereinafter “ORVALHO”).
 	Regarding claim 6 (depends on claim 1), whereas neither SHUKLA nor PREVOST is explicit as to the following limitations, ORVALHO clearly teaches: 
 	sharing an embedded link (¶ [0006]: “generating a selectable link for transmission as part of an electronic message”;  ¶ [0075]: “providing a link to the model to be included in an electronic message to another user”) to a plurality of users (¶ [0066]: “other client devices.”  ¶ [0085]: “The second user may be more than one user since, in some embodiments, the instant message in the example may be sent (with the link included) to multiple recipient users of the player application 720.”) via a network (¶ [0066]: “via a network 606 (e.g., the Internet).”) (¶ [0006]: “a method for creating a customized animatable 3D model for use in an electronic communication between at least two users, the method comprising: receiving input from a first user, the first user using a mobile device, the input being in the form of at least one of an audio stream and a visual stream, the visual stream including at least one image or video; and based on an animatable 3D model and the at least one of the audio stream and the visual stream, automatically generating a dynamically customized animation of the animatable 3D model of a virtual character corresponding to the first user, the generating of the dynamically customized animation comprising performing dynamic conversion of the input, in the form of the at least one of the audio stream and the visual stream, into an expression stream and corresponding time information. The method may further include generating a selectable link for transmission as part of an electronic message, the selectable link linking to the expression stream and the corresponding time information; and causing display of the dynamically customized animatable 3D model to the second user. 
The generating of the selectable link and the causing display may be automatically performed or performed in response to user action.”  ¶ [0066]: “The messaging system 600 may include multiple client devices 602, each of which hosts a number of applications including a messaging client application 604. Each messaging client application 604 may be communicatively coupled to other instances of the messaging client application 604 and a messaging server system 608 via a network 606 (e.g., the Internet). As used herein, the term “client device” may refer to any machine that interfaces to a communications network (such as network 606) to obtain resources from one or more server systems or other client devices.”   ¶ [0075]: “creating a dynamically customized animatable 3D model of a virtual character for a user and providing a link to the model to be included in an electronic message to another user.”  ¶ [0079]: “provide the link to the customized animatable 3D model, the link being part of an instant message, for example, that a recipient of the instant message can click on or otherwise select, and have the dynamically customized 3D animatable model being displayed to the recipient. The link may be to a location in the cloud-based system, so that the link can be provided in an instant message, for example, so the recipient can view a 3D model avatar that automatically and dynamically mimics the determined movements of the sender.”);
 	receiving a selection (¶ [0085]: “in response to selection of the selectable link by the second user”) from any of a set of devices (¶ [0085]: “The second user may be more than one user since, in some embodiments, the instant message in the example may be sent (with the link included) to multiple recipient users”) indicating that the embedded link has been selected (¶ [0083]: “The selectable link 724 in the electronic message 726 may link to the expression stream and the corresponding time information 718. This may be a link to a cloud computing system (e.g., cloud 728) to which the expression stream and the corresponding time information 718 was transmit or streamed.”   NOTE: In order for the expression stream and the corresponding time information 718 to be transmitted or streamed from the cloud 728 (FIG. 7) in response to selection of the selectable link by the second user, the cloud 728 must, by necessity, receive an indication of the selection of the selectable link by the second user, and, as such, “receiving a selection” is inherent.) (¶ [0006]: “a method for creating a customized animatable 3D model for use in an electronic communication between at least two users, the method comprising: receiving input from a first user, the first user using a mobile device, the input being in the form of at least one of an audio stream and a visual stream, the visual stream including at least one image or video; and based on an animatable 3D model and the at least one of the audio stream and the visual stream, automatically generating a dynamically customized animation of the animatable 3D model of a virtual character corresponding to the first user, the generating of the dynamically customized animation comprising performing dynamic conversion of the input, in the form of the at least one of the audio stream and the visual stream, into an expression stream and corresponding time information. 
The method may further include generating a selectable link for transmission as part of an electronic message, the selectable link linking to the expression stream and the corresponding time information; and causing display of the dynamically customized animatable 3D model to the second user. The generating of the selectable link and the causing display may be automatically performed or performed in response to user action.”   ¶ [0079]: “provide the link to the customized animatable 3D model, the link being part of an instant message, for example, that a recipient of the instant message can click on or otherwise select, and have the dynamically customized 3D animatable model being displayed to the recipient. The link may be to a location in the cloud-based system, so that the link can be provided in an instant message, for example, so the recipient can view a 3D model avatar that automatically and dynamically mimics the determined movements of the sender.” ¶ [0084]: “In the example in FIG. 7, the path 730 shows at a high level, from the standpoint of a first user 734 and a second user (not shown for space reasons), an instant message 726 including a link 724 plus other content in the instant message 732 which may be included by the first user 734.”  ¶ [0085]: “At the play 708 stage, automatically or in response to selection of the selectable link by the second user who received the electronic message via a player application 720 on a mobile device 722, causing display of the dynamically customized animatable 3D model to the second user. The second user may be more than one user since, in some embodiments, the instant message in the example may be sent (with the link included) to multiple recipient users of the player application 720.”); and
 	responsive to receiving the selection (¶ [0085]: “in response to selection of the selectable link by the second user who received the electronic message”), transmitting a stream of data (¶ [0079]: “an expression stream and corresponding time information (info) 718”;  FIG. 7: As is clearly shown in FIG. 7, the “Expression Stream + Time Info” is transmitted through cloud 728 to mobile device 722.) to the device of the set of devices that sent the selection (¶ [0081]: “the recipient user’s device”;  ¶ [0085]: “to the second user.” ) to display the virtual character on the device (¶ [0085]: “causing display of the dynamically customized animatable 3D model to the second user.”) (¶ [0006]: “a method for creating a customized animatable 3D model for use in an electronic communication between at least two users, the method comprising: receiving input from a first user, the first user using a mobile device, the input being in the form of at least one of an audio stream and a visual stream, the visual stream including at least one image or video; and based on an animatable 3D model and the at least one of the audio stream and the visual stream, automatically generating a dynamically customized animation of the animatable 3D model of a virtual character corresponding to the first user, the generating of the dynamically customized animation comprising performing dynamic conversion of the input, in the form of the at least one of the audio stream and the visual stream, into an expression stream and corresponding time information. The method may further include generating a selectable link for transmission as part of an electronic message, the selectable link linking to the expression stream and the corresponding time information; and causing display of the dynamically customized animatable 3D model to the second user. 
The generating of the selectable link and the causing display may be automatically performed or performed in response to user action.”   ¶ [0066]: “The messaging system 600 may include multiple client devices 602, each of which hosts a number of applications including a messaging client application 604. Each messaging client application 604 may be communicatively coupled to other instances of the messaging client application 604 and a messaging server system 608 via a network 606 (e.g., the Internet). As used herein, the term “client device” may refer to any machine that interfaces to a communications network (such as network 606) to obtain resources from one or more server systems or other client devices.”  ¶ [0067]: “In the example shown in FIG. 6, each messaging client application 604 is able to communicate and exchange data with another messaging client application 604 and with the messaging server system 608 via the network 606. The data exchanged between messaging client applications 604, and between a messaging client application 604 and the messaging server system 608, may include functions (e.g., commands to invoke functions) as well as payload data (e.g., text, audio, video or other multimedia data).”  ¶ [0079]: “provide the link to the customized animatable 3D model, the link being part of an instant message, for example, that a recipient of the instant message can click on or otherwise select, and have the dynamically customized 3D animatable model being displayed to the recipient. 
The link may be to a location in the cloud-based system, so that the link can be provided in an instant message, for example, so the recipient can view a 3D model avatar that automatically and dynamically mimics the determined movements of the sender.”   ¶ [0082]: “send that customized 3D animatable model to the recipient, e.g., via a selectable link, in an instant message via an instant messaging service,”  ¶ [0083]: “For the send 706 stage, the method may further include, automatically or in response to an action from the first user, generating a selectable link (e.g., 724) for transmission as part of an electronic message (e.g., instant message 726). The selectable link 724 in the electronic message 726 may link to the expression stream and the corresponding time information 718. This may be a link to a cloud computing system (e.g., cloud 728) to which the expression stream and the corresponding time information 718 was transmit or streamed.”  ¶ [0085]: “At the play 708 stage, automatically or in response to selection of the selectable link by the second user who received the electronic message via a player application 720 on a mobile device 722, causing display of the dynamically customized animatable 3D model to the second user. The second user may be more than one user since, in some embodiments, the instant message in the example may be sent (with the link included) to multiple recipient users of the player application 720.”  ¶ [0026]: “introducing the use of animatable 3D models of virtual characters (also known as “avatars”) in electronic messaging. Users of the electronic messaging can be represented by the animatable 3D models.”).
 	Thus, in order to obtain a more versatile method/system for controlling and displaying a virtual character having the cumulative features and/or functionalities taught by SHUKLA, PREVOST and ORVALHO, it would have been obvious to one of ordinary skill in the art to have modified the method for controlling a virtual character taught by the combination of SHUKLA and PREVOST to also incorporate sharing an embedded link to a plurality of users via a network, receiving a selection from any of a set of devices indicating that the link has been selected, and responsive to receiving the selection, transmitting the stream of data to the user device of the set of devices that sent the selection to display the virtual character on the user device, as is clearly taught in the virtual character animation instant messaging method disclosed by ORVALHO.
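 	For illustration only, the sequence of steps recited in claim 6 (sharing one link with a plurality of users, receiving a selection from one device, and transmitting the stream of data to that device) may be sketched as follows. The sketch is not taken from SHUKLA, PREVOST or ORVALHO or from the application as filed; the class name, method names and the example.invalid address are hypothetical assumptions.

```python
import secrets


class LinkServer:
    """Hypothetical server that shares a link and streams data on selection."""

    def __init__(self):
        self._streams = {}    # token -> stream of data behind the link
        self.selections = []  # (token, device_id) pairs received so far

    def share(self, stream, recipients):
        """Create one link and address it to every recipient."""
        token = secrets.token_urlsafe(8)
        self._streams[token] = list(stream)
        link = f"https://example.invalid/c/{token}"  # placeholder address
        return {user: link for user in recipients}

    def select(self, link, device_id):
        """Record the selection, then return the stream for that device only."""
        token = link.rsplit("/", 1)[-1]
        self.selections.append((token, device_id))
        return device_id, self._streams[token]


server = LinkServer()
links = server.share(["model", "expression-0"], ["alice", "bob"])
device, data = server.select(links["bob"], "bob-phone")
```

 	In this sketch the stream is returned only to the device that made the selection, which corresponds to the "responsive to receiving the selection" limitation as read above.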
	Claim 7 is rejected under 35 U.S.C. 103 as being unpatentable over SHUKLA et al. (US 2019/0279642) in view of PREVOST et al. (US 6,570,555), further in view of ORVALHO et al. (US 2019/0279410), further still in view of DIRKSEN et al. (US 2019/0035132, hereinafter “DIRKSEN”).
	Regarding claim 7 (depends on claim 6), ORVALHO further discloses that the method further comprises: 
 	transmitting a first batch of the stream of data (e.g., ¶ [0079]: “an expression stream and corresponding time information (info) 718,”  ¶ [0079]: “an animation script may be generated based on the expression(s) determined from the input audio and/or visual stream. The animation script may be encoded onto an encoded stream”;  ¶ [0081]: “the created customizable 3D animatable model is downloaded once to the recipient user's device”; ¶ [0081]: “small files”) at a first time (e.g., ¶ [0079]: “an expression stream and corresponding time information (info) 718,”  ¶ [0079]: “in real-time”;   ¶ [0081]: “as the audio and/or visual stream is captured from the message sender”;  ¶ [0052]: “the animatable object is dynamically and automatically generated in real-time based on a dynamic user input, for example from a video signal from a camera system.”  ¶ [0055]: “It is to be understood that each operation of the method 300 may be performed in real-time, such that a dynamic user input such as a video signal is permitted to be input to automatically generate a dynamic 3D model that follows a morphology of the user input in real-time.”)  (¶ [0081]: “the created customizable 3D animatable model is downloaded once to the recipient user's device (e.g., mobile device, PC, laptop, etc.) and as the audio and/or visual stream is captured from the message sender, only small files (as mentioned above) would need to be sent, significantly saving bandwidth, reducing latency, etc.”  ¶ [0082]: “send that customized 3D animatable model to the recipient, e.g., via a selectable link, in an instant message via an instant messaging service,”  claim 14: “the animatable 3D model is customizable such that the customized animatable 3D model can be generated therefrom, and the animatable model is: downloaded, for customization processing, from a cloud-based system to the mobile device of the second user.”  NOTE: In other words, the 3D animatable model is initially downloaded and a first batch of the expression stream and corresponding time information is transmitted at the first time.), 
 the first batch including information to initially generate the virtual character (¶ [0081]: “the created customizable 3D animatable model”) on the display of the device (¶ [0085]: “causing display of the dynamically customized animatable 3D model to the second user.”) (¶ [0079]: “For the convert 704 stage, based on an animatable 3D model and the at least one of an audio stream 712 and the visual stream (see e.g., 714), automatically generating a dynamically customized animation of the animatable 3D model of a virtual character corresponding to the first user, the generating of the dynamically customized animation comprising performing dynamic conversion of the input, in the form of the at least one of an audio stream 712 and a visual stream (see e.g., 714), into an expression stream and corresponding time information (info) 718, using the expression decomposer 716 in this example. In various embodiments, since the animatable 3D model of a virtual character is a computer graphic representation having a geometry or mesh, which may be controlled by a rig or control structure, an animation script may be generated based on the expression(s) determined from the input audio and/or visual stream. The animation script may be encoded onto an encoded stream locally or in the cloud and may be sent to a cloud-based system for performing the customized animation of the animatable 3D model of the user, in real-time.”  ¶ [0081]: “the created customizable 3D animatable model is downloaded once to the recipient user's device (e.g., mobile device, PC, laptop, etc.) 
and as the audio and/or visual stream is captured from the message sender, only small files (as mentioned above) would need to be sent, significantly saving bandwidth, reducing latency, etc.”  ¶ [0082]: “send that customized 3D animatable model to the recipient, e.g., via a selectable link, in an instant message via an instant messaging service,”  ¶ [0085]: “At the play 708 stage, automatically or in response to selection of the selectable link by the second user who received the electronic message via a player application 720 on a mobile device 722, causing display of the dynamically customized animatable 3D model to the second user.” ); and 
 	transmitting a second batch of the stream of data (e.g., ¶ [0079]: “an expression stream and corresponding time information (info) 718,”  ¶ [0079]: “an animation script may be generated based on the expression(s) determined from the input audio and/or visual stream. The animation script may be encoded onto an encoded stream”;  ¶ [0081]: “small files”) at a second time after the first time (e.g., ¶ [0079]: “an expression stream and corresponding time information (info) 718,”  ¶ [0079]: “in real-time”;   ¶ [0081]: “as the audio and/or visual stream is captured from the message sender”;  ¶ [0052]: “the animatable object is dynamically and automatically generated in real-time based on a dynamic user input, for example from a video signal from a camera system.”  ¶ [0055]: “It is to be understood that each operation of the method 300 may be performed in real-time, such that a dynamic user input such as a video signal is permitted to be input to automatically generate a dynamic 3D model that follows a morphology of the user input in real-time.”)  (¶ [0079]: “For the convert 704 stage, based on an animatable 3D model and the at least one of an audio stream 712 and the visual stream (see e.g., 714), automatically generating a dynamically customized animation of the animatable 3D model of a virtual character corresponding to the first user, the generating of the dynamically customized animation comprising performing dynamic conversion of the input, in the form of the at least one of an audio stream 712 and a visual stream (see e.g., 714), into an expression stream and corresponding time information (info) 718, using the expression decomposer 716 in this example. In various embodiments, since the animatable 3D model of a virtual character is a computer graphic representation having a geometry or mesh, which may be controlled by a rig or control structure, an animation script may be generated based on the expression(s) determined from the input audio and/or visual stream. 
The animation script may be encoded onto an encoded stream locally or in the cloud and may be sent to a cloud-based system for performing the customized animation of the animatable 3D model of the user, in real-time.  In some embodiments, the animatable 3D model is synced with the user input, and the user input and animation script may be encoded onto an encoded stream that is sent to the cloud-based system to customize movements of the animatable 3D model, and provide the link to the customized animatable 3D model, the link being part of an instant message, for example, that a recipient of the instant message can click on or otherwise select, and have the dynamically customized 3D animatable model being displayed to the recipient. The link may be to a location in the cloud-based system, so that the link can be provided in an instant message, for example, so the recipient can view a 3D model avatar that automatically and dynamically mimics the determined movements of the sender.”  ¶ [0081]: “the created customizable 3D animatable model is downloaded once to the recipient user's device (e.g., mobile device, PC, laptop, etc.) 
and as the audio and/or visual stream is captured from the message sender, only small files (as mentioned above) would need to be sent, significantly saving bandwidth, reducing latency, etc.”  ¶ [0082]: “send that customized 3D animatable model to the recipient, e.g., via a selectable link, in an instant message via an instant messaging service,” ¶ [0085]: “At the play 708 stage, automatically or in response to selection of the selectable link by the second user who received the electronic message via a player application 720 on a mobile device 722, causing display of the dynamically customized animatable 3D model to the second user.”   NOTE:  Since the generation and display of the customized animation is performed in real-time, via real-time conversion of the input audio and/or visual stream into the encoded expression stream and time information, then after the 3D animatable model is downloaded, and after an initial batch of the expression stream and corresponding time information is used to correspondingly animate the 3D model, subsequent batches of the expression stream and corresponding time information, generated in real-time from the ongoing input of the audio and visual streams by the first user, would need to be transmitted to the recipient's device to continue animating the animatable 3D model according to the new audio and visual stream data being input by the first user.  In other words, in order for the recipient to “view a 3D model avatar that automatically and dynamically mimics the determined movements of the sender” (¶ [0079]), clearly a second batch of the expression stream and corresponding time information must be transmitted after initially transmitting the animatable 3D model and/or an initial (or earlier) batch of the expression stream and corresponding time information.),
 the second batch including information (e.g., ¶ [0079]: “an animation script may be generated based on the expression(s) determined from the input audio and/or visual stream. The animation script may be encoded onto an encoded stream”) to output a first action by the virtual character (¶ [0079]: “performing the customized animation of the animatable 3D model of the user, in real-time”; ¶ [0055]: “performed in real-time, such that a dynamic user input such as a video signal is permitted to be input to automatically generate a dynamic 3D model that follows a morphology of the user input in real-time.”) (¶ [0006]: “based on an animatable 3D model and the at least one of the audio stream and the visual stream, automatically generating a dynamically customized animation of the animatable 3D model of a virtual character corresponding to the first user, the generating of the dynamically customized animation comprising performing dynamic conversion of the input, in the form of the at least one of the audio stream and the visual stream, into an expression stream and corresponding time information.”  ¶ [0079]: “For the convert 704 stage, based on an animatable 3D model and the at least one of an audio stream 712 and the visual stream (see e.g., 714), automatically generating a dynamically customized animation of the animatable 3D model of a virtual character corresponding to the first user, the generating of the dynamically customized animation comprising performing dynamic conversion of the input, in the form of the at least one of an audio stream 712 and a visual stream (see e.g., 714), into an expression stream and corresponding time information (info) 718, using the expression decomposer 716 in this example. 
In various embodiments, since the animatable 3D model of a virtual character is a computer graphic representation having a geometry or mesh, which may be controlled by a rig or control structure, an animation script may be generated based on the expression(s) determined from the input audio and/or visual stream. The animation script may be encoded onto an encoded stream locally or in the cloud and may be sent to a cloud-based system for performing the customized animation of the animatable 3D model of the user, in real-time. In some embodiments, the animatable 3D model is synced with the user input, and the user input and animation script may be encoded onto an encoded stream that is sent to the cloud-based system to customize movements of the animatable 3D model,”    ¶ [0081]: “the created customizable 3D animatable model is downloaded once to the recipient user's device (e.g., mobile device, PC, laptop, etc.) and as the audio and/or visual stream is captured from the message sender, only small files (as mentioned above) would need to be sent, significantly saving bandwidth, reducing latency, etc.”  claim 4: “active movements derived from the user input, in the form of the at least one of the audio stream and the visual stream, are used for the conversion into the expression stream and corresponding time information.”  claim 9: “for the generating of the dynamically customized animation comprising the performing dynamic conversion of the input, the dynamic conversion further comprises determining certain movements to apply to the animatable 3D model based on determining the direction, from the visual stream, at least one eye of the first user is looking.”).
 	ORVALHO fails to disclose: “wherein the first batch is discarded at the second time.”
 	However, whereas none of SHUKLA, PREVOST and ORVALHO is explicit as to this limitation, DIRKSEN teaches: 
 	wherein the first batch (¶ [0088]: “a first chunk”… “streamed” … “at VR application startup”) is discarded at the second time (¶ [0088]: “as needed during VR application use, additional chunks can be fetched and” … “chunks no longer needed can be discarded from local storage”) (¶ [0088]: “As mentioned above, character animation data can comprise data pertaining to one or more animation clips. In the example of FIG. 1B, a set of animation clip data includes a simplified rig's joint animation data (e.g., animated joint transforms for the simplified rig's joint hierarchy) and compressed vertex animation data (e.g., vertex offsets for each model control relative to the simplified rig). The compressed vertex animation data is sliced into small chunks (e.g., 256 frames) that can be streamed to a GPU (e.g., GPU 134) asynchronously during runtime without stalling. This animation clip data can be fed from the local machine or streamed from cloud-based servers. In order to ensure that streaming of character animation data to the GPU does not cause hitches, the data import module 118 can implement a centralized scheduler to queue and stream slices of animation as needed. As mentioned above, character animation data can include data pertaining to a plurality of animation clips. In certain embodiments, a first chunk (e.g., 256 frames) of all animation clips can be streamed to GPU memory at VR application startup. In certain embodiments, the data import module 118 can stream character animation data stored locally on a VR device for optimal load performance. This however can lead to a large VR application footprint (e.g., large local storage usage). Alternatively, the data import module 118 can stream the character animation data from a remote server. 
In certain embodiments, as needed during VR application use, additional chunks can be fetched and stored locally in anticipation of being streamed to the GPU, while chunks no longer needed can be discarded from local storage, in a manner that balances availability of local storage and streaming speed between the local machine and cloud-based servers.”).
 	Thus, in order to conserve the availability of local storage on a receiving device, it would have been obvious to one of ordinary skill in the art to have modified the method/system taught by the combination of SHUKLA, PREVOST and ORVALHO so that the first batch is discarded at the second time, as taught by DIRKSEN. 
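 	For illustration only, the claim 7 limitations as construed above (a first batch that initially generates the virtual character, a later second batch that outputs an action, and discarding of the first batch at the second time) may be sketched as follows. The sketch is not taken from ORVALHO, DIRKSEN or the application as filed; the Batch and Receiver names, their fields and the sample payloads are hypothetical assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Batch:
    """One batch of the stream of data (hypothetical structure)."""
    sent_at: float  # time at which the batch is transmitted
    payload: dict   # model data or an action to output


@dataclass
class Receiver:
    """Recipient device that retains only the most recent batch."""
    character: dict = field(default_factory=dict)
    stored: list = field(default_factory=list)
    actions: list = field(default_factory=list)

    def receive(self, batch: Batch) -> None:
        if "model" in batch.payload:
            # First batch: information to initially generate the character.
            self.character = {"model": batch.payload["model"]}
        if "action" in batch.payload:
            # Later batch: information to output an action by the character.
            self.actions.append(batch.payload["action"])
        # Any earlier batch is discarded when the new batch arrives.
        self.stored = [batch]


receiver = Receiver()
first = Batch(sent_at=0.0, payload={"model": "avatar-mesh"})
second = Batch(sent_at=1.0, payload={"action": "wave"})
receiver.receive(first)
receiver.receive(second)
```

 	After the second call only the second batch remains in local storage, while the character generated from the first batch persists, which is the storage-conserving behavior relied upon in the rationale above.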
   	Claim 8 is rejected under 35 U.S.C. 103 as being unpatentable over SHUKLA et al. (US 2019/0279642) in view of PREVOST et al. (US 6,570,555), further in view of RABINOVICH et al. (US 2020/0111262, hereinafter “RABINOVICH”).
 	Regarding claim 8 (depends on claim 1), whereas neither SHUKLA nor PREVOST is explicit as to the following limitations, RABINOVICH teaches:
 	inspecting environmental information to identify a portion of the environment representative of a floor of the environment (¶ [0058]: “An exemplary environment observation module 702 can receive one or more sensor inputs 704a-704n. Sensor inputs 704a-704n can comprise inputs for SLAM. SLAM can be used by a MR system (e.g., MR system 212, 300) to identify physical features in a physical environment and locate those physical features relative to the physical environment and relative to each other. Simultaneously, the MR system (e.g., MR system 212, 300) can locate itself within the physical environment and relative to the physical features. SLAM can construct an understanding of a user's physical environment, which can allow a MR system (e.g., MR system 212, 300) to create a virtual environment that respects and interacts with a user's physical environment. For example, for a MR system (e.g., MR system 212, 300) to display a virtual AI companion near a user, it can be desirable for the MR system to identify a physical floor of the user's physical environment and display a virtual human avatar as standing on the physical floor. In some embodiments, as a user walks around a room, a virtual human avatar can move with the user (like a physical companion), and it can be desirable for the virtual human avatar to recognize physical obstacles (e.g., a table) so that the virtual human avatar does not appear to walk through the table. In some embodiments, it can be desirable for a virtual human avatar to appear as sitting down when a user sits down. It can therefore be beneficial for SLAM to recognize a physical object as a chair and recognize dimensions of the chair so that a MR system (e.g., MR system 212, 300) can display the virtual human avatar as sitting in the chair. 
Integrating a virtual environment displayed to a user with the user's physical environment can create a seamless experience that feels natural to the user, as if the user was interacting with a physical entity.”); and 
 	positioning the virtual character at a first position above the portion of the environment representative of the floor of the environment (¶ [0058]: “An exemplary environment observation module 702 can receive one or more sensor inputs 704a-704n. Sensor inputs 704a-704n can comprise inputs for SLAM. SLAM can be used by a MR system (e.g., MR system 212, 300) to identify physical features in a physical environment and locate those physical features relative to the physical environment and relative to each other. Simultaneously, the MR system (e.g., MR system 212, 300) can locate itself within the physical environment and relative to the physical features. SLAM can construct an understanding of a user's physical environment, which can allow a MR system (e.g., MR system 212, 300) to create a virtual environment that respects and interacts with a user's physical environment. For example, for a MR system (e.g., MR system 212, 300) to display a virtual AI companion near a user, it can be desirable for the MR system to identify a physical floor of the user's physical environment and display a virtual human avatar as standing on the physical floor. In some embodiments, as a user walks around a room, a virtual human avatar can move with the user (like a physical companion), and it can be desirable for the virtual human avatar to recognize physical obstacles (e.g., a table) so that the virtual human avatar does not appear to walk through the table. In some embodiments, it can be desirable for a virtual human avatar to appear as sitting down when a user sits down. It can therefore be beneficial for SLAM to recognize a physical object as a chair and recognize dimensions of the chair so that a MR system (e.g., MR system 212, 300) can display the virtual human avatar as sitting in the chair. 
Integrating a virtual environment displayed to a user with the user's physical environment can create a seamless experience that feels natural to the user, as if the user was interacting with a physical entity.”).
	Thus, in order to obtain a more versatile augmented reality system for controlling a virtual character having the cumulative features and/or functionalities taught by SHUKLA, PREVOST and RABINOVICH, it would have been obvious to one of ordinary skill in the art to have modified the method for controlling a virtual character taught by the combination of SHUKLA and PREVOST to also incorporate inspecting environmental information to identify a portion of the environment representative of a floor and positioning the virtual character at a first position above the portion of the environment representative of the floor, as is clearly taught by RABINOVICH.
   	Claim 10 is rejected under 35 U.S.C. 103 as being unpatentable over SHUKLA et al. (US 2019/0279642) in view of RABINOVICH et al. (US 2020/0111262).
	Regarding claim 10 (depends on claim 9), whereas SHUKLA is not explicit as to the following limitations, RABINOVICH teaches that the at least one processor is further configured to:
 	display the virtual character on the display of the device in a position in the environment derived from the environmental information (¶ [0058]: “An exemplary environment observation module 702 can receive one or more sensor inputs 704a-704n. Sensor inputs 704a-704n can comprise inputs for SLAM. SLAM can be used by a MR system (e.g., MR system 212, 300) to identify physical features in a physical environment and locate those physical features relative to the physical environment and relative to each other. Simultaneously, the MR system (e.g., MR system 212, 300) can locate itself within the physical environment and relative to the physical features. SLAM can construct an understanding of a user's physical environment, which can allow a MR system (e.g., MR system 212, 300) to create a virtual environment that respects and interacts with a user's physical environment. For example, for a MR system (e.g., MR system 212, 300) to display a virtual AI companion near a user, it can be desirable for the MR system to identify a physical floor of the user's physical environment and display a virtual human avatar as standing on the physical floor. In some embodiments, as a user walks around a room, a virtual human avatar can move with the user (like a physical companion), and it can be desirable for the virtual human avatar to recognize physical obstacles (e.g., a table) so that the virtual human avatar does not appear to walk through the table. In some embodiments, it can be desirable for a virtual human avatar to appear as sitting down when a user sits down. It can therefore be beneficial for SLAM to recognize a physical object as a chair and recognize dimensions of the chair so that a MR system (e.g., MR system 212, 300) can display the virtual human avatar as sitting in the chair. 
Integrating a virtual environment displayed to a user with the user's physical environment can create a seamless experience that feels natural to the user, as if the user was interacting with a physical entity.”); and 
 	implement the action that includes both the audio to be outputted on the device and a selected animation to be performed by the virtual character by modifying the virtual character in the environment presented on the device (¶ [0025]: “A mixed reality system can present to the user, for example using a transmissive display and/or one or more speakers (which may, for example, be incorporated into a wearable head device), a mixed reality environment (“MRE”) that combines aspects of a real environment and a virtual environment.”  ¶ [0062]: “An exemplary user observation module 708 can receive one or more sensor inputs 710a-710n (which can correspond to sensor inputs 704a-704n). Sensor inputs 710a-710n can capture information about a user and a user's response to various stimuli in a MRE. In some embodiments, sensor inputs 710a-710n can capture a user's explicit response to various stimuli in a MRE. For example, sensor inputs 710a-710n can comprise an audio signal captured by one or more microphones on a MR system (e.g., MR system 212, 300). In some embodiments, a user can state aloud “I like that,” which can be recorded by one or more microphones on a MR system (e.g., MR system 212, 300). The one or more microphones can process the audio signal to transcribe the user's speech, and this transcription can be fed into, for example, a natural language processing unit to determine a meaning behind the spoken words. In some embodiments, a MR system (e.g., MR system 212, 300) can determine that the audio signal originated from a user wearing the MR system. For example, the audio signal can be processed and compared to one or more previous known recordings of the user's voice to determine if the user is the speaker. 
In other embodiments, two microphones positioned on a MR system (e.g., MR system 212, 300) can be equidistant from a user's mouth; the audio signals captured by the two microphones can therefore contain approximately the same speech signal at approximately the same amplitude, and this information can be used to determine that the user is the speaker.”  ¶ [0072]: “Database 802 can be used to present a virtual companion in a MRE, and database 802 can comprise a variety of information. For example, database 802 can comprise a memory graph 804a (which can correspond to memory graph 701), and memory graph 804a can represent all (or at least a portion of) known and/or learned information about a user. Database 802 can also comprise scripted information 804b. Scripted information 804b can include scripted animations and/or poses that a MR system (e.g., MR system 212, 300) can use to render a virtual companion as a human avatar. For example, scripted information 804b can comprise a recording of a human actor walking, sitting, and running, which can have been animated (e.g., into a mesh animation). Scripted information 804b can also comprise voice recordings of human actors, which can be broken down into linguistic building blocks and used to synthesize a human voice for a virtual companion. Database 802 can also comprise learned information 804c. In some embodiments, learned information 804c can supplement and/or override scripted information 804b. For example, learned information 804c can comprise information that the user speaks in a particular natural language and/or in a particular accent. A MR system (e.g., MR system 212, 300) can learn this language and/or accent through audio recordings of the user speaking (e.g., via machine-learning), and may modify a scripted voice recording and/or generate new voice recordings to synthesize into human speech with an appropriate language and/or accent. Database 802 can further comprise information from user prompts 804d. 
User prompts 804d can comprise information obtained directly from the user. For example, a virtual companion may ask a user questions as part of an initialization process (e.g., the virtual companion can “introduce” itself to the user, and ask questions that may be typical of an introduction). In some embodiments, some or all of the information contained in 804b-804d may also be represented in memory graph 804a.”  ¶ [0073]: “Information stored in database 802 can be used to present a large volume of detailed and personalized information to a user. For example, a user can ask a virtual companion “Where did I stay when I went to London last year?” Database 802 and/or memory graph 804a can be queried, and a virtual companion can tell the user what hotel the user stayed at based on information collected on the user.”  ¶ [0074]: “Environment module 808 can also be used to present a virtual companion in a MRE in a seamless manner, such that the virtual companion appears as a real companion in the real environment. For example, environment module 808 can determine the presence of an empty chair near the user. When the user sits down, a MR system (e.g., MR system 212, 300) can display a human avatar as inhabiting the same space as the user and sitting down in the empty chair as well. Similarly, when a user walks around, a human avatar can be displayed as moving with the user, and the human avatar can be displayed as avoiding physical obstacles like a chair, and generally respecting the physical environment (e.g., traversing up a set of stairs instead of walking through them).”  ¶ [0075]: “User observation module 814 can also be used to present a virtual companion in a MRE in a seamless manner, such that the presented emotional state of the virtual companion mirrors (or at least approximates) that of the user, determined as described above based on explicit and/or implicit cues from the user. 
For example, user observation module 814 can determine a user's general mood (e.g., determining that a user is happy based on an inward facing camera that captures information about the user smiling), and the virtual companion can mirror the user's behavior (e.g., the virtual companion can also be displayed as smiling).”   ¶ [0076]: “In some embodiments, database 802, environment observation module 808, and user observation module 814 can provide information that can be combined to present a seamless virtual companion experience in the MRE that the user inhabits. In some embodiments, sensors on a MR system (e.g., MR system 212, 300) allow a virtual companion to present information in a user's MRE, in some instances without requiring any prompting from the user. For example, a MR system (e.g., MR system 212, 300) can determine that a user is discussing accommodations in London with another person (e.g., microphones on a MR system detect an audio signal that is transcribed and sent to a natural language processor, and cameras on the MR system detect and identify a person in the field of view of the user) and that the user is attempting to recall information (e.g., an inward facing camera on a MR system detects the user's eyes looking upwards). Database 802 can then be accessed and the contextual information be used from the environment observation module 808 and the user observation module 814 to determine which hotel the user stayed at during their previous trip to London. This information can then be presented to the user in real-time in an unobtrusive and accessible manner (e.g., via a virtual text bubble that is displayed to the user, or via an information card held up by a virtual companion). 
In other embodiments, a virtual companion can present information (learned explicitly and/or implicitly) to a user in their MRE through explicit prompts by the user (e.g., the user may ask the virtual companion where they stayed in London).”   ¶ [0077]: “In some embodiments, a virtual companion can interact with a user and the user's MRE. For example, a virtual companion can present itself as a virtual avatar of a dog, and the user can play fetch with the virtual companion. The user can throw a virtual or physical stick, and the virtual companion can be presented as moving in the user's inhabited physical environment and respecting obstacles in the physical environment (e.g., by moving around the obstacles). In another example, a MR system (e.g., MR system 212, 300) can connect to other devices (e.g., a smart lightbulb), and the user can request that the virtual companion turn on the lights. A virtual companion that can access data provided by a MR system (e.g., MR system 212, 300) has many benefits. For example, information can be continuously recorded by the MR system without intervention by the user (whether a virtual companion is currently being displayed or not). Similarly, information can be presented to the user without user intervention based on the continuously recorded information.”  ¶ [0079]: “Referring to FIG. 9A, a human user (“Alex”) is shown sitting on a couch in his real living room; he is wearing a wearable computing system (e.g., MR system 212, 300), and this system creates a mesh of the room and objects around him as shown in FIG. 9B. Also referring to FIG. 9B, a virtual companion (who can be named “Aya”) appears, looking a like a hologram in the depicted illustration. Referring to FIGS. 
9C-9E, in this embodiment, Aya notices that the room is unusually dark (e.g., via cameras on a MR system 212, 300), drawing on observations of Alex's preferences (e.g., observations stored and associated in a memory graph 701, 804a), and turns up the actual/physical lights in the room for Alex (e.g., via a wireless connection to a smart lightbulb). Aya proceeds to scan the environment and understand its context (e.g., using SLAM and sensors on a MR system 212, 300). The scene is segmented, objects are detected, and are stored in Aya's memory, which may be termed a “Lifestream”, which is depicted as an association of information nodes to the right of FIGS. 9C-9I. A Lifestream can correspond to memory graph 701, 804a. In one embodiment, a Lifestream may be defined as the theoretical perfect data set that captures the total experiential flow of a person (e.g., from birth through death) including both physical and virtual observations and experiences.”  ¶ [0080]: “Referring to FIG. 9F, Alex looks at Aya and asks: “Aya, what was playing at the Pink Floyd concert last summer that I liked?” Referring to FIG. 9G, Aya queries the Lifestream and retrieves a memory of the concert, and says: “Another brick in the wall” she replies. Referring to FIG. 9H, Alex comments: “Wow that's amazing! I never would've remembered without your help. Can you play it on the TV, please?” Aya gets the music video going on the actual TV in the room, or alternatively can present the video via an augmented reality TV for Alex. Audio may be presented to Alex through his headset or other speakers, for example.”  ¶ [0081]: “Referring to FIG. 9I, after their dialog, another actual person (“Erica”) enters the room and greets Alex. Aya scans Erica's face and recognizes her. Aya perceives Alex's reaction to Erica through the cameras positioned adjacent Alex's eyes on the wearable computer system component (e.g., MR system 212, 300), and “sees” that he's happy to see Erica. 
Aya creates another memory snapshot, and stores it in the Lifestream.”  ¶ [0082]: “Referring to FIG. 9J, after Alex says hello to Erica, he lets her know that Aya just reminded him of a song he liked at the Pink Floyd concert. Erica replies that she'd like to hear it, so Alex asks Aya to play the song through the physical speakers in the room so that Erica may also hear. Aya turns on the song for all to hear, tells John that she'll talk to him later, and disappears.”  ¶ [0083]: “Referring to FIGS. 10A and 10B, a virtual, digital, and/or mixed or augmented reality assistant or companion, such as the embodiment highlighted herein, called “Mica”, preferably is configurable to have certain capabilities and traits, such as approachability, empathy, understanding, memory, and expression. Various factors may provide inputs to the presentation of such an AI assistant or companion, such as lighting and realistic glow, realistic locomotion models, user-based reaction models, and attention models. Computer graphics, animation, capture and scanning systems can be critical to creating a lifelike virtual companion, and painstaking detail can be required to achieve a compelling experience. It can take experts from a variety of disciplines to collaborate closely. Get one thing wrong, and the character can be alienating—but when you get things right you can achieve presence and agency. Relative to any other type of character, a digital human arguably is the most difficult, but it is also what users can be most familiar with, and therefore can be a most fulfilling means for developing approachable AI. In mixed reality, as compared with motion pictures, the bar arguably is higher. Interactions with characters are not scripted; by definition, the user should affect how the character responds. For example, after developing an accurate synthetic eye representation system, a character and AI systems can be set up to track gaze with the user. 
Users can have strong opinions on the character, for example, commenting in ways as they would describe a human. This can be important for developing a human-centered interface to AI. With these developments, important attributes may require special focus AI-related systems are designed and evolve. As noted above, it can be desirable for the system to present a persona to the user which is approachable, empathetic, persistent (i.e., have memory and utilize the concept of Lifestream) and be knowledgeable and helpful. These developments can become the gateways to making AI less alienating and more natural to the user. While the challenges of representing humans or characters are many, character embodiments also tap into subtle nuances of knowledge and understanding that all people have.”).
  	Thus, in order to obtain a more user-friendly and more versatile system having the cumulative features and/or functionalities taught by SHUKLA and RABINOVICH for controlling a virtual character in a virtual reality and/or augmented reality environment, it would have been obvious to one of ordinary skill in the art to have modified the method for controlling a virtual character taught by SHUKLA to also incorporate displaying the virtual character on the display of the device in a position in the environment derived from the environmental information, as taught by RABINOVICH.
  	Claims 13-16 are rejected under 35 U.S.C. 103 as being unpatentable over SHUKLA et al. (US 2019/0279642) in view of RIESEN et al. (US 2021/0166461, hereinafter “REISEN”).
 	Regarding claim 13, SHUKLA discloses a computer-implemented method to dynamically generate a virtual character (¶ [0009]: “methods, systems, and programming for a computerized intelligent agent.”  ¶ [0062]: “the automated companion 160-a may include reactive expressions in response to a user's response via, e.g., an interactive video cartoon character (e.g., avatar) displayed on, e.g., a screen as part of a face on the automated companion.”), the computer-implemented method comprising: 
  	displaying the virtual character (e.g., ¶ [0062]: “an interactive video cartoon character (e.g., avatar) displayed”) (e.g., ¶ [0062]: “on, e.g., a screen as part of a face on the automated companion.”) (¶ [0062]: “Other exemplary functionalities of the automated companion 160-a may include reactive expressions in response to a user's response via, e.g., an interactive video cartoon character (e.g., avatar) displayed on, e.g., a screen as part of a face on the automated companion.”   ¶ [0068]: “Such information may include common configuration to be applied to a dialogue (e.g., character of the agent device is an avatar, voice preferred, or a virtual environment to be created for the dialogue, etc.), a current state of the dialogue, a current dialogue history, known user preferences, estimated user intent/emotion/mindset, etc.”); 
 	receiving multi-modal input information (¶ [0046]: “continuously monitored multimodal data indicative of the surrounding of the dialogue, adaptively estimating the mindset/emotion/intent of the participants of the dialogue, and adaptively adjust the conversation strategy based on the dynamically changing information/estimates/contextual information.”) from the user device (¶ [0054]: “communication may involve data in multiple modalities such as audio, video, text, etc. Via a user device, a user can send data (e.g., a request, audio signal representing an utterance of the user, or a video of the scene surrounding the user) and/or receive data (e.g., text or audio response from an agent device). In some embodiments, user data in multiple modalities, upon being received by an agent device or the user interaction engine 140, may be analyzed to understand the human user's speech or gesture so that the user's emotion or intent may be estimated and used to determine a response to the user.”  ¶ [0058]: “In some embodiment, the multi-modal sensor data may first be processed on the user device and important features in different modalities may be extracted and sent to the user interaction engine 140 so that dialogue may be controlled with an understanding of the context. In some embodiments, the raw multi-modal sensor data may be sent directly to the user interaction engine 140 for processing.”  ¶ [0066]: “multi-modal data may be acquired via sensors such as camera, microphone, keyboard, display, speakers, chat bubble, renderer, or other user interface elements.”   ¶ [0119]: “the audio based speech recognition unit 930 and the lip reading based speech recognizer 950 may respectively receive audio and visual signals as input”),
 the multi-modal input information including
 speech information (Abstract: “An audio signal is received that represents a speech of a user engaged in a dialogue.  A visual signal is received that captures the user uttering the speech.  A first speech recognition result is obtained by performing audio based speech recognition based on the audio signal. Based on the visual signal, lip movement of the user is detected and a second speech recognition result is obtained by performing lip reading based speech recognition.”    ¶ [0058]: “the user device 110-a may continually collect multi-modal sensor data related to the user and his/her surroundings, which may be analyzed to detect any information related to the dialogue and used to intelligently control the dialogue in an adaptive manner.”  ¶ [0071]: “To manage the dialogue, the automated companion may activate different sensors during the dialogue to make observation of the user and the surrounding environment. For example, the agent device may acquire multi-modal data about the surrounding environment where the user is in. Such multi-modal data may include audio, visual, or text data. For example, visual data may capture the facial expression of the user. The visual data may also reveal contextual information surrounding the scene of the conversation. For instance, a picture of the scene may reveal that there is a basketball, a table, and a chair, which provides information about the environment and may be leveraged in dialogue management to enhance engagement of the user. 
Audio data may capture not only the speech response of the user but also other peripheral information such as the tone of the response, the manner by which the user utters the response, or the accent of the user.”  ¶ [0110]: “The speech sound detector 1110 is provided to detect, from the input audio data, sounds that likely correspond to human speech activities based on, e.g., models 1120 that characterize human speech sound.”  ¶ [0120]: “Acoustic input data acquired by selected acoustic sensor(s) may then be sent to the audio based speech recognition unit 1530 for speech recognition based on audio data.”),
 facial expression information (Abstract: “A visual signal is received that captures the user uttering the speech.”  Abstract: “Based on the visual signal, lip movement of the user is detected and a second speech recognition result is obtained by performing lip reading based speech recognition.”    ¶ [0062]: “The automated companion may use a camera (320) to observe the user's presence, facial expressions, direction of gaze, surroundings, etc.”  ¶ [0071]: “To manage the dialogue, the automated companion may activate different sensors during the dialogue to make observation of the user and the surrounding environment. For example, the agent device may acquire multi-modal data about the surrounding environment where the user is in. Such multi-modal data may include audio, visual, or text data. For example, visual data may capture the facial expression of the user. The visual data may also reveal contextual information surrounding the scene of the conversation. For instance, a picture of the scene may reveal that there is a basketball, a table, and a chair, which provides information about the environment and may be leveraged in dialogue management to enhance engagement of the user. Audio data may capture not only the speech response of the user but also other peripheral information such as the tone of the response, the manner by which the user utters the response, or the accent of the user.”  ¶ [0120]: “Visual input data acquired by selected visual sensor may then be sent to the lip reading based speech recognition unit 1550 for speech recognition based on visual data.”  ¶ [0120]: “the lip reading based speech recognition unit 1550 performs speech recognition, at 1650, by comparing tracked lip movements (observed in the visual input data) against some lip reading model(s) appropriate for the underlying language for the speech recognition.”),
 and environmental information representing an environment (¶ [0051]: “During a conversation, an agent device may include one or more input mechanisms (e.g., cameras, microphones, touch screens, buttons, etc.) that allow the agent device to capture inputs related to the user or the local environment associated with the conversation. Such inputs may assist the automated companion to develop an understanding of the atmosphere surrounding the conversation (e.g., movements of the user, sound of the environment) and the mindset of the human conversant (e.g., user picks up a ball which may indicates that the user is bored) in order to enable the automated companion to react accordingly and conduct the conversation in a manner that will keep the user interested and engaging.”   ¶ [0055]: “Knowing that the surrounding environment of the user (based on visual information from the user device) shows green trees and lawns,”  ¶ [0058]: “the user device 110-a may continually collect multi-modal sensor data related to the user and his/her surroundings, which may be analyzed to detect any information related to the dialogue and used to intelligently control the dialogue in an adaptive manner.”   ¶ [0071]: “To manage the dialogue, the automated companion may activate different sensors during the dialogue to make observation of the user and the surrounding environment. For example, the agent device may acquire multi-modal data about the surrounding environment where the user is in. Such multi-modal data may include audio, visual, or text data. For example, visual data may capture the facial expression of the user. The visual data may also reveal contextual information surrounding the scene of the conversation. For instance, a picture of the scene may reveal that there is a basketball, a table, and a chair, which provides information about the environment and may be leveraged in dialogue management to enhance engagement of the user. 
Audio data may capture not only the speech response of the user but also other peripheral information such as the tone of the response, the manner by which the user utters the response, or the accent of the user.”   ¶ [0110]: “In some embodiments, depending on application needs, it is possible to also detect other types of sounds such as environmental sounds (beach, street, sports center, etc.), special event sounds (explosion, fire alarm, alerts, etc.). In this case, the 1120 may also include models that can be used to detect different types of sound in the dialogue scene.”); 
 	implementing at least two internal models (e.g., ¶ [0042]: “detecting a spoken language based on multiple model based speech recognition”; ¶ [0094]: “sound models 715”; ¶ [0094]: “speech lip movement models 725”; ¶ [0106]: “a lip detection model 930.” ¶ [0110]: “based on, e.g., models 1120 that characterize human speech sound.”  ¶ [0110]: “models that can be used to detect different types of sound in the dialogue scene.”  ¶ [0126]: “the lip shape/sound model(s)”) to identify characteristics of the multi-modal input information (e.g., ¶ [0110]: “to detect, from the input audio data, sounds that likely correspond to human speech activities based on, e.g., models 1120 that characterize human speech sound.” ¶ [0110]: “models that can be used to detect different types of sound in the dialogue scene.”   ¶ [0095]: “audio cues that reveal human speech activities and video cues related to lip movement that evidences human speech”) (¶ [0094]: “The audio based sound source estimator 710 processes audio data collected from a dialogue scene and estimates one or more sound sources (for speech) based on sound models 715 (e.g., acoustic models for human speech). The visual based sound source estimator 720 is provided for estimating one or more candidate sources (directions in a dialogue scene) of speech activities in a dialogue scene based on visual cues. The visual based sound source estimator 720 processes image data collected from the dialogue scene, analyzes the visual information based on speech lip movement models 725 (e.g., visual models for lip movement in speech in certain languages), and estimates candidate sound source(s) where the human speech is occurring. 
The audio based sound source candidates estimated by the audio based sound source estimator 710 and the visual based sound source estimates from 720 are sent, respectively, to the sound source disambiguation unit 730 so that the estimated sound candidates determined based on different cues may be disambiguated to generate estimated source(s) of sound in a dialogue environment.” ¶ [0095]: “an integrated approach by combining audio and video cues, including audio cues that reveal human speech activities and video cues related to lip movement that evidences human speech. In operation, the visual based sound source estimator 720 receives, at 702 of FIG. 7B, image (video) data acquired from the dialogue scene and processes the video data to detect, at 712, lip movement based on speech lip movement models 725 for recognizing speech activities. In some embodiments, the speech lip movement models to be used for the detection may be selected with respect to a certain language.”  ¶ [0120]: “When the audio based speech recognition unit 1530 receives the audio signals from acoustic sensor(s), it performs, at 1630, speech recognition based on speech recognition models 1540”;  ¶ [0120]: “Similarly, when the lip reading based speech recognizer 1550 receives the visual data (video), it performs, at 1650, speech recognition based on lip reading in accordance with lip reading models 1560.”  ¶ [0120]: “Thus, the lip reading based speech recognition unit 1550 performs speech recognition, at 1650, by comparing tracked lip movements (observed in the visual input data) against some lip reading model(s) appropriate for the underlying language for the speech recognition. The appropriate lip reading model may be selected (from the lip reading models 1560) based on, e.g., an input related to language choice.”  ¶ [0126]: “Mapping lip shape and/or lip movement to a sound may involve viseme analysis, where a viseme may correspond to a generic image that is used to describe a particular sound. 
As commonly known, a viseme may be a visual equivalent of a phoneme or acoustic speech sound in a spoken language and can be used by hearing-impaired person to view sounds visually. To derive a viseme, the analysis needed may depend on the underlying spoken language. In the present teaching, the lip shape/sound model(s) from 1960 may be used for determining sounds corresponding to lip shapes. In recognizing visemes associated with a spoken language, an appropriate lip shape/sound model may be selected according to a known current language.”    ¶ [0095]: “automatic speech recognition (ASR)”;  ¶ [0130]: “ In some embodiments, the integration may also be performed at an even lower level. For instance, the integration may be performed based on phonemes estimated based on sound (audio based) or visemes recognized based on lip reading (visual based).  FIG. 21 illustrates an exemplary scheme for integrating audio based speech recognition (ASR) and the lip reading based speech recognition, according to a different embodiment of the present teaching. As shown, speech signal is processed respectively via ASR and video data are processed via lip recognition. In some embodiments, the ASR generates phonemes and the lip reading generates visemes. To integrate the recognition results, comparison is performed between the recognition results. For example, the phonemes from the ASR are converted into visemes and similarity between the visemes converted from phonemes from the ASR and that from the lip reading are assessed. In some embodiments, if they are similar, e.g., the similarity exceeds a certain level, the recognition result from ASR is accepted because it is supported by the lip reading result. If the similarity level of the visemes from ASR and lip reading is below a set level, the visemes may be accepted but the recognition result may be associated with a low confidence score. 
In some embodiments, the automated dialogue companion or the agent may request the user engaged in the dialogue to speak louder so that the next round of recognition may be based on better signals. In some situations, if the similarity is low according to some criterion, the visemes may not be accepted and the automated dialogue companion may react to the situation by letting the user know that what is spoken cannot be discerned and ask the user to say it again.”); 
 	inspecting the characteristics identified by the at least two internal models to determine (¶ [0130]: “similarity between the visemes converted from phonemes from the ASR and that from the lip reading are assessed.”) whether a first identified characteristic (e.g., ¶ [0130]: “phonemes estimated based on sound (audio based)”; ¶ [0130]: “ASR generates phonemes”) is within a threshold similarity (¶ [0130]: “the phonemes from the ASR are converted into visemes and similarity between the visemes converted from phonemes from the ASR and that from the lip reading are assessed.”  ¶ [0130]: “if they are similar, e.g., the similarity exceeds a certain level,”  ¶ [0130]: “If the similarity level of the visemes from ASR and lip reading is below a set level”) to a second identified characteristic (¶ [0130]: “visemes recognized based on lip reading (visual based)”;  ¶ [0130]: “the lip reading generates visemes”) (¶ [0130]: “In some embodiments, the integration may also be performed at an even lower level. For instance, the integration may be performed based on phonemes estimated based on sound (audio based) or visemes recognized based on lip reading (visual based).  FIG. 21 illustrates an exemplary scheme for integrating audio based speech recognition (ASR) and the lip reading based speech recognition, according to a different embodiment of the present teaching. As shown, speech signal is processed respectively via ASR and video data are processed via lip recognition. In some embodiments, the ASR generates phonemes and the lip reading generates visemes. To integrate the recognition results, comparison is performed between the recognition results. For example, the phonemes from the ASR are converted into visemes and similarity between the visemes converted from phonemes from the ASR and that from the lip reading are assessed. 
In some embodiments, if they are similar, e.g., the similarity exceeds a certain level, the recognition result from ASR is accepted because it is supported by the lip reading result. If the similarity level of the visemes from ASR and lip reading is below a set level, the visemes may be accepted but the recognition result may be associated with a low confidence score. In some embodiments, the automated dialogue companion or the agent may request the user engaged in the dialogue to speak louder so that the next round of recognition may be based on better signals. In some situations, if the similarity is low according to some criterion, the visemes may not be accepted and the automated dialogue companion may react to the situation by letting the user know that what is spoken cannot be discerned and ask the user to say it again.”); 
 	comparing the first identified characteristic (claim 2: “the first speech recognition result includes a plurality of phonemes”) and the second identified characteristic (claim 2: “the second speech recognition result includes a plurality of visemes.”) (¶ [0130]: “To integrate the recognition results, comparison is performed between the recognition results. For example, the phonemes from the ASR are converted into visemes and similarity between the visemes converted from phonemes from the ASR and that from the lip reading are assessed. In some embodiments, if they are similar, e.g., the similarity exceeds a certain level, the recognition result from ASR is accepted because it is supported by the lip reading result.” ¶ [0069]: “responses from a user”  NOTE: In other words, a determined response from a user comprises the integrated speech recognition results having the identified similar phonemes and visemes.) against information specific to the virtual character (e.g., ¶ [0069]: “paths which may be taken depending on a response detected from a user”;  NOTE: In other words, the recognized speech (which includes the particular phonemes and visemes) of each response from a user is compared against the paths in the dialogue tree.) included in a virtual character knowledge model to select (e.g., ¶ [0069]: “a dialogue tree of an on-going dialogue”; ¶ [0069]: “paths which may be taken depending on a response detected from a user”;  NOTE: In other words, the recognized speech (which includes the particular phonemes and visemes) of each response from a user is compared against the paths in the dialogue tree.) (¶ [0069]: “FIG. 4B illustrates a part of a dialogue tree of an on-going dialogue with paths taken based on interactions between the automated companion and a user, according to an embodiment of the present teaching. 
In this illustrated example, the dialogue management at layer 3 (of the automated companion) may predict multiple paths with which a dialogue, or more generally an interaction, with a user may proceed. In this example, each branch from a node may represent possible responses from a user. As shown in this example, at node 1, the automated companion may face with three separate paths which may be taken depending on a response detected from a user. If the user responds with an affirmative response, dialogue tree 400 may proceed from node 1 to node 2. At node 2, a response may be generated for the automated companion in response to the affirmative response from the user and may then be rendered to the user, which may include audio, visual, textual, haptic, or any combination thereof.”  ¶ [0070]: “If, at node 1, the user responses negatively, the path is for this stage is from node 1 to node 10. If the user responds, at node 1, with a “so-so” response (e.g., not negative but also not positive), dialogue tree 400 may proceed to node 3, at which a response from the automated companion may be rendered and there may be three separate possible responses from the user, “No response,” “Positive Response,” and “Negative response,” corresponding to nodes 5, 6, and 7, respectively. Depending on the user's actual response with respect to the automated companion's response rendered at node 3, the dialogue management at layer 3 may then follow the dialogue accordingly. For instance, if the user responds at node 3 with a positive response, the automated companion moves to respond to the user at node 6. Similarly, depending on the user's reaction to the automated companion's response at node 6, the user may further respond with an answer that is correct. In this case, the dialogue state moves from node 6 to node 8, etc. In this illustrated example, the dialogue state during this period moved from node 1, to node 3, to node 6, and to node 8. 
The traverse through nodes 1, 3, 6, and 8 forms a path consistent with the underlying conversation between the automated companion and a user. As seen in FIG. 4B, the path representing the dialogue is represented by the solid lines connecting nodes 1, 3, 6, and 8, whereas the paths skipped during a dialogue is represented by the dashed lines.”      ¶ [0084]: “In each dialogue of a certain topic, the dialogue manager 510 bases its control of the dialogue on relevant dialogue tree(s) that may or may not be associated with the topic (e.g., may inject small talks to enhance engagement). To generate a response to a user in a dialogue, the dialogue manager 510 may also consider additional information such as a state of the user, the surrounding of the dialogue scene, the emotion of the user, the estimated mindsets of the user and the agent, and the known preferences of the user (utilities).”  ¶ [0056]: “In some embodiments, such inputs from the user's site and processing results thereof may also be transmitted to the user interaction engine 140 for facilitating the user interaction engine 140 to better understand the specific situation associated with the dialogue so that the user interaction engine 140 may determine the state of the dialogue, emotion/mindset of the user, and to generate a response that is based on the specific situation of the dialogue and the intended purpose of the dialogue (e.g., for teaching a child the English vocabulary). For example, if information received from the user device indicates that the user appears to be bored and become impatient, the user interaction engine 140 may determine to change the state of dialogue to a topic that is of interest of the user (e.g., based on the information from the user information database 130) in order to continue to engage the user in the conversation.”)
 a selected characteristic (¶ [0054]: “to determine a response to the user.”  ¶ [0056]: “determine the state of the dialogue, emotion/mindset of the user, and to generate a response that is based on the specific situation of the dialogue and the intended purpose of the dialogue”) based on determining that the first identified characteristic includes the threshold number of similar features of the second identified characteristic of the identified characteristics (¶ [0130]: “if they are similar, e.g., the similarity exceeds a certain level, the recognition result from ASR is accepted because it is supported by the lip reading result”)  (¶ [0056]: “In some embodiments, such inputs from the user's site and processing results thereof may also be transmitted to the user interaction engine 140 for facilitating the user interaction engine 140 to better understand the specific situation associated with the dialogue so that the user interaction engine 140 may determine the state of the dialogue, emotion/mindset of the user, and to generate a response that is based on the specific situation of the dialogue and the intended purpose of the dialogue (e.g., for teaching a child the English vocabulary). For example, if information received from the user device indicates that the user appears to be bored and become impatient, the user interaction engine 140 may determine to change the state of dialogue to a topic that is of interest of the user (e.g., based on the information from the user information database 130) in order to continue to engage the user in the conversation.”  ¶ [0067]: “The estimated mindsets of parties, whether related to humans or the automated companion (machine), may be relied on by the dialogue management at layer 3, to determine, e.g., how to carry on a conversation with a human conversant. How each dialogue progresses often represent a human user's preferences. Such preferences may be captured dynamically during the dialogue at utilities (layer 5). 
As shown in FIG. 4A, utilities at layer 5 represent evolving states that are indicative of parties' evolving preferences, which can also be used by the dialogue management at layer 3 to decide the appropriate or intelligent way to carry on the interaction.”   ¶ [0069]: “FIG. 4B illustrates a part of a dialogue tree of an on-going dialogue with paths taken based on interactions between the automated companion and a user, according to an embodiment of the present teaching. In this illustrated example, the dialogue management at layer 3 (of the automated companion) may predict multiple paths with which a dialogue, or more generally an interaction, with a user may proceed. In this example, each node may represent a point of the current state of the dialogue and each branch from a node may represent possible responses from a user. As shown in this example, at node 1, the automated companion may face with three separate paths which may be taken depending on a response detected from a user. If the user responds with an affirmative response, dialogue tree 400 may proceed from node 1 to node 2. At node 2, a response may be generated for the automated companion in response to the affirmative response from the user and may then be rendered to the user, which may include audio, visual, textual, haptic, or any combination thereof.”   ¶ [0084]: “In each dialogue of a certain topic, the dialogue manager 510 bases its control of the dialogue on relevant dialogue tree(s) that may or may not be associated with the topic (e.g., may inject small talks to enhance engagement). To generate a response to a user in a dialogue, the dialogue manager 510 may also consider additional information such as a state of the user, the surrounding of the dialogue scene, the emotion of the user, the estimated mindsets of the user and the agent, and the known preferences of the user (utilities).”  ¶ [0085]: “An output of DM 510 corresponds to an accordingly determined response to the user. 
To deliver a response to the user, the DM 510 may also formulate a way that the response is to be delivered. The form in which the response is to be delivered may be determined based on information from multiple sources, e.g., the user's emotion (e.g., if the user is a child who is not happy, the response may be rendered in a gentle voice), the user's utility (e.g., the user may prefer speech in certain accent similar to his parents'), or the surrounding environment that the user is in (e.g., noisy place so that the response needs to be delivered in a high volume). DM 510 may output the response determined together with such delivery parameters.”  ¶ [0130]: “In some embodiments, the integration may also be performed at an even lower level. For instance, the integration may be performed based on phonemes estimated based on sound (audio based) or visemes recognized based on lip reading (visual based).  FIG. 21 illustrates an exemplary scheme for integrating audio based speech recognition (ASR) and the lip reading based speech recognition, according to a different embodiment of the present teaching. As shown, speech signal is processed respectively via ASR and video data are processed via lip recognition. In some embodiments, the ASR generates phonemes and the lip reading generates visemes. To integrate the recognition results, comparison is performed between the recognition results. For example, the phonemes from the ASR are converted into visemes and similarity between the visemes converted from phonemes from the ASR and that from the lip reading are assessed. In some embodiments, if they are similar, e.g., the similarity exceeds a certain level, the recognition result from ASR is accepted because it is supported by the lip reading result. If the similarity level of the visemes from ASR and lip reading is below a set level, the visemes may be accepted but the recognition result may be associated with a low confidence score. 
In some embodiments, the automated dialogue companion or the agent may request the user engaged in the dialogue to speak louder so that the next round of recognition may be based on better signals. In some situations, if the similarity is low according to some criterion, the visemes may not be accepted and the automated dialogue companion may react to the situation by letting the user know that what is spoken cannot be discerned and ask the user to say it again.”); 
 	accessing a library of potential actions associated with the virtual character (e.g., in FIG. 4A, the “Database” storing “Character config,” “Voice config”;   ¶ [0068]: “In some embodiments, some information that may be shared via the internal API may be accessed from an external database. For example, certain configurations related to a desired character for the agent device (a duck) may be accessed from, e.g., an open source database, that provide parameters (e.g., parameters to visually render the duck and/or parameters needed to render the speech from the duck).”  and/or ¶ [0087]: “verbal response generation and/or behavior response generation, as depicted in FIG. 5.”  NOTE:  A database may be reasonably interpreted as being “a library.”) to select an action that matches the selected characteristic (e.g., ¶ [0087]: “To deliver a response, a deliverable form of the response may be generated via, e.g., verbal response generation and/or behavior response generation, as depicted in FIG. 5. Such a response in its determined deliverable form(s) may then be used by a renderer to actual render the response in its intended form(s). For a deliverable form in a natural language, the text of the response may be used to synthesize a speech signal via, e.g., text to speech techniques, in accordance with the delivery parameters (e.g., volume, accent, style, etc.). For any response or part thereof, that is to be delivered in a non-verbal form(s), e.g., with a certain expression, the intended non-verbal expression may be translated into, e.g., via animation, control signals that can be used to control certain parts of the agent device (physical representation of the automated companion) to perform certain mechanical movement to deliver the non-verbal expression of the response, e.g., nodding head, shrug shoulders, or whistle. In some embodiments, to deliver a response, certain software components may be invoked to render a different facial expression of the agent device. 
Such rendition(s) of the response may also be simultaneously carried out by the agent (e.g., speak a response with a joking voice and with a big smile on the face of the agent).”   ¶ [0090]: “On the output side of the processing level, when a certain response strategy is determined, such strategy may be translated into specific actions to take by the automated companion to respond to the other party. Such action may be carried out by either deliver some audio response or express certain emotion or attitude via certain gesture. When the response is to be delivered in audio, text with words that need to be spoken are processed by a text to speech module to produce audio signals and such audio signals are then sent to the speakers to render the speech as a response. In some embodiments, the speech generated based on text may be performed in accordance with other parameters, e.g., that may be used to control to generate the speech with certain tones or voices. If the response is to be delivered as a physical action, such as a body movement realized on the automated companion, the actions to be taken may also be instructions to be used to generate such body movement.”)   (¶ [0068]: “In some embodiments, through the internal API, various layers (2-5) may access information created by or stored at other layers to support the processing. Such information may include common configuration to be applied to a dialogue (e.g., character of the agent device is an avatar, voice preferred, or a virtual environment to be created for the dialogue, etc.), a current state of the dialogue, a current dialogue history, known user preferences, estimated user intent/emotion/mindset, etc. In some embodiments, some information that may be shared via the internal API may be accessed from an external database. 
For example, certain configurations related to a desired character for the agent device (a duck) may be accessed from, e.g., an open source database, that provide parameters (e.g., parameters to visually render the duck and/or parameters needed to render the speech from the duck).”  ¶ [0069]: “At node 2, a response may be generated for the automated companion in response to the affirmative response from the user and may then be rendered to the user, which may include audio, visual, textual, haptic, or any combination thereof.”  ¶ [0081]: “Layer 1 also handles information rendering of a response from the automated dialogue companion to a user. In some embodiments, the rendering is performed by an agent device, e.g., 160-a and examples of such rendering include speech, expression which may be facial or physical acts performed. For instance, an agent device may render a text string received from the user interaction engine 140 (as a response to the user) to speech so that the agent device may utter the response to the user. In some embodiments, the text string may be sent to the agent device with additional rendering instructions such as volume, tone, pitch, etc. which may be used to convert the text string into a sound wave corresponding to an utterance of the content in a certain manner. In some embodiments, a response to be delivered to a user may also include animation, e.g., utter a response with an attitude which may be delivered via, e.g., a facial expression or a physical act such as raising one arm, etc. In some embodiments, the agent may be implemented as an application on a user device. In this situation, rendering of a response from the automated dialogue companion is implemented via the user device, e.g., 110-a (not shown in FIG. 5).”   ¶ [0083]: “The multimodal data understanding generated at layer 2 may be used by DM 510 to determine how to respond. 
To enhance engagement and user experience, the DM 510 may also determine a response based on the estimated mindsets of the user and of the agent from layer 4 as well as the utilities of the user engaged in the dialogue from layer 5. The mindsets of the parties involved in a dialogue may be estimated based on information from Layer 2 (e.g., estimated emotion of a user) and the progress of the dialogue. In some embodiments, the mindsets of a user and of an agent may be estimated dynamically during the course of a dialogue and such estimated mindsets may then be used to learn, together with other data, utilities of users. The learned utilities represent preferences of users in different dialogue scenarios and are estimated based on historic dialogues and the outcomes thereof.”   ¶ [0084]: “In each dialogue of a certain topic, the dialogue manager 510 bases its control of the dialogue on relevant dialogue tree(s) that may or may not be associated with the topic (e.g., may inject small talks to enhance engagement). To generate a response to a user in a dialogue, the dialogue manager 510 may also consider additional information such as a state of the user, the surrounding of the dialogue scene, the emotion of the user, the estimated mindsets of the user and the agent, and the known preferences of the user (utilities).”  ¶ [0085]: “An output of DM 510 corresponds to an accordingly determined response to the user. To deliver a response to the user, the DM 510 may also formulate a way that the response is to be delivered. 
The form in which the response is to be delivered may be determined based on information from multiple sources, e.g., the user's emotion (e.g., if the user is a child who is not happy, the response may be rendered in a gentle voice), the user's utility (e.g., the user may prefer speech in certain accent similar to his parents'), or the surrounding environment that the user is in (e.g., noisy place so that the response needs to be delivered in a high volume). DM 510 may output the response determined together with such delivery parameters.”  ¶ [0086]: “In some embodiments, the delivery of such determined response is achieved by generating the deliverable form(s) of each response in accordance with various parameters associated with the response. In a general case, a response is delivered in the form of speech in some natural language. A response may also be delivered in speech coupled with a particular nonverbal expression as a part of the delivered response, such as a nod, a shake of the head, a blink of the eyes, or a shrug. There may be other forms of deliverable form of a response that is acoustic but not verbal, e.g., a whistle.”  ¶ [0087]: “To deliver a response, a deliverable form of the response may be generated via, e.g., verbal response generation and/or behavior response generation, as depicted in FIG. 5. Such a response in its determined deliverable form(s) may then be used by a renderer to actual render the response in its intended form(s). For a deliverable form in a natural language, the text of the response may be used to synthesize a speech signal via, e.g., text to speech techniques, in accordance with the delivery parameters (e.g., volume, accent, style, etc.). 
For any response or part thereof, that is to be delivered in a non-verbal form(s), e.g., with a certain expression, the intended non-verbal expression may be translated into, e.g., via animation, control signals that can be used to control certain parts of the agent device (physical representation of the automated companion) to perform certain mechanical movement to deliver the non-verbal expression of the response, e.g., nodding head, shrug shoulders, or whistle. In some embodiments, to deliver a response, certain software components may be invoked to render a different facial expression of the agent device. Such rendition(s) of the response may also be simultaneously carried out by the agent (e.g., speak a response with a joking voice and with a big smile on the face of the agent).”),
 the action including an animation to be performed by the virtual character and associated audio (¶ [0081]: “Layer 1 also handles information rendering of a response from the automated dialogue companion to a user. In some embodiments, the rendering is performed by an agent device, e.g., 160-a and examples of such rendering include speech, expression which may be facial or physical acts performed. For instance, an agent device may render a text string received from the user interaction engine 140 (as a response to the user) to speech so that the agent device may utter the response to the user. In some embodiments, the text string may be sent to the agent device with additional rendering instructions such as volume, tone, pitch, etc. which may be used to convert the text string into a sound wave corresponding to an utterance of the content in a certain manner. In some embodiments, a response to be delivered to a user may also include animation, e.g., utter a response with an attitude which may be delivered via, e.g., a facial expression or a physical act such as raising one arm, etc.”   ¶ [0086]: “In some embodiments, the delivery of such determined response is achieved by generating the deliverable form(s) of each response in accordance with various parameters associated with the response. In a general case, a response is delivered in the form of speech in some natural language. A response may also be delivered in speech coupled with a particular nonverbal expression as a part of the delivered response, such as a nod, a shake of the head, a blink of the eyes, or a shrug.”    ¶ [0087]: “To deliver a response, a deliverable form of the response may be generated via, e.g., verbal response generation and/or behavior response generation, as depicted in FIG. 5. Such a response in its determined deliverable form(s) may then be used by a renderer to actual render the response in its intended form(s). 
For a deliverable form in a natural language, the text of the response may be used to synthesize a speech signal via, e.g., text to speech techniques, in accordance with the delivery parameters (e.g., volume, accent, style, etc.). For any response or part thereof, that is to be delivered in a non-verbal form(s), e.g., with a certain expression, the intended non-verbal expression may be translated into, e.g., via animation, control signals that can be used to control certain parts of the agent device (physical representation of the automated companion) to perform certain mechanical movement to deliver the non-verbal expression of the response, e.g., nodding head, shrug shoulders, or whistle. In some embodiments, to deliver a response, certain software components may be invoked to render a different facial expression of the agent device. Such rendition(s) of the response may also be simultaneously carried out by the agent (e.g., speak a response with a joking voice and with a big smile on the face of the agent).”); and 
	displaying the virtual character (e.g., ¶ [0062]: “an interactive video cartoon character (e.g., avatar) displayed”) in the environment (e.g., ¶ [0062]: “on, e.g., a screen as part of a face on the automated companion.”  NOTE:  Since the user and the automated companion interact with one another in a “face-to-face” dialogue, clearly the display screen of the automated companion is in the environment with the user.) (¶ [0062]: “Other exemplary functionalities of the automated companion 160-a may include reactive expressions in response to a user's response via, e.g., an interactive video cartoon character (e.g., avatar) displayed on, e.g., a screen as part of a face on the automated companion.”   ¶ [0068]: “Such information may include common configuration to be applied to a dialogue (e.g., character of the agent device is an avatar, voice preferred, or a virtual environment to be created for the dialogue, etc.), a current state of the dialogue, a current dialogue history, known user preferences, estimated user intent/emotion/mindset, etc.”)
performing the action (e.g., ¶ [0081]: “a response to be delivered to a user may also include animation, e.g., utter a response with an attitude which may be delivered via, e.g., a facial expression or a physical act such as raising one arm, etc.”  ¶ [0062]: “Other exemplary functionalities of the automated companion 160-a may include reactive expressions in response to a user's response via, e.g., an interactive video cartoon character (e.g., avatar) displayed on, e.g., a screen as part of a face on the automated companion.”)
 and outputting the associated audio (e.g., ¶ [0081]: “a response to be delivered to a user may also include animation, e.g., utter a response with an attitude which may be delivered via, e.g., a facial expression or a physical act such as raising one arm, etc.”; ¶ [0062]: “The exemplary automated companion 160-a as shown in FIG. 3B may also be controlled to “speak” via a speaker (330).”) (¶ [0081]: “Layer 1 also handles information rendering of a response from the automated dialogue companion to a user. In some embodiments, the rendering is performed by an agent device, e.g., 160-a and examples of such rendering include speech, expression which may be facial or physical acts performed. For instance, an agent device may render a text string received from the user interaction engine 140 (as a response to the user) to speech so that the agent device may utter the response to the user. In some embodiments, the text string may be sent to the agent device with additional rendering instructions such as volume, tone, pitch, etc. which may be used to convert the text string into a sound wave corresponding to an utterance of the content in a certain manner. In some embodiments, a response to be delivered to a user may also include animation, e.g., utter a response with an attitude which may be delivered via, e.g., a facial expression or a physical act such as raising one arm, etc.”   ¶ [0086]: “In a general case, a response is delivered in the form of speech in some natural language. A response may also be delivered in speech coupled with a particular nonverbal expression as a part of the delivered response, such as a nod, a shake of the head, a blink of the eyes, or a shrug.”    ¶ [0087]: “To deliver a response, a deliverable form of the response may be generated via, e.g., verbal response generation and/or behavior response generation, as depicted in FIG. 5. 
Such a response in its determined deliverable form(s) may then be used by a renderer to actual render the response in its intended form(s). For a deliverable form in a natural language, the text of the response may be used to synthesize a speech signal via, e.g., text to speech techniques, in accordance with the delivery parameters (e.g., volume, accent, style, etc.). For any response or part thereof, that is to be delivered in a non-verbal form(s), e.g., with a certain expression, the intended non-verbal expression may be translated into, e.g., via animation, control signals that can be used to control certain parts of the agent device (physical representation of the automated companion) to perform certain mechanical movement to deliver the non-verbal expression of the response, e.g., nodding head, shrug shoulders, or whistle. In some embodiments, to deliver a response, certain software components may be invoked to render a different facial expression of the agent device. Such rendition(s) of the response may also be simultaneously carried out by the agent (e.g., speak a response with a joking voice and with a big smile on the face of the agent).”). 
  	SHUKLA fails to disclose: 	embedding a link to the web browser of the user device, the link linking the web browser to an application executing on the user device; receiving an indication from the user device that the link has been selected; transmitting a stream of data from the application representing information to the web browser to generate the virtual character; and displaying the virtual character on the web browser of the user device.
 	Whereas SHUKLA may not be explicit as to the above limitations, RIESEN teaches:
 	embedding a link (¶ [0069]: “the control elements”; ¶ [0141]: “common HTML5 or CSS control elements 22, 24 (see FIG. 2) which are provided by the program for animating the avatar”; ¶ [0150]: “HTML5 or CSS control elements 22, 24 in the form of buttons”) in the web browser (¶ [0150]: “web browser.”  ¶ [0150]: “the graphical user interface 20 has HTML5 or CSS control elements 22, 24 in the form of buttons”) of the user device (e.g., ¶ [0060]: “the local data processing installation may be, for example, a personal computer, a portable computer, in particular a laptop or a tablet computer, or a mobile device, for example a mobile telephone with computer functionality (smartphone).”) (¶ [0024]: “carried out in a web browser running on the data processing installation.”  ¶ [0025]: “a web browser should be understood as meaning, in particular, a computer program which is designed to present electronic hypertext documents or websites in the World Wide Web. The web browser is designed, in particular, in such a manner that HTML-based documents (HTML=Hypertext Markup Language) and/or CSS-based documents (CSS=Cascading Style Sheets) can be interpreted and presented.”  ¶ [0066]: “In a particularly preferred manner, the avatar to be loaded and/or the control data to be received can be or will be selected in advance using an operating element. The operating element is, for example, a button, a selection field, a text input and/or a voice control unit. This may be provided in a manner known per se via a graphical user interface of the data processing installation.”  ¶ [0069]: “In particular, the control elements and the further control elements are HTML and/or CSS control elements.”   ¶ [0079]: “The method is preferably carried out in a web browser running on the data processing installation. In this case, the web browser is designed as described above, in particular, and has the functionalities and interfaces described above, in particular. 
For users, this in turn has the advantage that, apart from conventionally present standard software, for example a web browser, no further programs are required, and a computer program which, during execution by a computer, causes the latter to carry out the method according to the invention may be present as a web application. Accordingly, it is possible to generate control data for animating avatars in a manner based purely on a web browser.” ¶ [0133]: “In a first step 11, a program for animating the avatar, which is provided as a web application on a web server, is started by calling up a website in a web browser.”  ¶ [0150]: “FIG. 2 shows the graphical user interface 20 of the program for animating the avatar, which was described in connection with FIG. 1 and is executed in a web browser.” ¶ [0150]: “For control, the graphical user interface 20 has HTML5 or CSS control elements 22, 24 in the form of buttons and selection fields.”),
 the link linking the web browser (¶ [0066]: “the avatar to be loaded and/or the control data to be received can be or will be selected in advance using an operating element. The operating element is, for example, a button, a selection field, a text input and/or a voice control unit. This may be provided in a manner known per se via a graphical user interface of the data processing installation.” ¶ [0141]: “In step 16, any desired data streams of control data, which cause the avatar to move, can be initiated and checked via common HTML5 or CSS control elements 22, 24 (see FIG. 2) which are provided by the program for animating the avatar.”) to an application (e.g., ¶ [0069]: “a web application.”; ¶ [0150]: “FIG. 2 shows the graphical user interface 20 of the program for animating the avatar, which was described in connection with FIG. 1 and is executed in a web browser.”  ¶ [0141]: “In step 16, any desired data streams of control data, which cause the avatar to move, can be initiated and checked via common HTML5 or CSS control elements 22, 24 (see FIG. 2) which are provided by the program for animating the avatar.”)  executing on the user device (e.g., ¶ [0069]: “a computer program which, during execution by a computer, causes the latter to carry out the method according to the invention may be present as a web application.”) (¶ [0024]: “a computer program which, during execution by a computer, causes the latter to carry out the method according to the invention can be provided as a website.  In other words, the computer program which, during execution by a computer, causes the latter to carry out the method according to the invention may be present as a web application.”  ¶ [0025]: “The web browser additionally preferably has a runtime environment for programs, in particular a Java runtime environment.” ¶ [0079]: “The method is preferably carried out in a web browser running on the data processing installation. 
In this case, the web browser is designed as described above, in particular, and has the functionalities and interfaces described above, in particular. For users, this in turn has the advantage that, apart from conventionally present standard software, for example a web browser, no further programs are required, and a computer program which, during execution by a computer, causes the latter to carry out the method according to the invention may be present as a web application. Accordingly, it is possible to generate control data for animating avatars in a manner based purely on a web browser.”   ¶ [0133]: “In a first step 11, a program for animating the avatar, which is provided as a web application on a web server, is started by calling up a website in a web browser.”  ¶ [0150]: “FIG. 2 shows the graphical user interface 20 of the program for animating the avatar, which was described in connection with FIG. 1 and is executed in a web browser.” ¶ [0150]: “For control, the graphical user interface 20 has HTML5 or CSS control elements 22, 24 in the form of buttons and selection fields.”  ¶ [0151]: “a web presenter which can be implemented as a pure web application or in the form of a website and, after the loading operation, can be completely executed on a local data processing installation.”);
  	receiving an indication from the user device that the link has been selected (¶ [0153]: “As soon as the button is operated,”) (¶ [0066]: “In a particularly preferred manner, the avatar to be loaded and/or the control data to be received can be or will be selected in advance using an operating element. The operating element is, for example, a button, a selection field, a text input and/or a voice control unit. This may be provided in a manner known per se via a graphical user interface of the data processing installation.”  ¶ [0067]: “Such operating elements can be used by the user to deliberately select avatars which are animated using the control data of interest in each case.”  ¶ [0139]: “In step 14, control data can now be selected from a database 15 available on a remote web server via conventional user interfaces provided by the program for animating the avatar and can be transferred via the Internet.”  ¶ [0141]: “In step 16, any desired data streams of control data, which cause the avatar to move, can be initiated and checked via common HTML5 or CSS control elements 22, 24 (see FIG. 2) which are provided by the program for animating the avatar.”  ¶ [0153]: “As soon as the button is operated, an avatar which was defined and/or selected in advance and was loaded with the opening of the website is animated using the arriving control data.”); 
 	transmitting a stream of data (¶ [0137]: “control data arriving via a receiving unit of the program”) from the application (¶ [0133]: “a program for animating the avatar, which is provided as a web application on a web server”) representing information (¶ [0050]: “a control data record defines the avatar at a particular time.”) to the web browser (¶ [0024]: “carried out in a web browser”;  ¶ [0137]: “control data arriving via a receiving unit of the program”;  ¶ [0142]: “control data arrive, they are transferred, via the receiving unit of the program for animating the avatar, to the graphics unit which continuously recalculates an updated avatar on the basis of the respectively currently transferred control data”;  ¶ [0150]: “the program for animating the avatar, which was described in connection with FIG. 1 and is executed in a web browser.”) to generate the virtual character (¶ [0035]: “rendering the avatar”;  ¶ [0140]: “the control data comprise a plurality of control data records, wherein each control data record defines the avatar at a particular time.”) (¶ [0029]: “(i) transferring a first received control data record to the graphics unit;” ¶ [0030]: “(ii) calculating an updated avatar on the basis of the transferred control data record and rendering the avatar in the graphics unit;” ¶ [0035]: “the control data record(s) define(s) the state of the avatar at a given time. In particular, the control data record(s) directly or indirectly define(s) the positions of the movable control elements of the avatar, for example of bones and/or joints, at a particular time.”  ¶ [0050]: “In particular, the control data comprise one or more control data records, wherein a control data record defines the avatar at a particular time.” ¶ [0136]: “In step 12, a character or an avatar, for example in the form of a head, can therefore be initialized. 
In this case, the avatar is defined by a virtual model in the form of a three-dimensional skeleton comprising a set of hierarchically connected bones, for example a number of 250, and a mesh of vertices which is coupled thereto, and is loaded into a memory area which can be addressed by a graphics unit of the program.”  ¶ [0137]: “assign control data arriving via a receiving unit of the program to one or more bones and/or key images of the avatar.”  ¶ [0139]: “In step 14, control data can now be selected from a database 15 available on a remote web server via conventional user interfaces provided by the program for animating the avatar and can be transferred via the Internet.”  ¶ [0142]: “As soon as control data arrive, they are transferred, via the receiving unit of the program for animating the avatar, to the graphics unit which continuously recalculates an updated avatar on the basis of the respectively currently transferred control data with subsequent rendering of the updated avatar and presents the latter in the web browser on the screen in the form of an animated avatar 17.”   ¶ [0143]: “(i) transferring a first received control data record to the graphics unit;” ¶ [0144]: “(ii) calculating an updated avatar on the basis of the transferred control data record and rendering the avatar in the graphics unit”); 
 	displaying the virtual character on the web browser of the user device (¶ [0031]: “(iii) presenting the updated avatar on an output device;” ¶ [0138]: “presented in a canvas or container 21 (see FIG. 2) on a screen.”  ¶ [0142]: “As soon as control data arrive, they are transferred, via the receiving unit of the program for animating the avatar, to the graphics unit which continuously recalculates an updated avatar on the basis of the respectively currently transferred control data with subsequent rendering of the updated avatar and presents the latter in the web browser on the screen in the form of an animated avatar 17.”  ¶ [0145]: “(iii) presenting the updated avatar in the web browser on the screen;”); 
 	Thus, in order to obtain a more versatile system for dynamically generating and controlling a virtual character having the cumulative features and/or functionality taught by SHUKLA and RIESEN, it would have been obvious to one of ordinary skill in the art to have modified the virtual character control system taught by SHUKLA to include the functionality of controlling the display of the virtual character on a web browser by embedding a link to the web browser of the user device, the link linking the web browser to an application executing on the user device, receiving an indication from the user device that the link has been selected, transmitting a stream of data from the application representing information to the web browser to generate the virtual character, and displaying the virtual character on the web browser of the user device, as taught by RIESEN.
 	Regarding claim 14 (depends on claim 13), RIESEN further discloses:
 	the web browser includes a page displayed on a mobile application executing on the user device (¶ [0149]: “On account of the low data volumes, the avatar can be animated without any problems on mobile devices such as smartphones or tablets, while the control data are obtained from remote web servers via Internet connections.”  ¶ [0150]: “FIG. 2 shows the graphical user interface 20 of the program for animating the avatar, which was described in connection with FIG. 1 and is executed in a web browser.”  ¶ [0151]: “The method described in connection with FIGS. 1 and 2 is therefore a web presenter which can be implemented as a pure web application or in the form of a website and, after the loading operation, can be completely executed on a local data processing installation.”  ¶ [0158]: “In a first step 31, a program for capturing control data for animating an avatar, which is provided as a web application on a web server, is started by calling up a website in a web browser.”   ¶ [0159]: “In a next step 32, WebGL is opened and JavaScript is used to configure a canvas on a website in such a manner that its contents are distinguished from the rest of the website.”).
	Regarding claim 15 (depends on claim 13), SHUKLA discloses: 
 	the at least two internal models includes a speech recognition model capable of parsing a speech sentiment from the speech information (¶ [0082]: “mental (e.g., certain emotion such as stress of the user estimated based on, e.g., the tune of the speech, a facial expression, or a gesture of the user).”  ¶ [0089]: “recognize various emotions of a party based on both visual information from a camera and the synchronized audio information. For instance, a happy emotion may often be accompanied with a smile face and a certain acoustic cue.”) and a facial feature recognition model capable of detecting a facial feature sentiment based on the facial expression information (¶ [0082]: “mental (e.g., certain emotion such as stress of the user estimated based on, e.g., the tune of the speech, a facial expression, or a gesture of the user).”  ¶ [0089]: “recognize various emotions of a party based on both visual information from a camera and the synchronized audio information. For instance, a happy emotion may often be accompanied with a smile face and a certain acoustic cue.”) (¶ [0082]: “Processed features of the multi-modal data may be further processed at layer 2 to achieve language understanding and/or multi-modal data understanding including visual, textual, and any combination thereof. Some of such understanding may be directed to a single modality, such as speech understanding, and some may be directed to an understanding of the surrounding of the user engaging in a dialogue based on integrated information. 
Such understanding may be physical (e.g., recognize certain objects in the scene), perceivable (e.g., recognize what the user said, or certain significant sound, etc.), or mental (e.g., certain emotion such as stress of the user estimated based on, e.g., the tune of the speech, a facial expression, or a gesture of the user).”  ¶ [0089]: “On the input side, the processing level may include speech processing module for performing, e.g., speech recognition based on audio signal obtained from an audio sensor (microphone) to understand what is being uttered in order to determine how to respond. The audio signal may also be recognized to generate text information for further analysis. The audio signal from the audio sensor may also be used by an emotion recognition processing module. The emotion recognition module may be designed to recognize various emotions of a party based on both visual information from a camera and the synchronized audio information. For instance, a happy emotion may often be accompanied with a smile face and a certain acoustic cue. The text information obtained via speech recognition may also be used by the emotion recognition module, as a part of the indication of the emotion, to estimate the emotion involved.”  ¶ [0066]: “In layer 1, multi-modal data may be acquired via sensors such as camera, microphone, keyboard, display, speakers, chat bubble, renderer, or other user interface elements. Such multi-modal data may be analyzed to estimated or infer various features that may be used to infer higher level characteristics such as expression, characters, gesture, emotion, action, attention, intent, etc.”),
 	wherein the selected characteristic is a sentiment common among the speech sentiment and the facial feature sentiment (e.g., ¶ [0056]: “emotion/mindset of the user”;  ¶ [0056]: “the user appears to be bored and become impatient”; ¶ [0072]: “the user appears sad, not smiling, the user's speech is slow with a low voice”;  ¶ [0089]: “recognize various emotions of a party based on both visual information from a camera and the synchronized audio information.”  ¶ [0089]: “a happy emotion may often be accompanied with a smile face and a certain acoustic cue.”) (¶ [0054]: “As illustrated, during a human machine dialogue, a user, as the human conversant in the dialogue, may communicate across the network 120 with an agent device or the user interaction engine 140. Such communication may involve data in multiple modalities such as audio, video, text, etc. Via a user device, a user can send data (e.g., a request, audio signal representing an utterance of the user, or a video of the scene surrounding the user) and/or receive data (e.g., text or audio response from an agent device). In some embodiments, user data in multiple modalities, upon being received by an agent device or the user interaction engine 140, may be analyzed to understand the human user's speech or gesture so that the user's emotion or intent may be estimated and used to determine a response to the user.”   ¶ [0056]: “In some embodiments, such inputs from the user's site and processing results thereof may also be transmitted to the user interaction engine 140 for facilitating the user interaction engine 140 to better understand the specific situation associated with the dialogue so that the user interaction engine 140 may determine the state of the dialogue, emotion/mindset of the user, and to generate a response that is based on the specific situation of the dialogue and the intended purpose of the dialogue (e.g., for teaching a child the English vocabulary). 
For example, if information received from the user device indicates that the user appears to be bored and become impatient, the user interaction engine 140 may determine to change the state of dialogue to a topic that is of interest of the user (e.g., based on the information from the user information database 130) in order to continue to engage the user in the conversation.”   ¶ [0046]: “The present teaching aims to address the deficiencies of the traditional human machine dialogue systems and to provide methods and systems that enables a more effective and realistic human to machine dialogue. The present teaching incorporates artificial intelligence in an automated companion with an agent device in conjunction with the backbone support from a user interaction engine so that the automated companion can conduct a dialogue based on continuously monitored multimodal data indicative of the surrounding of the dialogue, adaptively estimating the mindset/emotion/intent of the participants of the dialogue, and adaptively adjust the conversation strategy based on the dynamically changing information/estimates/contextual information.”  ¶ [0066]: “In layer 1, multi-modal data may be acquired via sensors such as camera, microphone, keyboard, display, speakers, chat bubble, renderer, or other user interface elements. Such multi-modal data may be analyzed to estimated or infer various features that may be used to infer higher level characteristics such as expression, characters, gesture, emotion, action, attention, intent, etc. Such higher level characteristics may be obtained by processing units at layer 2 and the used by components of higher layers, via the internal API as shown in FIG. 4A, to e.g., intelligently infer or estimate additional information related to the dialogue at higher conceptual levels. 
For example, the estimated emotion, attention, or other characteristics of a participant of a dialogue obtained at layer 2 may be used to estimate the mindset of the participant. In some embodiments, such mindset may also be estimated at layer 4 based on additional information, e.g., recorded surrounding environment or other auxiliary information in such surrounding environment such as sound.”  ¶ [0072]: “Based on acquired multi-modal data, analysis may be performed by the automated companion (e.g., by the front end user device or by the backend user interaction engine 140) to assess the attitude, emotion, mindset, and utility of the users. For example, based on visual data analysis, the automated companion may detect that the user appears sad, not smiling, the user's speech is slow with a low voice. The characterization of the user's states in the dialogue may be performed at layer 2 based on multi-model data acquired at layer 1. Based on such detected observations, the automated companion may infer (at 406) that the user is not that interested in the current topic and not that engaged. Such inference of emotion or mental state of the user may, for instance, be performed at layer 4 based on characterization of the multi-modal data associated with the user.”  ¶ [0085]: “The form in which the response is to be delivered may be determined based on information from multiple sources, e.g., the user's emotion (e.g., if the user is a child who is not happy, the response may be rendered in a gentle voice)”  ¶ [0089]: “On the input side, the processing level may include speech processing module for performing, e.g., speech recognition based on audio signal obtained from an audio sensor (microphone) to understand what is being uttered in order to determine how to respond. The audio signal may also be recognized to generate text information for further analysis. The audio signal from the audio sensor may also be used by an emotion recognition processing module. 
The emotion recognition module may be designed to recognize various emotions of a party based on both visual information from a camera and the synchronized audio information. For instance, a happy emotion may often be accompanied with a smile face and a certain acoustic cue. The text information obtained via speech recognition may also be used by the emotion recognition module, as a part of the indication of the emotion, to estimate the emotion involved.”), and 
  	wherein the determined action is determined based on the sentiment (¶ [0054]: “the user's emotion or intent may be estimated and used to determine a response to the user.”;  ¶ [0056]: “determine the state of the dialogue, emotion/mindset of the user, and to generate a response that is based on the specific situation of the dialogue”) (¶ [0054]: “As illustrated, during a human machine dialogue, a user, as the human conversant in the dialogue, may communicate across the network 120 with an agent device or the user interaction engine 140. Such communication may involve data in multiple modalities such as audio, video, text, etc. Via a user device, a user can send data (e.g., a request, audio signal representing an utterance of the user, or a video of the scene surrounding the user) and/or receive data (e.g., text or audio response from an agent device). In some embodiments, user data in multiple modalities, upon being received by an agent device or the user interaction engine 140, may be analyzed to understand the human user's speech or gesture so that the user's emotion or intent may be estimated and used to determine a response to the user.”   ¶ [0056]: “In some embodiments, such inputs from the user's site and processing results thereof may also be transmitted to the user interaction engine 140 for facilitating the user interaction engine 140 to better understand the specific situation associated with the dialogue so that the user interaction engine 140 may determine the state of the dialogue, emotion/mindset of the user, and to generate a response that is based on the specific situation of the dialogue and the intended purpose of the dialogue (e.g., for teaching a child the English vocabulary). 
For example, if information received from the user device indicates that the user appears to be bored and become impatient, the user interaction engine 140 may determine to change the state of dialogue to a topic that is of interest of the user (e.g., based on the information from the user information database 130) in order to continue to engage the user in the conversation.”  ¶ [0046]: “The present teaching aims to address the deficiencies of the traditional human machine dialogue systems and to provide methods and systems that enables a more effective and realistic human to machine dialogue. The present teaching incorporates artificial intelligence in an automated companion with an agent device in conjunction with the backbone support from a user interaction engine so that the automated companion can conduct a dialogue based on continuously monitored multimodal data indicative of the surrounding of the dialogue, adaptively estimating the mindset/emotion/intent of the participants of the dialogue, and adaptively adjust the conversation strategy based on the dynamically changing information/estimates/contextual information.”   ¶ [0067]: “The estimated mindsets of parties, whether related to humans or the automated companion (machine), may be relied on by the dialogue management at layer 3, to determine, e.g., how to carry on a conversation with a human conversant.”  ¶ [0073]: “To respond to the user's current state (not engaged), the automated companion may determine to perk up the user in order to better engage the user. In this illustrated example, the automated companion may leverage what is available in the conversation environment by uttering a question to the user at 408: “Would you like to play a game?” Such a question may be delivered in an audio form as speech by converting text to speech, e.g., using customized voices individualized for the user. 
In this case, the user may respond by uttering, at 410, “Ok.” Based on the continuously acquired multi-model data related to the user, it may be observed, e.g., via processing at layer 2, that in response to the invitation to play a game, the user's eyes appear to be wandering, and in particular that the user's eyes may gaze towards where the basketball is located. At the same time, the automated companion may also observe that, once hearing the suggestion to play a game, the user's facial expression changes from “sad” to “smiling.” Based on such observed characteristics of the user, the automated companion may infer, at 412, that the user is interested in basketball.”  ¶ [0074]: “Based on the acquired new information and the inference based on that, the automated companion may decide to leverage the basketball available in the environment to make the dialogue more engaging for the user yet still achieving the educational goal for the user. In this case, the dialogue management at layer 3 may adapt the conversion to talk about a game and leverage the observation that the user gazed at the basketball in the room to make the dialogue more interesting to the user yet still achieving the goal of, e.g., educating the user. In one example embodiment, the automated companion generates a response, suggesting the user to play a spelling game” (at 414) and asking the user to spell the word “basketball.””  ¶ [0075]: “Given the adaptive dialogue strategy of the automated companion in light of the observations of the user and the environment, the user may respond providing the spelling of word “basketball.” (at 416). Observations are continuously made as to how enthusiastic the user is in answering the spelling question. If the user appears to respond quickly with a brighter attitude, determined based on, e.g., multi-modal data acquired when the user is answering the spelling question, the automated companion may infer, at 418, that the user is now more engaged. 
To further encourage the user to actively participate in the dialogue, the automated companion may then generate a positive response “Great job!” with instruction to deliver this response in a bright, encouraging, and positive voice to the user.”  ¶ [0083]: “The multimodal data understanding generated at layer 2 may be used by DM 510 to determine how to respond. To enhance engagement and user experience, the DM 510 may also determine a response based on the estimated mindsets of the user and of the agent from layer 4 as well as the utilities of the user engaged in the dialogue from layer 5. The mindsets of the parties involved in a dialogue may be estimated based on information from Layer 2 (e.g., estimated emotion of a user) and the progress of the dialogue. In some embodiments, the mindsets of a user and of an agent may be estimated dynamically during the course of a dialogue and such estimated mindsets may then be used to learn, together with other data, utilities of users.”  ¶ [0085]: “An output of DM 510 corresponds to an accordingly determined response to the user. To deliver a response to the user, the DM 510 may also formulate a way that the response is to be delivered. The form in which the response is to be delivered may be determined based on information from multiple sources, e.g., the user's emotion (e.g., if the user is a child who is not happy, the response may be rendered in a gentle voice), the user's utility (e.g., the user may prefer speech in certain accent similar to his parents'), or the surrounding environment that the user is in (e.g., noisy place so that the response needs to be delivered in a high volume). DM 510 may output the response determined together with such delivery parameters.”  ¶ [0090]: “On the output side of the processing level, when a certain response strategy is determined, such strategy may be translated into specific actions to take by the automated companion to respond to the other party. 
Such action may be carried out by either deliver some audio response or express certain emotion or attitude via certain gesture. When the response is to be delivered in audio, text with words that need to be spoken are processed by a text to speech module to produce audio signals and such audio signals are then sent to the speakers to render the speech as a response. In some embodiments, the speech generated based on text may be performed in accordance with other parameters, e.g., that may be used to control to generate the speech with certain tones or voices. If the response is to be delivered as a physical action, such as a body movement realized on the automated companion, the actions to be taken may also be instructions to be used to generate such body movement.”).
 	Regarding claim 16 (depends on claim 13), SHUKLA discloses: 
 	the at least two internal models include a prior knowledge model (e.g., ¶ [0064]: “a hierarchy of preferences”) capable of retrieving prior knowledge information comprising information relating to previous engagement with a user (e.g., ¶ [0064]: “such a hierarchy of preferences may evolve over time based on the party's choices made and likings exhibited during conversations.”) (¶ [0064]: “The term “utility” is hereby defined as preferences of a party identified based on states detected associated with dialogue histories. Utility may be associated with a party in a dialogue, whether the party is a human, the automated companion, or other intelligent devices. A utility for a particular party may represent different states of a world, whether physical, virtual, or even mental. For example, a state may be represented as a particular path along which a dialog walks through in a complex map of the world. At different instances, a current state evolves into a next state based on the interaction between multiple parties. States may also be party dependent, i.e., when different parties participate in an interaction, the states arising from such interaction may vary. A utility associated with a party may be organized as a hierarchy of preferences and such a hierarchy of preferences may evolve over time based on the party's choices made and likings exhibited during conversations. Such preferences, which may be represented as an ordered sequence of choices made out of different options, is what is referred to as utility. The present teaching discloses method and system by which an intelligent automated companion is capable of learning, through a dialogue with a human conversant, the user's utility.”  ¶ [0083]: “The multimodal data understanding generated at layer 2 may be used by DM 510 to determine how to respond. 
To enhance engagement and user experience, the DM 510 may also determine a response based on the estimated mindsets of the user and of the agent from layer 4 as well as the utilities of the user engaged in the dialogue from layer 5. The mindsets of the parties involved in a dialogue may be estimated based on information from Layer 2 (e.g., estimated emotion of a user) and the progress of the dialogue. In some embodiments, the mindsets of a user and of an agent may be estimated dynamically during the course of a dialogue and such estimated mindsets may then be used to learn, together with other data, utilities of users. The learned utilities represent preferences of users in different dialogue scenarios and are estimated based on historic dialogues and the outcomes thereof.”),
 wherein the selected characteristic (¶ [0083]: “how to respond”) is selected based on the prior knowledge information processed using the prior knowledge model (¶ [0083]: “The multimodal data understanding generated at layer 2 may be used by DM 510 to determine how to respond. To enhance engagement and user experience, the DM 510 may also determine a response based on the estimated mindsets of the user and of the agent from layer 4 as well as the utilities of the user engaged in the dialogue from layer 5. The mindsets of the parties involved in a dialogue may be estimated based on information from Layer 2 (e.g., estimated emotion of a user) and the progress of the dialogue. In some embodiments, the mindsets of a user and of an agent may be estimated dynamically during the course of a dialogue and such estimated mindsets may then be used to learn, together with other data, utilities of users. The learned utilities represent preferences of users in different dialogue scenarios and are estimated based on historic dialogues and the outcomes thereof.”  ¶ [0084]: “In each dialogue of a certain topic, the dialogue manager 510 bases its control of the dialogue on relevant dialogue tree(s) that may or may not be associated with the topic (e.g., may inject small talks to enhance engagement). To generate a response to a user in a dialogue, the dialogue manager 510 may also consider additional information such as a state of the user, the surrounding of the dialogue scene, the emotion of the user, the estimated mindsets of the user and the agent, and the known preferences of the user (utilities).”).
 	Claim 17 is rejected under 35 U.S.C. 103 as being unpatentable over SHUKLA et al. (US 2019/0279642) in view of RIESEN et al. (US 2021/0166461), further in view of ORVALHO et al. (US 2019/0279410).
 	Regarding claim 17 (depends on claim 13), whereas neither SHUKLA nor RIESEN is explicit as to the following limitations, ORVALHO clearly teaches: 
 	sharing an embedded link (¶ [0006]: “generating a selectable link for transmission as part of an electronic message”;  ¶ [0075]: “providing a link to the model to be included in an electronic message to another user”) to a plurality of users (¶ [0066]: “other client devices.”  ¶ [0085]: “The second user may be more than one user since, in some embodiments, the instant message in the example may be sent (with the link included) to multiple recipient users of the player application 720.”) via a network (¶ [0066]: “via a network 606 (e.g., the Internet).”) (¶ [0006]: “a method for creating a customized animatable 3D model for use in an electronic communication between at least two users, the method comprising: receiving input from a first user, the first user using a mobile device, the input being in the form of at least one of an audio stream and a visual stream, the visual stream including at least one image or video; and based on an animatable 3D model and the at least one of the audio stream and the visual stream, automatically generating a dynamically customized animation of the animatable 3D model of a virtual character corresponding to the first user, the generating of the dynamically customized animation comprising performing dynamic conversion of the input, in the form of the at least one of the audio stream and the visual stream, into an expression stream and corresponding time information. The method may further include generating a selectable link for transmission as part of an electronic message, the selectable link linking to the expression stream and the corresponding time information; and causing display of the dynamically customized animatable 3D model to the second user. The generating of the selectable link and the causing display may be automatically performed or performed in response to user action.”  ¶ [0066]: “The messaging system 600 may include multiple client devices 602, each of which hosts a number of applications including a messaging client application 604. Each messaging client application 604 may be communicatively coupled to other instances of the messaging client application 604 and a messaging server system 608 via a network 606 (e.g., the Internet). As used herein, the term “client device” may refer to any machine that interfaces to a communications network (such as network 606) to obtain resources from one or more server systems or other client devices.”  ¶ [0075]: “creating a dynamically customized animatable 3D model of a virtual character for a user and providing a link to the model to be included in an electronic message to another user.”  ¶ [0079]: “provide the link to the customized animatable 3D model, the link being part of an instant message, for example, that a recipient of the instant message can click on or otherwise select, and have the dynamically customized 3D animatable model being displayed to the recipient. The link may be to a location in the cloud-based system, so that the link can be provided in an instant message, for example, so the recipient can view a 3D model avatar that automatically and dynamically mimics the determined movements of the sender.”);
 	receiving a selection (¶ [0085]: “in response to selection of the selectable link by the second user”) from any of a set of devices (¶ [0085]: “The second user may be more than one user since, in some embodiments, the instant message in the example may be sent (with the link included) to multiple recipient users”) indicating that the embedded link has been selected (¶ [0083]: “The selectable link 724 in the electronic message 726 may link to the expression stream and the corresponding time information 718. This may be a link to a cloud computing system (e.g., cloud 728) to which the expression stream and the corresponding time information 718 was transmit or streamed.”   NOTE: In order for the expression stream and the corresponding time information 718 to be transmitted or streamed from the cloud 728 (FIG. 7) in response to selection of the selectable link by the second user, the cloud 728 must, by necessity, receive an indication of the selection of the selectable link by the second user, and, as such, “receiving a selection” is inherent.) (¶ [0006]: “a method for creating a customized animatable 3D model for use in an electronic communication between at least two users, the method comprising: receiving input from a first user, the first user using a mobile device, the input being in the form of at least one of an audio stream and a visual stream, the visual stream including at least one image or video; and based on an animatable 3D model and the at least one of the audio stream and the visual stream, automatically generating a dynamically customized animation of the animatable 3D model of a virtual character corresponding to the first user, the generating of the dynamically customized animation comprising performing dynamic conversion of the input, in the form of the at least one of the audio stream and the visual stream, into an expression stream and corresponding time information. 
The method may further include generating a selectable link for transmission as part of an electronic message, the selectable link linking to the expression stream and the corresponding time information; and causing display of the dynamically customized animatable 3D model to the second user. The generating of the selectable link and the causing display may be automatically performed or performed in response to user action.”   ¶ [0079]: “provide the link to the customized animatable 3D model, the link being part of an instant message, for example, that a recipient of the instant message can click on or otherwise select, and have the dynamically customized 3D animatable model being displayed to the recipient. The link may be to a location in the cloud-based system, so that the link can be provided in an instant message, for example, so the recipient can view a 3D model avatar that automatically and dynamically mimics the determined movements of the sender.” ¶ [0084]: “In the example in FIG. 7, the path 730 shows at a high level, from the standpoint of a first user 734 and a second user (not shown for space reasons), an instant message 726 including a link 724 plus other content in the instant message 732 which may be included by the first user 734.”  ¶ [0085]: “At the play 708 stage, automatically or in response to selection of the selectable link by the second user who received the electronic message via a player application 720 on a mobile device 722, causing display of the dynamically customized animatable 3D model to the second user. The second user may be more than one user since, in some embodiments, the instant message in the example may be sent (with the link included) to multiple recipient users of the player application 720.”); and
 	responsive to receiving the selection (¶ [0085]: “in response to selection of the selectable link by the second user who received the electronic message”), transmitting a stream of data (¶ [0079]: “an expression stream and corresponding time information (info) 718”;  FIG. 7: As is clearly shown in FIG. 7, the “Expression Stream + Time Info” is transmitted through cloud 728 to mobile device 722.) to the device of the set of devices that sent the selection (¶ [0081]: “the recipient user’s device”;  ¶ [0085]: “to the second user.” ) to display the virtual character on the device (¶ [0085]: “causing display of the dynamically customized animatable 3D model to the second user.”) (¶ [0006]: “a method for creating a customized animatable 3D model for use in an electronic communication between at least two users, the method comprising: receiving input from a first user, the first user using a mobile device, the input being in the form of at least one of an audio stream and a visual stream, the visual stream including at least one image or video; and based on an animatable 3D model and the at least one of the audio stream and the visual stream, automatically generating a dynamically customized animation of the animatable 3D model of a virtual character corresponding to the first user, the generating of the dynamically customized animation comprising performing dynamic conversion of the input, in the form of the at least one of the audio stream and the visual stream, into an expression stream and corresponding time information. The method may further include generating a selectable link for transmission as part of an electronic message, the selectable link linking to the expression stream and the corresponding time information; and causing display of the dynamically customized animatable 3D model to the second user. 
The generating of the selectable link and the causing display may be automatically performed or performed in response to user action.”   ¶ [0066]: “The messaging system 600 may include multiple client devices 602, each of which hosts a number of applications including a messaging client application 604. Each messaging client application 604 may be communicatively coupled to other instances of the messaging client application 604 and a messaging server system 608 via a network 606 (e.g., the Internet). As used herein, the term “client device” may refer to any machine that interfaces to a communications network (such as network 606) to obtain resources from one or more server systems or other client devices.”  ¶ [0067]: “In the example shown in FIG. 6, each messaging client application 604 is able to communicate and exchange data with another messaging client application 604 and with the messaging server system 608 via the network 606. The data exchanged between messaging client applications 604, and between a messaging client application 604 and the messaging server system 608, may include functions (e.g., commands to invoke functions) as well as payload data (e.g., text, audio, video or other multimedia data).”  ¶ [0079]: “provide the link to the customized animatable 3D model, the link being part of an instant message, for example, that a recipient of the instant message can click on or otherwise select, and have the dynamically customized 3D animatable model being displayed to the recipient. 
The link may be to a location in the cloud-based system, so that the link can be provided in an instant message, for example, so the recipient can view a 3D model avatar that automatically and dynamically mimics the determined movements of the sender.”   ¶ [0082]: “send that customized 3D animatable model to the recipient, e.g., via a selectable link, in an instant message via an instant messaging service,”  ¶ [0083]: “For the send 706 stage, the method may further include, automatically or in response to an action from the first user, generating a selectable link (e.g., 724) for transmission as part of an electronic message (e.g., instant message 726). The selectable link 724 in the electronic message 726 may link to the expression stream and the corresponding time information 718. This may be a link to a cloud computing system (e.g., cloud 728) to which the expression stream and the corresponding time information 718 was transmit or streamed.”  ¶ [0085]: “At the play 708 stage, automatically or in response to selection of the selectable link by the second user who received the electronic message via a player application 720 on a mobile device 722, causing display of the dynamically customized animatable 3D model to the second user. The second user may be more than one user since, in some embodiments, the instant message in the example may be sent (with the link included) to multiple recipient users of the player application 720.”  ¶ [0026]: “introducing the use of animatable 3D models of virtual characters (also known as “avatars”) in electronic messaging. Users of the electronic messaging can be represented by the animatable 3D models.”).
 	Thus, in order to obtain a more versatile method/system for controlling and displaying a virtual character having the cumulative features and/or functionalities taught by SHUKLA, RIESEN and ORVALHO, it would have been obvious to one of ordinary skill in the art to have modified the method for controlling a virtual character taught by the combination of SHUKLA and RIESEN to also incorporate sharing an embedded link to a plurality of users via a network, receiving a selection from any of a set of devices indicating that the link has been selected, and responsive to receiving the selection, transmitting the stream of data to the user device of the set of devices that sent the selection to display the virtual character on the user device, as is clearly taught in the virtual character animation instant messaging method disclosed by ORVALHO. 
 	Claim 18 is rejected under 35 U.S.C. 103 as being unpatentable over SHUKLA et al. (US 2019/0279642) in view of RIESEN et al. (US 2021/0166461), further in view of ORVALHO et al. (US 2019/0279410), further still in view of DIRKSEN et al. (US 2019/0035132).
	Regarding claim 18 (depends on claim 13), ORVALHO further discloses that the method further comprises: 
 	transmitting a first batch of the stream of data (e.g., ¶ [0079]: “an expression stream and corresponding time information (info) 718,”  ¶ [0079]: “an animation script may be generated based on the expression(s) determined from the input audio and/or visual stream. The animation script may be encoded onto an encoded stream”;  ¶ [0081]: “the created customizable 3D animatable model is downloaded once to the recipient user's device“; ¶ [0081]: “small files”) at a first time (e.g., ¶ [0079]: “an expression stream and corresponding time information (info) 718,”  ¶ [0079]: “in real-time”;   ¶ [0081]: “as the audio and/or visual stream is captured from the message sender”;  ¶ [0052]: “the animatable object is dynamically and automatically generated in real-time based on a dynamic user input, for example from a video signal from a camera system.”  ¶ [0055]: “It is to be understood that each operation of the method 300 may be performed in real-time, such that a dynamic user input such as a video signal is permitted to be input to automatically generate a dynamic 3D model that follows a morphology of the user input in real-time.”)  (¶ [0081]: “the created customizable 3D animatable model is downloaded once to the recipient user's device (e.g., mobile device, PC, laptop, etc.) and as the audio and/or visual stream is captured from the message sender, only small files (as mentioned above) would need to be sent, significantly saving bandwidth, reducing latency, etc.”  ¶ [0082]: “send that customized 3D animatable model to the recipient, e.g., via a selectable link, in an instant message via an instant messaging service,”  claim 14: “the animatable 3D model is customizable such that the customized animatable 3D model can be generated therefrom, and the animatable model is: downloaded, for customization processing, from a cloud-based system to the mobile device of the second user.”  NOTE: In other words, the 3D animated model is initially downloaded and a first batch of the ), 
 the first batch including information to initially generate the virtual character (¶ [0081]: “the created customizable 3D animatable model”) on the display of the device (¶ [0085]: “causing display of the dynamically customized animatable 3D model to the second user.”) (¶ [0079]: “For the convert 704 stage, based on an animatable 3D model and the at least one of an audio stream 712 and the visual stream (see e.g., 714), automatically generating a dynamically customized animation of the animatable 3D model of a virtual character corresponding to the first user, the generating of the dynamically customized animation comprising performing dynamic conversion of the input, in the form of the at least one of an audio stream 712 and a visual stream (see e.g., 714), into an expression stream and corresponding time information (info) 718, using the expression decomposer 716 in this example. In various embodiments, since the animatable 3D model of a virtual character is a computer graphic representation having a geometry or mesh, which may be controlled by a rig or control structure, an animation script may be generated based on the expression(s) determined from the input audio and/or visual stream. The animation script may be encoded onto an encoded stream locally or in the cloud and may be sent to a cloud-based system for performing the customized animation of the animatable 3D model of the user, in real-time.”  ¶ [0081]: “the created customizable 3D animatable model is downloaded once to the recipient user's device (e.g., mobile device, PC, laptop, etc.) 
and as the audio and/or visual stream is captured from the message sender, only small files (as mentioned above) would need to be sent, significantly saving bandwidth, reducing latency, etc.”  ¶ [0082]: “send that customized 3D animatable model to the recipient, e.g., via a selectable link, in an instant message via an instant messaging service,”  ¶ [0085]: “At the play 708 stage, automatically or in response to selection of the selectable link by the second user who received the electronic message via a player application 720 on a mobile device 722, causing display of the dynamically customized animatable 3D model to the second user.” ); and 
 	transmitting a second batch of the stream of data (e.g., ¶ [0079]: “an expression stream and corresponding time information (info) 718,”  ¶ [0079]: “an animation script may be generated based on the expression(s) determined from the input audio and/or visual stream. The animation script may be encoded onto an encoded stream”;  ¶ [0081]: “small files”) at a second time after the first time (e.g., ¶ [0079]: “an expression stream and corresponding time information (info) 718,”  ¶ [0079]: “in real-time”;   ¶ [0081]: “as the audio and/or visual stream is captured from the message sender”;  ¶ [0052]: “the animatable object is dynamically and automatically generated in real-time based on a dynamic user input, for example from a video signal from a camera system.”  ¶ [0055]: “It is to be understood that each operation of the method 300 may be performed in real-time, such that a dynamic user input such as a video signal is permitted to be input to automatically generate a dynamic 3D model that follows a morphology of the user input in real-time.”)  (¶ [0079]: “For the convert 704 stage, based on an animatable 3D model and the at least one of an audio stream 712 and the visual stream (see e.g., 714), automatically generating a dynamically customized animation of the animatable 3D model of a virtual character corresponding to the first user, the generating of the dynamically customized animation comprising performing dynamic conversion of the input, in the form of the at least one of an audio stream 712 and a visual stream (see e.g., 714), into an expression stream and corresponding time information (info) 718, using the expression decomposer 716 in this example. In various embodiments, since the animatable 3D model of a virtual character is a computer graphic representation having a geometry or mesh, which may be controlled by a rig or control structure, an animation script may be generated based on the expression(s) determined from the input audio and/or visual stream. 
The animation script may be encoded onto an encoded stream locally or in the cloud and may be sent to a cloud-based system for performing the customized animation of the animatable 3D model of the user, in real-time.  In some embodiments, the animatable 3D model is synced with the user input, and the user input and animation script may be encoded onto an encoded stream that is sent to the cloud-based system to customize movements of the animatable 3D model, and provide the link to the customized animatable 3D model, the link being part of an instant message, for example, that a recipient of the instant message can click on or otherwise select, and have the dynamically customized 3D animatable model being displayed to the recipient. The link may be to a location in the cloud-based system, so that the link can be provided in an instant message, for example, so the recipient can view a 3D model avatar that automatically and dynamically mimics the determined movements of the sender.”  ¶ [0081]: “the created customizable 3D animatable model is downloaded once to the recipient user's device (e.g., mobile device, PC, laptop, etc.) 
and as the audio and/or visual stream is captured from the message sender, only small files (as mentioned above) would need to be sent, significantly saving bandwidth, reducing latency, etc.”  ¶ [0082]: “send that customized 3D animatable model to the recipient, e.g., via a selectable link, in an instant message via an instant messaging service,” ¶ [0085]: “At the play 708 stage, automatically or in response to selection of the selectable link by the second user who received the electronic message via a player application 720 on a mobile device 722, causing display of the dynamically customized animatable 3D model to the second user.”   NOTE:  Since the generation and display of the customized animation are performed in real-time, via real-time conversion of the input audio and/or visual stream into the encoded expression stream and time information, after the 3D animatable model is downloaded, and after an initial batch of the expression stream and corresponding time information is used to correspondingly animate the 3D model, subsequent batches of the expression stream and corresponding time information being generated in real-time from the ongoing input of the audio and visual streams by the first user would need to be transmitted to the recipient's device to continue animating the animatable 3D model according to the new audio and visual stream data being input by the first user.  In other words, in order for the recipient to “view a 3D model avatar that automatically and dynamically mimics the determined movements of the sender” (¶ [0079]), clearly a second batch of the expression stream and corresponding time information must be transmitted after initially transmitting the animatable 3D model and/or an initial (or earlier) batch of the expression stream and corresponding time information.),
 the second batch including information (e.g., ¶ [0079]: “an animation script may be generated based on the expression(s) determined from the input audio and/or visual stream. The animation script may be encoded onto an encoded stream”) to output a first action by the virtual character (¶ [0079]: “performing the customized animation of the animatable 3D model of the user, in real-time”; ¶ [0055]: “performed in real-time, such that a dynamic user input such as a video signal is permitted to be input to automatically generate a dynamic 3D model that follows a morphology of the user input in real-time.”) (¶ [0006]: “based on an animatable 3D model and the at least one of the audio stream and the visual stream, automatically generating a dynamically customized animation of the animatable 3D model of a virtual character corresponding to the first user, the generating of the dynamically customized animation comprising performing dynamic conversion of the input, in the form of the at least one of the audio stream and the visual stream, into an expression stream and corresponding time information.”  ¶ [0079]: “For the convert 704 stage, based on an animatable 3D model and the at least one of an audio stream 712 and the visual stream (see e.g., 714), automatically generating a dynamically customized animation of the animatable 3D model of a virtual character corresponding to the first user, the generating of the dynamically customized animation comprising performing dynamic conversion of the input, in the form of the at least one of an audio stream 712 and a visual stream (see e.g., 714), into an expression stream and corresponding time information (info) 718, using the expression decomposer 716 in this example. 
In various embodiments, since the animatable 3D model of a virtual character is a computer graphic representation having a geometry or mesh, which may be controlled by a rig or control structure, an animation script may be generated based on the expression(s) determined from the input audio and/or visual stream. The animation script may be encoded onto an encoded stream locally or in the cloud and may be sent to a cloud-based system for performing the customized animation of the animatable 3D model of the user, in real-time. In some embodiments, the animatable 3D model is synced with the user input, and the user input and animation script may be encoded onto an encoded stream that is sent to the cloud-based system to customize movements of the animatable 3D model,”    ¶ [0081]: “the created customizable 3D animatable model is downloaded once to the recipient user's device (e.g., mobile device, PC, laptop, etc.) and as the audio and/or visual stream is captured from the message sender, only small files (as mentioned above) would need to be sent, significantly saving bandwidth, reducing latency, etc.”  claim 4: “active movements derived from the user input, in the form of the at least one of the audio stream and the visual stream, are used for the conversion into the expression stream and corresponding time information.”  claim 9: “for the generating of the dynamically customized animation comprising the performing dynamic conversion of the input, the dynamic conversion further comprises determining certain movements to apply to the animatable 3D model based on determining the direction, from the visual stream, at least one eye of the first user is looking.”). 
 	Thus, in order to obtain a more versatile method/system for controlling and displaying a virtual character having the cumulative features and/or functionalities taught by SHUKLA, RIESEN and ORVALHO, it would have been obvious to one of ordinary skill in the art to have modified the method for controlling a virtual character taught by the combination of SHUKLA and RIESEN to also incorporate transmitting a first batch of the stream of data at a first time, the first batch including information to initially generate the virtual character on a display of the user device; and transmitting a second batch of the stream of data at a second time after the first time, as disclosed by ORVALHO. 

 	ORVALHO fails to disclose: “wherein the first batch is discarded at the second time.”
 	However, whereas none of SHUKLA, RIESEN and ORVALHO is explicit as to this limitation, DIRKSEN teaches: 
 	wherein the first batch (¶ [0088]: “a first chunk”… “streamed” … “at VR application startup”) is discarded at the second time (¶ [0088]: “as needed during VR application use, additional chunks can be fetched and” … “chunks no longer needed can be discarded from local storage”) (¶ [0088]: “As mentioned above, character animation data can comprise data pertaining to one or more animation clips. In the example of FIG. 1B, a set of animation clip data includes a simplified rig's joint animation data (e.g., animated joint transforms for the simplified rig's joint hierarchy) and compressed vertex animation data (e.g., vertex offsets for each model control relative to the simplified rig). The compressed vertex animation data is sliced into small chunks (e.g., 256 frames) that can be streamed to a GPU (e.g., GPU 134) asynchronously during runtime without stalling. This animation clip data can be fed from the local machine or streamed from cloud-based servers. In order to ensure that streaming of character animation data to the GPU does not cause hitches, the data import module 118 can implement a centralized scheduler to queue and stream slices of animation as needed. As mentioned above, character animation data can include data pertaining to a plurality of animation clips. In certain embodiments, a first chunk (e.g., 256 frames) of all animation clips can be streamed to GPU memory at VR application startup. In certain embodiments, the data import module 118 can stream character animation data stored locally on a VR device for optimal load performance. This however can lead to a large VR application footprint (e.g., large local storage usage). Alternatively, the data import module 118 can stream the character animation data from a remote server. 
In certain embodiments, as needed during VR application use, additional chunks can be fetched and stored locally in anticipation of being streamed to the GPU, while chunks no longer needed can be discarded from local storage, in a manner that balances availability of local storage and streaming speed between the local machine and cloud-based servers.”).
 	Thus, in order to conserve the availability of local storage on a receiving device, it would have been obvious to one of ordinary skill in the art to have modified the method/system taught by the combination of SHUKLA, RIESEN and ORVALHO so that the first batch is discarded at the second time, as taught by DIRKSEN. 
   	Claim 19 is rejected under 35 U.S.C. 103 as being unpatentable over SHUKLA et al. (US 2019/0279642) in view of RIESEN et al. (US 2021/0166461), further in view of RABINOVICH et al. (US 2020/0111262).
 	Regarding claim 19 (depends on claim 13), whereas neither SHUKLA nor RIESEN is explicit as to the following limitations, RABINOVICH teaches:
 	inspecting environmental information to identify a portion of the environment representative of a floor of the environment (¶ [0058]: “An exemplary environment observation module 702 can receive one or more sensor inputs 704a-704n. Sensor inputs 704a-704n can comprise inputs for SLAM. SLAM can be used by a MR system (e.g., MR system 212, 300) to identify physical features in a physical environment and locate those physical features relative to the physical environment and relative to each other. Simultaneously, the MR system (e.g., MR system 212, 300) can locate itself within the physical environment and relative to the physical features. SLAM can construct an understanding of a user's physical environment, which can allow a MR system (e.g., MR system 212, 300) to create a virtual environment that respects and interacts with a user's physical environment. For example, for a MR system (e.g., MR system 212, 300) to display a virtual AI companion near a user, it can be desirable for the MR system to identify a physical floor of the user's physical environment and display a virtual human avatar as standing on the physical floor. In some embodiments, as a user walks around a room, a virtual human avatar can move with the user (like a physical companion), and it can be desirable for the virtual human avatar to recognize physical obstacles (e.g., a table) so that the virtual human avatar does not appear to walk through the table. In some embodiments, it can be desirable for a virtual human avatar to appear as sitting down when a user sits down. It can therefore be beneficial for SLAM to recognize a physical object as a chair and recognize dimensions of the chair so that a MR system (e.g., MR system 212, 300) can display the virtual human avatar as sitting in the chair. 
Integrating a virtual environment displayed to a user with the user's physical environment can create a seamless experience that feels natural to the user, as if the user was interacting with a physical entity.”); and 
 	positioning the virtual character at a first position above the portion of the environment representative of the floor of the environment (¶ [0058]: “An exemplary environment observation module 702 can receive one or more sensor inputs 704a-704n. Sensor inputs 704a-704n can comprise inputs for SLAM. SLAM can be used by a MR system (e.g., MR system 212, 300) to identify physical features in a physical environment and locate those physical features relative to the physical environment and relative to each other. Simultaneously, the MR system (e.g., MR system 212, 300) can locate itself within the physical environment and relative to the physical features. SLAM can construct an understanding of a user's physical environment, which can allow a MR system (e.g., MR system 212, 300) to create a virtual environment that respects and interacts with a user's physical environment. For example, for a MR system (e.g., MR system 212, 300) to display a virtual AI companion near a user, it can be desirable for the MR system to identify a physical floor of the user's physical environment and display a virtual human avatar as standing on the physical floor. In some embodiments, as a user walks around a room, a virtual human avatar can move with the user (like a physical companion), and it can be desirable for the virtual human avatar to recognize physical obstacles (e.g., a table) so that the virtual human avatar does not appear to walk through the table. In some embodiments, it can be desirable for a virtual human avatar to appear as sitting down when a user sits down. It can therefore be beneficial for SLAM to recognize a physical object as a chair and recognize dimensions of the chair so that a MR system (e.g., MR system 212, 300) can display the virtual human avatar as sitting in the chair. 
Integrating a virtual environment displayed to a user with the user's physical environment can create a seamless experience that feels natural to the user, as if the user was interacting with a physical entity.”).
	Thus, in order to obtain a more versatile augmented reality system for controlling a virtual character having the cumulative features and/or functionalities taught by SHUKLA, RIESEN and RABINOVICH, it would have been obvious to one of ordinary skill in the art to have modified the method for controlling a virtual character taught by the combination of SHUKLA and RIESEN to also incorporate inspecting environmental information to identify a portion of the environment representative of a floor and positioning the virtual character at a first position above the portion of the environment representative of the floor, as is clearly taught by RABINOVICH.
 Allowable Subject Matter
Claim 20 is objected to as being dependent upon a rejected base claim, but would be allowable if rewritten in independent form including all of the limitations of the base claim and any intervening claims.
Conclusion
 	At present, it is not apparent to the examiner which part of the application could serve as a basis for new and allowable claims.   However, should the applicant nevertheless regard some particular matter as patentable, the examiner encourages applicant to appropriately amend the claims to include such matter and to indicate in the REMARKS the difference(s) between the prior art and the claimed invention as well as the significance thereof.
  	Furthermore, should applicant decide to amend the claims, the examiner respectfully requests that applicant indicate in the REMARKS the page(s), line(s) or claim(s) of the originally filed application from which any amendments are derived.   See MPEP § 2163(II)(A) (There is a strong presumption that an adequate written description of the claimed invention is present in the specification as filed, Wertheim, 541 F.2d at 262, 191 USPQ at 96; however, with respect to newly added or amended claims, applicant should show support in the original disclosure for the new or amended claims.).
 	A shortened statutory period for reply to this action is set to expire THREE MONTHS from the mailing date of this action.  Extensions of time may be available under the provisions of 37 CFR 1.136(a).   In no event, however, will the statutory period for reply expire later than SIX MONTHS from the date of this final action. Failure to reply within the set or extended period for reply will, by statute, cause the application to become ABANDONED (35 U.S.C. § 133).  
Relevant Prior Art
   	The following prior art, although not relied upon, is made of record since it is considered pertinent to applicant's disclosure:
 MAES et al. (US 2002/0135618) discloses methods and systems for performing focus detection, referential ambiguity resolution and mood classification in accordance with multi-modal input data.
Contact Information
 		Any inquiry concerning this communication or earlier communications from the examiner should be directed to VINCENT PEREN whose telephone number is (571)270-7781.  The examiner can normally be reached on 10am-6pm M-F.
 		If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, KING POON, can be reached at 571-272-7440.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see http://pair-direct.uspto.gov. Should you have questions on access to the Private PAIR system, please contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.
/VINCENT PEREN/
Examiner, Art Unit 2675

/KING Y POON/
Supervisory Patent Examiner, Art Unit 2675