DETAILED ACTION
Notice of Pre-AIA or AIA Status
The present application, filed on or after March 16, 2013, is being examined under the first inventor to file provisions of the AIA.

Claim Rejections - 35 USC § 103
In the event the determination of the status of the application as subject to AIA 35 U.S.C. 102 and 103 (or as subject to pre-AIA 35 U.S.C. 102 and 103) is incorrect, any correction of the statutory basis for the rejection will not be considered a new ground of rejection if the prior art relied upon, and the rationale supporting the rejection, would be the same under either status.
The following is a quotation of 35 U.S.C. 103 which forms the basis for all obviousness rejections set forth in this Office action:
A patent for a claimed invention may not be obtained, notwithstanding that the claimed invention is not identically disclosed as set forth in section 102, if the differences between the claimed invention and the prior art are such that the claimed invention as a whole would have been obvious before the effective filing date of the claimed invention to a person having ordinary skill in the art to which the claimed invention pertains. Patentability shall not be negated by the manner in which the invention was made.

The factual inquiries for establishing a background for determining obviousness under 35 U.S.C. 103 are summarized as follows:
1. Determining the scope and contents of the prior art.
2. Ascertaining the differences between the prior art and the claims at issue.
3. Resolving the level of ordinary skill in the pertinent art.
4. Considering objective evidence present in the application indicating obviousness or nonobviousness.
Claim(s) 1-9, 11, 12, 14-17 and 24 is/are rejected under 35 U.S.C. 103 as being unpatentable over Hoque et al. U.S. Pub. No. 2014/0356822 in view of Kumar U.S. Patent No. 10,264,214 and Lembersky et al. U.S. Pub. No. 2019/0095775.
Re:  claim 1, Hoque teaches 
1. A processor comprising:  processing circuitry to:  instantiate a virtual agent corresponding to an instance of an application; (“…a webcam and microphone gather audiovisual data in real time during a portion of the coaching session, regarding speech, facial expressions, and gestures… of the human user … During the conversational period, one or more computer processors analyze sensor data (e.g., audiovisual data) in real time.  Based on the sensor data, the processors calculate behavior of a virtual coach, calculate an audiovisual animation of that behavior, and output control signals to cause the display screen and speakers to display the audiovisual animation…”; Hoque, [0027], [0030], [0040])
The virtual agent is instantiated when audiovisual data of the user is gathered using a webcam and a microphone.  This data is analyzed to determine the behavior of the virtual coach.  
receive first data representative of one or more of an audio stream, a text stream, or a video stream associated with one or more user devices communicatively coupled with the instance of the application; (“During the conversational period, one or more computer processors analyze sensor data (e.g., audiovisual data) in real time”; Hoque, [0030])
The audiovisual data (first data) from the user is received (receive first data representative of one or more of an audio stream, a text stream, or a video stream associated with one or more user devices communicatively coupled with an instance of the application).   
analyze the first data to determine an activation condition being achieved; (“During the conversational period, one or more computer processors analyze sensor data (e.g., audiovisual data) in real time.  Based on the sensor data, the processors calculate behavior of a virtual coach, and output control signals to cause the display screen and speakers to display the audiovisual animation.”; Hoque, [0030])
The audiovisual data (first data) is analyzed.  Based on this analysis, the processors determine the behavior of a virtual coach (determine an activation condition being achieved).  
Hoque is silent, however, Kumar teaches generate, based at least in part on the activation condition being achieved, second data representative of a textual output responsive to the first data and corresponding to the virtual agent; (“User interface 400 may be associated with automatically generated audio corresponding to the text prompts and responses generated by the agent (i.e., the agent’s voice)… User interface 400 may additionally present an agent chat transcript 404.  The chat transcript 404 is the text transcript of the conversation between the agent and the user… one or both of an agent representation 402 or agent chat transcript 404 may be displayed in connection with all or some of the elements of user interface 300… The assistant agent 516 receives as input structured conversational data from the natural language understanding processor 512… Based on analysis of the input, the agent 516 may generate a conversational textual reply (i.e., an agent statement 406) and, in certain circumstances, an action.”; Kumar, col. 4, lines 56-59, col. 5, lines 33-35, lines 52-55, col. 8, lines 57-64, Figs. 4-5)
Based on the analysis of the input (first data), the agent (virtual agent) generates a conversational textual reply (generate, based at least in part on the activation condition being achieved, second data representative of a textual output responsive to the first data and corresponding to the virtual agent).  
apply the second data to a text-to-speech algorithm to generate audio data; (“Based on analysis of the input, the agent 516 may generate a conversational textual reply (i.e., an agent statement 406)… The agent statements 406 generated by assistant agent 516 are provided to text-to-speech converter 518.  Text-to-speech converter 518 converts text (i.e., agent statements 406) to audio data containing the corresponding speech.”; Kumar, col. 8, lines 61-64, col. 11, lines 39-45, Figs. 4-5)
A conversational textual reply (second data) is applied to a text-to-speech converter (text-to-speech algorithm) to generate audio data containing the corresponding speech.  Therefore, it would have been obvious to one of ordinary skill in the art at the time of effective filing date to modify the method of Hoque by adding the feature of generate, based at least in part on the activation condition being achieved, second data representative of a textual output responsive to the first data and corresponding to the virtual agent; apply the second data to a text-to-speech algorithm to generate audio data, in order to facilitate customizing a teleconferencing user experience using a virtual agent, as taught by Kumar. (col. 3, lines 42-44)
Kumar and Lembersky teach based at least in part on the audio data, generate graphical data representative of a virtual field of view of a virtual environment from a perspective of a virtual camera, the virtual field of view including a graphical representation of the virtual agent within the virtual environment; (“The audio speech output from text-to-speech converter 518 and the agent statements 406 may be provided to video synthesizer 520 to generate the frames of the video for display at the endpoint… For example, the video synthesizer 520 may generate an agent representation 402 with animation responsive to the audio speech output from the converter 518 (e.g., if lip sync techniques are used to map the representation’s lip movements to the speech or to generate expressions at appropriate times in the speech…).”; Kumar, col. 11, lines 46-52, lines 57-59, Fig. 4)
Fig. 4 illustrates that a video is generated (generate graphical data) of the agent representation (graphical representation of the virtual agent within the virtual environment), based on the output of the text-to-speech converter (based at least in part on the audio data).  Kumar is silent, however, Lembersky teaches a virtual environment captured from a perspective of a virtual camera.
(“The user’s converted text (speech) and mood 110 may then be passed to an AI engine 112 to determine a proper response 114 to the user (e.g., an answer to a question and specific emotion), which results in the proper text and emotional response being sent to a processor 116, which then translates the responsive text back to synthesized speech 118, and also triggers visual display “blend shapes” 120 to morph a face of the AI character or avatar… into a proper facial expression to convey the appropriate emotional response and mouth movement (lip synching) for the response.”; Lembersky, [0022])
The AI character or avatar (virtual agent) is generated (generate graphical data representative of a virtual environment including the virtual agent) showing the proper facial expression to convey the appropriate emotional response (generate image data representative of a rendering of the graphical data).  Lembersky can be combined with Kumar such that an avatar of Lembersky is the animated agent of Kumar.  Therefore, it would have been obvious to one of ordinary skill in the art at the time of effective filing date to modify the method of Hoque by adding the feature of based at least in part on the audio data, generate graphical data representative of a virtual field of view of a virtual environment from a perspective of a virtual camera, the virtual field of view including a graphical representation of the virtual agent within the virtual environment, in order to provide an automated emotion detection and response system that responds in the most intuitive way to keep the user engaged in the most natural manner, as taught by Lembersky. ([0027])  
Hoque teaches cause a synchronized presentation of a rendering of the graphical data and an audio output corresponding to the audio data using the instance of the application. (“Processors control the apparent facial behavior of the virtual coach. For example, the processors:  (1) achieve lip-sync by using phonemes generated by CereprocTM software while generating the output voice;…”; Hoque, [0040])
The agent is rendered and processors achieve lip-sync for the rendered agent (synchronized presentation of a rendering of the graphical data and an audio output) using phonemes to generate the output voice of the agent (corresponding to the audio data using the instance of the application).
Re:  claim 2, Hoque is silent, however, Kumar teaches
2. The processor of claim 1, wherein the application is at least one of a video conferencing application, an in-cabin application of a vehicle, a food or beverage ordering application, a computer aided design (CAD) application, a customer service application, a web service application, a smart speaker or smart display application, a retail application, a financial application, or a food service application. (“Fig. 5 shows components of a video conferencing system 500 including an interactive virtual assistant system 502… a session is created when a conversation is started between the interactive virtual assistant system 502 and a user at the endpoint 106.”; Kumar, col. 5, lines 62-66, Figs. 1 and 5)
Fig. 1 illustrates the video conferencing system server running the video conference application.  Therefore, it would have been obvious to one of ordinary skill in the art at the time of effective filing date to modify the method of Hoque by adding the feature of the application is at least one of a video conferencing application, in order to facilitate customizing a teleconferencing user experience using a virtual agent, as taught by Kumar. (col. 3, lines 42-44)
Re:  claim 3, Hoque teaches 
3. The processor of claim 1, wherein the instantiation of the virtual agent is based at least in part on an invite sent to a computing device hosting the virtual agent. (“During the conversational period, a webcam and a microphone gather audiovisual data in real time during a portion of a coaching session, regarding speech, facial expressions, and gestures… of a human user… During the conversational period, one or more computer processors analyze sensor data (e.g., audiovisual data) in real time… Based on the sensor data, the processors calculate the behavior of the virtual coach, calculate an audiovisual animation of that behavior, and output control signals to cause the display screen and speakers to display the audiovisual animation. ”; Hoque, [0027], [0030])
When the conversational period starts, the webcam and the microphone start gathering audiovisual data of the user (the instantiation of the virtual agent is based at least in part on an invite sent to a computing device hosting the virtual agent), which is sent to the computing device for analysis.  Based on the analysis, the behavior of the virtual coach is determined.  
Re:  claim 4, Hoque teaches 
4. The processor of claim 1, wherein the instantiation of the virtual agent is based at least in part on third data representative of at least one of a textual trigger, an audible trigger, or a visual trigger. (“During the conversational period, a webcam and a microphone gather audiovisual data in real time during a portion of a coaching session, regarding speech, facial expressions, and gestures… of a human user… During the conversational period, one or more computer processors analyze sensor data (e.g., audiovisual data) in real time… Based on the sensor data, the processors calculate the behavior of the virtual coach, calculate an audiovisual animation of that behavior, and output control signals to cause the display screen and speakers to display the audiovisual animation. ”; Hoque, [0027], [0030])  
When the conversational period starts, the webcam and the microphone start gathering audiovisual (audio trigger and visual trigger) data of the user (the instantiation of the virtual agent is based at least in part on a textual trigger, an audible trigger, or a visual trigger), which is sent to the computing device for analysis.  Based on the analysis, the behavior of the virtual coach is determined.
Re:  claim 5, Hoque teaches
5. The processor of claim 1, wherein the activation condition includes input corresponding to a user using at least two input modes. (“… a webcam and microphone gather audiovisual data in real time during a portion of a coaching session, regarding speech, facial expressions, and gestures… of a human user… During the conversational period, one or more computer processors analyze sensor data (e.g., audiovisual data) in real time.  Based on the sensor data, the processors calculate behavior of a virtual coach, calculate an audiovisual animation of that behavior, and output control signals to cause the display screen and speakers to display the audiovisual animation.”; Hoque, [0027], [0030])
The webcam gathers video data (input mode 1) and the microphone gathers audio data (input mode 2) as input corresponding to a user using at least two input modes.  
Re:  claim 6, Hoque teaches
6. The processor of claim 5, wherein the activation condition includes at least one of:  determining that a user is looking at a camera associated with the instance of the application, determining that the user is speaking, determining that the user is speaking a trigger phrase, or determining that the user is performing a trigger gesture. (“During the conversational period, a webcam and microphone gather audiovisual data in real time during a portion of a coaching session… During the conversational period, processors process audio data collected by a microphone to analyze prosody of a human user’s speech.  The processors automatically recognize pauses, loudness and pitch variation (e.g., how well one is modulating his/her voice).”; Hoque, [0035])
During the conversational period, audio data is collected from a user by a microphone (determining that a user is speaking).  
Re:  claim 7, Hoque is silent, however, Lembersky teaches 
7. The processor of claim 1, wherein the generation of the second data is based at least in part on determining, using the first data, that a user is looking at a camera associated with the instance of the application and that the user accounts for at least a portion of the audio that is represented by the first data. (“… the techniques herein receive user input (e.g., data) indicative of a user’s speech 102 through an audio processor 104 (e.g., speech-to-text) and of a user’s face 106 through a video processor 108… The user’s converted text (speech) can then be passed to an AI engine 112 to determine a proper response 114 to the user… The techniques herein may also employ body tracking to ensure that the AI character maintains eye contact throughout the entire experience, so it really feels like it is a real human assistant helping.  For instance, the system may follow the user generally, or else may specifically look into the user’s eyes based on tracked eye gaze of the user.”; Lembersky,  [0036])
The system collects video and audio data from the user.  The system also determines that the user is looking into the camera by tracking the user’s eye gaze and determines that the user is speaking and performs speech-to-text conversion, which generates text (second data).  Therefore, it would have been obvious to one of ordinary skill in the art at the time of effective filing date to modify the method of Hoque by adding the feature of the generation of the second data is based at least in part on determining, using the first data, that a user is looking at a camera associated with the instance of the application and that the user accounts for at least a portion of the audio that is represented by the first data, in order to provide an automated emotion detection and response system that responds in the most intuitive way to keep the user engaged in the most natural manner, as taught by Lembersky. ([0027])
Re:  claim 8, Hoque is silent, however, Lembersky teaches 
8. The processor of claim 7, wherein the determination that the user accounts for at least a portion of the audio that is represented by the first data includes analyzing lip movement of the user. (“… images/video of the user Joseph Smith and another person are captured.  Based on such data, the content analysis module 225 determines that the user is gazing intently at the virtual agent model/virtual agent device and is speaking to the virtual agent.  It further determines that the other person is gazing away… an analysis of the face of the user indicates that the user is slightly smiling and thus the context analysis module 225 determines that Joseph Smith is happy.”; Lembersky, [0066])
Video and audio data (first data) for the user, Joseph, are captured.  The system determines that the user is looking at the virtual agent and speaking.  An analysis of the user’s face indicates that he is also smiling (thus facial movements, including lip movements, of the user are analyzed).  Therefore, it would have been obvious to one of ordinary skill in the art at the time of effective filing date to modify the method of Hoque by adding the feature of the determination that the user accounts for at least a portion of the audio that is represented by the first data includes analyzing lip movement of the user, in order to provide an automated emotion detection and response system that responds in the most intuitive way to keep the user engaged in the most natural manner, as taught by Lembersky. ([0027])
Re:  claim 9, Hoque teaches 
9. The processor of claim 1, wherein the analysis of the first data includes applying at least a subset of the first data to one or more algorithms configured to perform at least one of natural language processing, automatic speech recognition, or computer vision analysis. (“The system analyzes video data (e.g., facial expressions) and audio data (e.g., speech recognition and prosody analysis) gathered by a webcam and microphone… During the conversational period, processors analyze visual data captured by the webcam, in order to track smiles and head gestures… of the human user in every frame… the processors execute natural language understanding… algorithms, in order to analyze audio data to determine the content of the user’s speech during a coaching session.”; Hoque, [0022], [0031], [0082]) 
The video data is analyzed to track smiles and head gestures (computer vision analysis).  The audio data is analyzed using speech recognition (automatic speech recognition) and natural language understanding (natural language processing).
Re:  claim 11, Hoque teaches 
11. The processor of claim 1, wherein the processing circuitry causes the synchronized presentation by transmitting the audio data and video data representative of the rendering of the graphical data to a computing device executing the instance of the application. (“Processors control the apparent facial behavior of the virtual coach.  For example, the processors: (1) achieve lip-sync by using phonemes generated by CereprocTM software while generating the output voice; and (2) convert phonemes to visemes (shapes of the lips) by using curved interpolation… virtual coaching is provided remotely over the Internet.  The processors that control the virtual coach and UI may be located on one or more servers… that are remote from the user.  A display screen and speakers at the user’s location may display the virtual coach and UI.  A client computer at the user’s location may be operationally interposed between (a) the display screen and speakers at the user’s location and (b) an Internet connection to the one or more servers.  The one or more servers may comprise part of a “cloud” computing service.”; Hoque, [0040], [0079])
The audio and visual data are synchronized using lip-sync software that uses phonemes that are converted to visemes.  Virtual coaching is provided remotely over the internet.  Thus, the synchronized audio and video data are transmitted from cloud servers to the user’s client device.
Re:  claim 12, Hoque teaches 
12. The processor of claim 11, wherein the transmitting is from a cloud-based server. (“… virtual coaching is provided remotely over the Internet.  The processors that control the virtual coach and UI may be located on one or more servers… that are remote from the user.  A display screen and speakers at the user’s location may display the virtual coach and UI.  A client computer at the user’s location may be operationally interposed between (a) the display screen and speakers at the user’s location and (b) an Internet connection to the one or more servers.  The one or more servers may comprise part of a “cloud” computing service.”; Hoque, [0079])
The transmitting is from servers of a cloud computing service.  
Re:  claim 14, Hoque teaches 
14. The processor of claim 1, wherein the generation of the graphical data includes applying the audio data to a lip synchronization application such that at least a portion of a graphical representation of the virtual agent simulates motions corresponding to a pronunciation of the audio data. (“Processors control the apparent facial behavior of the virtual coach.  For example, the processors: (1) achieve lip-sync by using phonemes generated by CereprocTM software while generating the output voice; and (2) convert phonemes to visemes (shapes of the lips) by using curved interpolation”; Hoque, [0040])
The visual facial behavior of the virtual coach is generated by using lip-sync software that uses phonemes that are converted to visemes.
Re:  claim 15, Hoque is silent, however, Lembersky teaches 
15. The processor of claim 1, wherein the generation of the graphical data includes:  accessing a data file corresponding to the virtual environment; and generating the graphical data such that the virtual agent is depicted within the virtual environment. (“… the AI character system 100 can store or categorize emotions of characters and/or avatars into selectable groups that can be selected based on the determined mood of a user (indicated by audio and/or visual input of the user).  For example, a database can store and host the emotions of the characters.  Further, one or more characteristics of the characters and/or avatars may be modified and are generated to alter the response, appearance, expression, tone, etc. of the characters and/or avatars.”; Lembersky, [0038])
The emotions (data file corresponding to a virtual environment) of the characters/avatars are stored in, for example, a database.  The emotions of the characters/avatars (virtual agent) can be selected (accessing a data file corresponding to the virtual environment) to alter the response, appearance and/or facial expression of the characters/avatars that are generated (generating the graphical data such that the virtual agent is depicted within the virtual environment).  Therefore, it would have been obvious to one of ordinary skill in the art at the time of effective filing date to modify the method of Hoque by adding the feature of the generation of the graphical data includes:  accessing a data file corresponding to the virtual environment; and generating the graphical data such that the virtual agent is depicted within the virtual environment, in order to provide an automated emotion detection and response system that responds in the most intuitive way to keep the user engaged in the most natural manner, as taught by Lembersky. ([0027])
Re:  claim 16, Hoque is silent, however, Lembersky teaches
16. The processor of claim 15, wherein the data file is selected based at least in part on analyzing the first data to determine contextual information corresponding to the first data. (“the AI character system 100 can store or categorize emotions of characters and/or avatars into selectable groups that can be selected based on the determined mood of a user (indicated by audio and/or visual input of the user).”; Lembersky, [0038])
The system analyzes the audio and/or video data (first data) of the user to determine the user’s mood (analyzing first data to determine contextual information corresponding to the first data) and uses this information to select particular emotions of characters/avatars corresponding to the mood of the user.  Therefore, it would have been obvious to one of ordinary skill in the art at the time of effective filing date to modify the method of Hoque by adding the feature of the data file is selected based at least in part on analyzing the first data to determine contextual information corresponding to the first data, in order to provide an automated emotion detection and response system that responds in the most intuitive way to keep the user engaged in the most natural manner, as taught by Lembersky. ([0027])  
Re:  claim 17, Hoque is silent, however, Lembersky teaches 
17. The processor of claim 16, wherein the contextual information includes at least one of a location, an item, a structure, or a project. (“The text generated may then be sent to the AI engine 112… to perform text processing to return an appropriate response based on user intents… A more complex system may learn questions and responses over time… the response 114 may then be associated with a prerecorded… audio file, or else may have the text response converted dynamically to speech… one or more embodiments of the techniques herein also analyze the user’s mood based on the emotions on the user’s face via facial recognition (based on the video input 106)… as well as contextually based on the speech itself, for example words, tone, etc.… The Client must define “trigger” words, words or phrases that would trigger a particular response.  Then apply these trigger words to the proper response.  So, when a user interacts with a hologram and says a particular word such as “food” it will trigger a proper response such as “The food court is located on level 3””; Lembersky, [0025], [0027], [0058])
In order to determine a proper response, the AI engine determines the context of the user’s speech.  For example, when the user says the word “food” (contextual information includes an item), the agent responds in a particular way.  Therefore, it would have been obvious to one of ordinary skill in the art at the time of effective filing date to modify the method of Hoque by adding the feature of the contextual information includes at least one of a location, an item, a structure, or a project, in order to provide an automated emotion detection and response system that responds in the most intuitive way to keep the user engaged in the most natural manner, as taught by Lembersky. ([0027])  
Re:  claim 24, Hoque is silent, however, Kumar teaches 
24. The processor of claim 1, wherein the generation of the second data is based at least in part on a determining that a multimodal trigger has been satisfied. (“Fig. 5 shows components of a video conferencing system 500 including an interactive virtual assistant system 502…  a session is created when a conversation is started between the interactive virtual assistant system 502 and a user at the endpoint 106… For example, A/V data channel 504 may be a client application at endpoint 106 providing the video stream from a camera at endpoint 106 and the audio stream from a microphone at the endpoint… Incoming audio and video upload streams are provided to stream receiver 508 of the interactive virtual assistant system 502… For example, stream receiver 508 may push the audio stream to speech-to-text converter 510, and the video and/or audio stream to meeting quality components 514.  Speech-to-text converter 510 takes an audio stream and generates a corresponding text transcript… Natural language understanding (NLU) processor 512 processes the transcript using natural language approaches to determine the user’s intent based on the transcript made up of user statements, within the context of the conversation session… By using face detection, the meeting quality analyzer can determine if the user is visible within the video stream, and if the user is centered within the field of view of the camera generating the video stream, or only partially in view… The assistant agent 516 receives as input structured conversational data from the natural language understanding processor 512… and meeting parameters from the meeting quality analyzer 514.  Based on analysis of the input, the agent 516 may generate a conversational textual reply (i.e., an agent statement 406)…”; Kumar, col. 5, lines 62-66, col. 6, lines 4-10, lines 14-23, lines 49-53, col. 8, lines 36-40, lines 57-64, Fig. 5)
Fig. 5 illustrates that the audio data and visual data (multimodal trigger) are received by the interactive virtual assistant system 502 and processed.  Then the processed audio and video data (structured conversational data from the natural language understanding processor and meeting parameters from the meeting quality analyzer) are received at the assistant agent 516, which generates a conversational textual reply (second data) from the virtual assistant (generation of the second data based at least in part on a determining that a multimodal trigger has been satisfied).  Therefore, it would have been obvious to one of ordinary skill in the art at the time of effective filing date to modify the method of Hoque by adding the feature of the generation of the second data is based at least in part on a determining that a multimodal trigger has been satisfied, in order to facilitate customizing a teleconferencing user experience using a virtual agent, as taught by Kumar. (col. 3, lines 42-44)
Claim(s) 10 is/are rejected under 35 U.S.C. 103 as being unpatentable over Hoque in view of Kumar and Lembersky as applied to claim 9 above, and further in view of Yang et al. U.S. Pub. No. 2021/0366462.  
Re:  claim 10, Hoque is silent, however, Yang teaches 
10. The processor of claim 9, wherein the analysis of the first data includes applying at least a subset of the first data to one or more deep neural networks (DNNs). (“… the AI agent 74 differentiates a speech data and a non-speech data using deep neural network (DNN) model.”; Yang, [0101])
The AI agent uses a DNN to differentiate between speech data and non-speech data (applying at least a subset of the first data to one or more deep neural networks).  Therefore, it would have been obvious to one of ordinary skill in the art at the time of effective filing date to modify the method of Hoque by adding the feature of the analysis of the first data includes applying at least a subset of the first data to one or more deep neural networks (DNNs), in order to apply deep learning techniques, such as deep neural networks, to, for example, computer vision, speech recognition, natural language processing and speech signal processing to create a model that learns better representation techniques, as taught by Yang. ([0084])  
Claim(s) 13 is/are rejected under 35 U.S.C. 103 as being unpatentable over Hoque in view of Kumar and Lembersky as applied to claim 12 above, and further in view of Godi U.S. Pub. No. 2021/0221502.  
Re:  claim 13, Hoque is silent, however, Godi teaches 
13. The processor of claim 12, wherein the cloud-based server includes one or more parallel processing units for the rendering of the image data. (“… the data recorded by the UAV 102 is transmitted in real-time to cloud processing environment 112, wherein the received data is processed and rendered into 3D models by using high computing power distributed and parallel processing using a plurality of CPU and/or GPUs.  The cloud processing environment 112 is specially designed and configured to receive the data from one or more UAVs 102, and render the received data into 3D models and/or extended reality objects using high speed distributed and parallel processing…”; Godi, [0060], [0077], Fig. 1)
The data is rendered using parallel processing, which includes using plural GPUs (one or more parallel processing units) in a cloud processing environment (cloud-based server).  Therefore, it would have been obvious to one of ordinary skill in the art at the time of effective filing date to modify the method of Hoque by adding the feature of the cloud-based server includes one or more parallel processing units for the rendering of the image data, in order to share captured images in real-time over a cloud-based storage, as taught by Godi. ([0003])    
Claim(s) 18 is/are rejected under 35 U.S.C. 103 as being unpatentable over Hoque in view of Kumar and Lembersky as applied to claim 15 above, and further in view of Prevost et al. U.S. Patent No. 6,570,555.    
Re:  claim 18, Hoque is silent, however, Prevost teaches 
18. The processor of claim 15, wherein the graphical data is further representative of the virtual agent interacting with one or more virtual objects represented in the data file. (“An example interaction is shown in Fig. 3, in which the character is explaining how to control the room lighting from a panel display on the podium.”; Prevost, col. 6, lines 24-24, Fig. 3)
Fig. 3 illustrates the character (virtual agent) interacting with a virtual panel display (virtual agent interacting with one or more virtual objects represented in the data file).  Therefore, it would have been obvious to one of ordinary skill in the art at the time of effective filing date to modify the method of Hoque by adding the feature of the graphical data is further representative of the virtual agent interacting with one or more virtual objects represented in the data file, in order to enable the user to interact with the character, which works with the user to solve the problem at hand, as taught by Prevost. (col. 6, lines 27-30)
Claim(s) 19 is/are rejected under 35 U.S.C. 103 as being unpatentable over Hoque in view of Kumar and Lembersky as applied to claim 1 above, and further in view of Jacob et al. U.S. Patent No. 8,228,335. 
Re:  claim 19, Hoque is silent, however, Jacob teaches 
19. The processor of claim 1, wherein the rendering of the graphical data is generated by executing one or more ray-tracing techniques using one or more parallel processing units. (“First example operation 200 can be used to visualize and/or modify the animation of a character model 210;… The GPU can perform any surface or volume rendering technique known in the art to create one or more rendered images from the provided data and instructions, including… ray tracing… The GPU 2035 can further include one or more programmable execution units capable of executing shader programs.  GPU 2035 can be comprised of one or more graphics processing cores.”; Jacob, col. 4, lines 60-64, col. 8, lines 49-59)
The animation of a character model (graphical data) is rendered using ray tracing on a GPU, which can include plural graphics processing cores (one or more parallel processing units).  Therefore, it would have been obvious to one of ordinary skill in the art at the time of effective filing date to modify the method of Hoque by adding the feature of the rendering of the graphical data is generated by executing one or more ray-tracing techniques using one or more parallel processing units, in order to perform surface or volume rendering to create rendered images from the provided data and instructions, as taught by Jacob. (col. 8, lines 49-56)  
Claim(s) 20 and 21 is/are rejected under 35 U.S.C. 103 as being unpatentable over Hoque in view of Kumar and Lembersky as applied to claim 1 above, and further in view of Kita U.S. Pub. No. 2019/0087736.  
Re:  claim 20, Hoque is silent, however, Kita teaches 
20. The processor of claim 1, further comprising processing circuitry to instantiate one or more additional virtual agents corresponding to the instance of the application, each virtual agent of the one or more additional virtual agents being associated with at least one skill or domain different from each other virtual agent. (“The AI selection unit selects an artificial intelligence corresponding to the use scene, on the basis of the use scene specified by the user information acquisition unit 51, and the artificial intelligence information with respect to each of the plurality of artificial intelligences.  For example, in a case where the use scene is running, the AI selection unit 53 selects an artificial intelligence having a function of giving advice with respect to running, as the artificial intelligence corresponding to the use scene… in a case where the use scene is cooking, the AI selection unit 53 selects artificial intelligence which is specialized in a search for a cooking recipe, as the artificial intelligence corresponding to the use scene… The AI selection unit 53 may perform such selection every time according to calculation of checking whether or not the use scene and artificial intelligence information with respect to each of the plurality of artificial intelligences are coincident with each other.”; Kita, [0035], [0036])
The AI selection unit selects an artificial intelligence or artificial intelligence agent (instantiates one or more virtual agents), where each artificial intelligence agent is associated with a different skill.  For example, if the use scene is determined to be running, then an artificial intelligence agent is selected to give running advice, or if the use scene is determined to be cooking, then an artificial intelligence agent is selected to search for recipes.  Therefore, it would have been obvious to one of ordinary skill in the art at the time of effective filing date to modify the method of Hoque by adding the feature of processing circuitry to instantiate one or more additional virtual agents corresponding to the instance of the application, each virtual agent of the one or more additional virtual agents being associated with at least one skill or domain different from each other virtual agent, in order to enable the AI selection processing to select an artificial intelligence agent, which communicates with the user in response to user requests, as taught by Kita. ([0022])  
Re:  claim 21, Hoque is silent, however, Kita teaches 
21. The processor of claim 1, wherein the instantiation of each of the one or more additional virtual agents is based at least in part on additional data representative of a request for the additional virtual agent, the request being represented at least one of textually, visually, or audibly. (“For example, in a certain use scene, the user asks the same question with respect to the plurality of artificial intelligences, compares a plurality of answers which are answered by each of the artificial intelligences, and selects any answer… The sound recognition processing unit 55, the communication processing unit 56, and the output control unit 57 cooperate with each other, and thus, communication between the artificial intelligence selected by the AI selection unit 53 and the user, is realized… the artificial intelligence selected by the AI selection unit 53, for example, performs communication with respect to the user as the artificial intelligence agent.”; Kita, [0040], [0047])
The instantiation of plural artificial intelligences or artificial intelligence agents (one or more additional virtual agents) is based on the user asking the same question (request) to a plurality of artificial intelligences (based on additional data representative of a request for the additional virtual agent).  The user’s question can be verbal (audible), which is analyzed by sound recognition processing (the request being represented at least one of textually, visually, or audibly).  Therefore, it would have been obvious to one of ordinary skill in the art at the time of effective filing date to modify the method of Hoque by adding the feature of the instantiation of each of the one or more additional virtual agents is based at least in part on additional data representative of a request for the additional virtual agent, the request being represented at least one of textually, visually, or audibly, in order to enable the AI selection processing to select an artificial intelligence agent, which communicates with the user in response to user requests, as taught by Kita. ([0022])  
Claim(s) 22 is/are rejected under 35 U.S.C. 103 as being unpatentable over Hoque in view of Kumar and Lembersky as applied to claim 1 above, and further in view of Crampton WO 03/058518.  
Re:  claim 22, Hoque is silent, however, Crampton teaches 
22. The processor of claim 1, wherein the virtual environment is a first virtual environment, and wherein during execution of the instance of the application, additional graphical data is generated representative of the virtual agent from another field of view of a second virtual environment from a perspective of another virtual camera, the first virtual environment being different from the second virtual environment. (“… Fig. 19 is a set of four timelines of the camera shots during the avatar conference for each mode… In Mode M2, by way of example, the first shot S10 is from Camera 71 and is an overview view similar to that in Figure 15.  This is followed by shot S11 from Camera 72 which shows Ted.”; Crampton, p. 30, lines 35-38, Figs. 15-19)
Figs. 15 and 19 illustrate the avatar user interface session (virtual video conference) in a meeting room media window (first virtual environment).  During the execution of the virtual conference (during execution of the instance of the application), in Mode M2, the view is an overview of the meeting room, from virtual camera 71 (Fig. 18) showing the four participant avatars, Ted, Jill, Andy and Pam.  This is followed by a view from virtual camera 72 (Fig. 18), which shows just avatar Ted (additional graphical data generated representative of the virtual agent from another field of view of a second virtual environment from a perspective of another virtual camera, the first virtual environment being different from the second virtual environment).  Crampton can be combined with Hoque, such that an avatar of Crampton is the virtual coach of Hoque.  Therefore, it would have been obvious to one of ordinary skill in the art at the time of effective filing date to modify the method of Hoque by adding the feature of the virtual environment is a first virtual environment, and wherein during execution of the instance of the application, additional graphical data is generated representative of the virtual agent from another field of view of a second virtual environment from a perspective of another virtual camera, the first virtual environment being different from the second virtual environment, in order to suspend the disbelief of the viewer of the session such that he thinks it is an actual meeting where he is the only person who is not in the room, thereby giving the viewer a higher sense of copresence in the avatar user interface session than is obtainable in a telephone conference call, as taught by Crampton. (p. 30, lines 1-10)  
Claim(s) 23 is/are rejected under 35 U.S.C. 103 as being unpatentable over Hoque in view of Kumar and Lembersky as applied to claim 1 above, and further in view of Gibbs et al U.S. Pub. No. 2017/0206095. 
Re:  claim 23, Hoque teaches 
23. The processor of claim 1, further comprising processing circuitry to:  determine, based at least in part on the first data, an emotional characteristic of a user of the one or more users, (“During the conversational period, processors analyze visual data captured by the webcam, in order to track smiles and head gestures… of the human user in every frame… In order to identify smiles, processors execute a SHORE™ (Sophisticated High-speed Object Recognition Engine) algorithm.  The SHORE™ algorithm detects faces and facial features… The features from all over the face are used for boosting… Thus, each face image is scored from 0 to 100 representing smile intensity (0 means not smiling, 100 means a full smile).”; Hoque, [0031], [0032])
Visual data captured by the webcam (first data) is analyzed to determine how much the user is smiling (determine an emotional characteristic of a user of the one or more users).  
Hoque and Gibbs teach wherein the generation of the second data and the generation of the image data representative of the virtual agent are based at least in part on the emotional condition. (“… processors analyze visual data gathered by a webcam, in order to detect head orientation of a human user and smiles by the human user… if a human user smiles and nods, the virtual coach may, in some cases, appear to mirror the user’s behavior - that is, the virtual coach may appear to smile and nod its head in response.”; Hoque, [0041])
The virtual coach is generated based on the facial expression (which indicates the emotional condition) of the user.  If the user smiles and nods, then the virtual coach is generated to also smile and nod its head in response (the generation of image data representative of the virtual agent is based on the emotional condition).  Hoque is silent, however, Gibbs teaches generating second data of the virtual agent based on the emotional condition.  (“The NLP/dialog generation module 235 generates a script that represents what the virtual agent will say in response to events or conditions that are detected by the virtual agent… if the virtual agent has detected from visual analysis that the user is sad, and further “hears” words spoken by the user indicating that the user is worried, then the NLP/dialog generation module 235 may generate an appropriate script such as, “Are you feeling alright?  You seem troubled by something.””; Gibbs, [0057])
The generation of a script (second data) is in response to user emotions.  For example, if the virtual agent detects from visual analysis, that the user is sad, and hears the user saying words indicating that the user is worried, the virtual agent will generate a script (second data) response such as, “Are you feeling alright?  You seem troubled by something.” (generation of the second data is based on the emotional condition).  Gibbs can be combined with Hoque such that the script of Gibbs is generated based on the emotions of Gibbs and Hoque.  Therefore, it would have been obvious to one of ordinary skill in the art at the time of effective filing date to modify the method of Hoque by adding the feature of the generation of the second data and the generation of the image data representative of the virtual agent are based at least in part on the emotional condition, in order to make the virtual agent appear to be more human-like, which can help facilitate a more natural, realistic dialogue between the virtual agent and the user, as taught by Gibbs. ([0024])  
Claim(s) 25; 26, 27, 28, 29, 31, 33 and 35 is/are rejected under 35 U.S.C. 103 as being unpatentable over Kumar in view of Lembersky.
Re:  claim 25, Kumar teaches 
25. A method comprising:  receiving first data representative of at least one of video, audio, or text generated using a user device communicatively coupled to an instance of a conferencing application; (“The video conferencing system 100 may include multiple devices associated with single video conference endpoints 106 (e.g., endpoints 106a and 106b in Fig. 1), each device with its own set of capabilities… a session is created when a conversation is started between the interactive virtual assistant system 502 and a user at the endpoint 106… A/V data channel 504 may be a client application at endpoint 106 providing the video stream from a camera at endpoint 106 and the audio stream from a microphone at the endpoint… Incoming audio and video upload streams are provided to stream receiver 508 of the interactive virtual assistant system 502… the user may type his or her statements at the endpoint, and this text may be directly provided to the interactive virtual assistant system 502 as user statements 410…“; Kumar, col. 3, lines 13-16, col. 4, lines 63-66, col. 6, lines 4-10, lines 14-16, lines 43-48, Figs. 1 and 5)
Fig. 1 illustrates that the endpoint (user device) generates audio and video streams that are sent to the video conferencing system.  These audio and video streams are received by the video conferencing system (receiving first data representative of at least one of video, audio, or text generated using a user device communicatively coupled to an instance of a conferencing application).  
  analyzing the first data to determine that a response is to be generated for a virtual agent; (“Incoming audio and video upload streams are provided to stream receiver 508 of the interactive virtual assistant system 502… Speech-to-text converter 510 takes an audio stream and generates a corresponding text transcript… NLU processor 512 uses descriptions for the context of the user’s statements (e.g., user utterances) in the transcript as well as grammar and machine learning techniques to determine the intent and any relevant entities in the user statements… The assistant agent 516 receives as input structured conversational data from the natural language understanding processor 512… Based on analysis of the input, the agent 516 may generate a conversational textual reply (i.e., an agent statement 406) and, in certain circumstances, an action.“; Kumar, col. 6, lines 14-16, lines 21-23, lines 55-59, col. 8, lines 57-64, Fig. 5)
For example, the audio data (first data) is analyzed by the speech-to-text converter and the NLU processor.  Based on this analysis, the agent generates a conversational textual reply.  
based at least in part on the first data, generating second data representative of a textual response and a visual response corresponding to the virtual agent; (“User interface 400 may be associated with automatically generated audio corresponding to the text prompts and responses generated by the agent (i.e., the agent’s voice)… User interface 400 may additionally present an agent chat transcript 404.  The chat transcript 404 is the text transcript of the conversation between the agent and the user… one or both of an agent representation 402 or agent chat transcript 404 may be displayed in connection with all or some of the elements of user interface 300…“; Kumar, col. 4, lines 56-59, col. 5, lines 52-55, Fig. 4)
Fig. 4 illustrates that the textual response and the visual response are generated for the virtual agent (generating second data representative of a textual response and a visual response corresponding to the virtual agent) based on the analysis of the user’s audio and video data.  
Kumar and Lembersky teach generating graphical data representative of a virtual environment captured from a perspective of a virtual camera, the virtual environment including a graphical representation of the virtual agent as the virtual agent executes the visual response; (“… user interface 400 includes an animated agent representation 402 for visually interacting with the user.  User interface 400 may be associated with automatically generated audio corresponding to the text prompts and responses generated by the agent (i.e., the agent’s voice).  Such an agent representation 402 may depict a human or animal face and use lip sync techniques to generate a sequence of frames depicting the agent’s face and corresponding to the words of an audio stream (i.e., corresponding to the agent’s voice).“; Kumar, col. 4, lines 54-63, Fig. 4)
The animated agent visual representation is generated (generating graphical data representative) speaking responses to the user, in the animated agent’s voice (graphical representation of the virtual agent as the virtual agent executes the visual response).  Kumar is silent, however, Lembersky teaches a virtual environment captured from a perspective of a virtual camera. (“The user’s converted text (speech) and mood 110 may then be passed to an AI engine 112 to determine a proper response 114 to the user (e.g., an answer to a question and specific emotion), which results in the proper text and emotional response being set to a processor 116, which then translates the responsive text back to synthesized speech 118, and also triggers visual display “blend shapes” 120 to morph a face of the AI character or avatar… into a proper facial expression to convey the appropriate emotional response and mouth movement (lip synching) for the response. “; Lembersky, [0022])
The AI character or avatar (virtual agent) is generated (generate graphical data representative of a virtual environment including the virtual agent) showing the proper facial expression to convey the appropriate emotional response (generate image data representative of a rendering of the graphical data).  Lembersky can be combined with Kumar such that an avatar of Lembersky is the animated agent of Kumar.  Therefore, it would have been obvious to one of ordinary skill in the art at the time of effective filing date to modify the method of Kumar by adding the feature of generating graphical data representative of a virtual environment captured from a perspective of a virtual camera, the virtual environment including a graphical representation of the virtual agent as the virtual agent executes the visual response, in order to provide an automated emotion detection and response system that responds in the most intuitive way to keep the user engaged in the most natural manner, as taught by Lembersky. ([0027])  
Kumar teaches generating video data based at least in part on rendering the graphical data; synchronizing, with the video data, audio data rendered based at least in part on a text-to-speech representation of the textual response; (“The audio speech output from text-to-speech converter 518 and the agent statements 406 may be provided to video synthesizer 520 to generate frames of the video for display at the endpoint (e.g., via the user interface 400).  For example, the video synthesizer 520 may generate an agent representation 402 with animation responsive to the audio speech output from the converter 518 (e.g., if lip sync techniques are used to map the representation’s lip movements to the speech or to generate expressions at appropriate times in the speech…).“; Kumar, col. 11, lines 46-57, Figs. 4-5) 
The video of the agent representation is generated (generating video data based at least in part on rendering the graphical data).  The rendered audio data and the video data are synchronized using the text-to-speech converter (synchronizing, with the video data, audio data rendered based at least in part on a text-to-speech representation of the textual response) and lip sync techniques.  
and transmitting the video data and the audio data to a device hosting the instance of the conferencing application. (“Video frames corresponding to a visual agent representation 402 may be generated based on the agent speech audio from step 712, using, for example, video synthesizer 520… The video frames and agent speech audio are then used in composing audio and video streams for download, by, e.g., stream composer 522… The download audio and video streams are provided to the endpoint and played at endpoint devices having a display for the video or a speaker for the audio.“; Kumar, col. 13, lines 10-13, lines 16-19, lines 22-25)
Audio streams (audio data) and video streams (video data) are downloaded and provided (transmitted) to the endpoint device and played at the endpoint device (transmitting the video data and the audio data to a device hosting the instance of the conferencing application).  
Re:  claim 26, Kumar teaches 
26. The method of claim 25, wherein the video, audio, or text is generated by at least one of a camera, a microphone, or an input device of the user device. (“… a session is created when a conversation is started between the interactive virtual assistant system 502 and a user at the endpoint 106… For example, A/V data channel 504 may be a client application at endpoint 106 providing the video stream from a camera at endpoint 106 and the audio stream from a microphone at the endpoint… Incoming audio and video upload streams are provided to stream receiver of the interactive virtual assistant system 502…the user may type his or her statements at the endpoint, and this text may be directly provided to the interactive virtual assistant system 502… ”; Kumar, col. 5, lines 63-66, col. 6, lines 4-10, lines 14-16, lines 43-48, Fig. 5)
At the endpoint (user device), the video data is generated by a camera, the audio data is generated by a microphone and the text data is generated by the user typing on a keyboard (input device of the user device).  
Re:  claim 27, Kumar teaches
27. The method of claim 25, wherein the analyzing the first data comprises determining whether an activation trigger corresponding to the virtual agent has been satisfied. (“… a session is created when a conversation is started between the interactive virtual assistant system 502 and a user at the endpoint 106… For example, A/V data channel 504 may be a client application at endpoint 106 providing the video stream from a camera at endpoint 106 and the audio stream from a microphone at the endpoint, and receiving composited audio and video streams from media handler 506 for display at the endpoint…”; Kumar, col. 5, lines 63-66, col. 6, lines 4-10, Fig. 5)
A session is created (activation trigger) when a conversation is started between a user and the virtual assistant system.  When a conversation is created, a video stream and an audio stream are created at the endpoint (user) and provided to the interactive virtual assistant system.  
Re:  claim 28, Kumar is silent, however, Lembersky teaches 
28. The method of claim 27, wherein the activation trigger includes determining that the user is looking at a camera of the user device and that a user is speaking. (“… one or more embodiments of the techniques herein also analyze the user’s mood based on the emotions on the user’s face via facial recognition (based on the video input 106)… as well as contextually based on the speech itself, for example, words, tone, etc. (based on the audio input 102)… The techniques herein may also employ body tracking to ensure that the AI character maintains eye contact throughout the entire experience, so it really feels like it is a real human assistant helping.  For instance, the system may follow the user generally, or else may specifically look into the user’s eyes based on tracked eye gaze of the user.”; Lembersky, [0027], [0036])
The system determines a user’s mood (activation trigger) based on the user’s video input (which includes gaze tracking to determine if user is looking at the camera) and based on the user’s speech input (such as the tone, words, etc.).  The system also determines that the user is looking into the camera by tracking the user’s eye gaze.  In response (activation trigger), the AI character maintains eye contact with the user.  Therefore, it would have been obvious to one of ordinary skill in the art at the time of effective filing date to modify the method of Kumar by adding the feature of the activation trigger includes determining that the user is looking at a camera of the user device and that a user is speaking, in order to provide an automated emotion detection and response system that responds in the most intuitive way to keep the user engaged in the most natural manner, as taught by Lembersky. ([0027])  
Re:  claim 29, Kumar is silent, however, Lembersky teaches 
29. The method of claim 28, wherein the determining that the user is looking at a camera and that the user is speaking is executed using one or more computer vision techniques. (“With reference to Fig. 1, an AI character system 100 for managing a character and/or avatar is shown.  In particular, the techniques herein receive user input… indicative of a user’s speech 102 through an audio processor 104… and of a user’s face 106 through a video processor 108… one or more embodiments of the techniques herein also analyzes the user’s mood based on the emotions on the user’s face via facial recognition (based on video input 106)… as well as contextually based on the speech itself, for example words, tone, etc. (based on the audio input 102)… The techniques herein may also employ body tracking to ensure that the AI character maintains eye contact throughout the entire experience, so it really feels like it is a real human assistant helping.  For instance, the system may follow the user generally, or else may specifically look into the user’s eyes based on tracked eye gaze of the user.”; Lembersky, [0022], [0027], [0030], Fig. 1)
The system employs gaze tracking (computer vision techniques) to enable the AI character to maintain eye contact with the user (determining that the user is looking at the camera).  Also, the system receives audio data (determines that the user is speaking) and uses it to analyze the user’s mood based on the context of the user’s speech, which includes words and tone.  Thus, the system determines that the user is looking at the camera and that the user is speaking.  Therefore, it would have been obvious to one of ordinary skill in the art at the time of effective filing date to modify the method of Kumar by adding the feature of determining that the user is looking at a camera and that the user is speaking is executed using one or more computer vision techniques, in order to provide an automated emotion detection and response system that responds in the most intuitive way to keep the user engaged in the most natural manner, as taught by Lembersky. ([0027])  
Re:  claim 31, Kumar teaches 
31. The method of claim 25, wherein the visual response includes at least one of a gesture of the virtual agent, a posture of the virtual agent, an emotional display of the virtual agent, a facial expression of the virtual agent, or determining the virtual environment of the virtual agent. (“… the agent representation may nod when responding in the affirmative, and shake its head when responding in the negative.  The agent representation may additionally show expressions such as a smile… or a frown…”; Kumar, col. 5, lines 2-10)
The agent representative (virtual agent) has a visual response that includes nodding and shaking its head (gesture of the virtual agent), as well as smiling and frowning (a facial expression of the virtual agent).   
Re:  claim 33, Kumar is silent; however, Lembersky teaches 
33. The method of claim 25, wherein the generating the graphical data includes using a deep neural network to determine a representation of an emotional state of the virtual agent within the virtual environment. (“The user’s converted text (speech) and mood 110 may then be passed to an AI engine 112 to determine a proper response 114 to the user (e.g., an answer to a question and specific emotion), which results in the proper text and emotional response being sent to a processor 116, which then translates the responsive text back to synthesized speech 118, and also triggers visual display “blend shapes” 120 to morph a face of the AI character or avatar… into a proper facial expression to convey the appropriate emotional response and mouth movement (lip synching) for the response.”; Lembersky, [0022])
The AI engine (deep neural network) determines a proper response to the user, which includes the proper text and emotional response.  The text portion of the response is translated to speech and blend shapes are displayed to morph the face of the AI character into a proper facial expression to convey the appropriate emotional response to the user (determine a representation of an emotional state of the virtual agent within the virtual environment).  Therefore, it would have been obvious to one of ordinary skill in the art at the time of the effective filing date to modify the method of Kumar by adding the feature that the generating the graphical data includes using a deep neural network to determine a representation of an emotional state of the virtual agent within the virtual environment, in order to provide an automated emotion detection and response system that responds in the most intuitive way to keep the user engaged in the most natural manner, as taught by Lembersky. ([0027])  
Re:  claim 35, Kumar teaches 
35. The method of claim 25, wherein the analyzing the first data includes using an automatic speech recognition algorithm. (“Incoming audio and video data upload streams are provided to stream receiver 508 of the interactive virtual assistant system 502… stream receiver 508 may push the audio stream to speech-to-text converter 510…”; Kumar, col. 6, lines 14-20, Fig. 5)   
The audio data (first data) is recognized (automatic speech recognition algorithm) in the stream receiver and pushed to the speech-to-text converter.  
Claim(s) 30 is/are rejected under 35 U.S.C. 103 as being unpatentable over Kumar in view of Lembersky as applied to claim 29 above, and further in view of Yang.  
Re:  claim 30, Kumar is silent; however, Yang teaches 
30. The method of claim 29, wherein the computer vision techniques include one or more deep neural networks. (“The deep learning represents a certain data in a form readable by a computer (e.g., when data is an image, pixel information is represented as column vectors or the like)… various deep learning techniques such as deep neural networks (DNN)… may be applied to computer vision…”; Yang, [0084])
The AI agent uses DNN to differentiate between speech data and non-speech data (applying at least a subset of the first data to one or more deep neural networks).  Therefore, it would have been obvious to one of ordinary skill in the art at the time of the effective filing date to modify the method of Hoque by adding the feature that the analysis of the first data includes applying at least a subset of the first data to one or more deep neural networks (DNNs), in order to apply deep learning techniques, such as deep neural networks, to, for example, computer vision, to create a model that learns better representation techniques, as taught by Yang. ([0084])  
Claim(s) 32 is/are rejected under 35 U.S.C. 103 as being unpatentable over Kumar in view of Lembersky as applied to claim 31 above, and further in view of Han U.S. Pub. No. 2020/0312004. 
Re:  claim 32, Kumar is silent; however, Han teaches 
32. The method of claim 31, wherein the transmitting the video data includes:  encoding the video data to generate encoded video data; and streaming the encoded video data. (“When the external image acquired through the video output unit 111 is a video image and a network environment is an environment for permitting transmission of a video stream, the video image itself may be converted into a form appropriate for the avatar source data, that is, the form of a video stream, by the video encoder 113… the data reception management unit 210 may forward the video stream data to a video decoder… The video decoder 221 may decode a video stream and may forward the video stream to a video output unit 230… The video output unit 230 may output the image data… from the video decoder… in a form of visual information.”; Han, [0033], [0041], [0042], [0044], Fig. 1)
Fig. 1 illustrates that the video input unit transfers the video data to the video encoder to form a video stream, which is transmitted to the reception side device, where the video stream is decoded and transferred to the video output unit, where the video is displayed (streamed).  Therefore, it would have been obvious to one of ordinary skill in the art at the time of the effective filing date to modify the method of Kumar by adding the feature that the transmitting the video data includes:  encoding the video data to generate encoded video data; and streaming the encoded video data, in order to generate and output an avatar applicable to a vehicle, as taught by Han. ([0008])  
Claim(s) 34 is/are rejected under 35 U.S.C. 103 as being unpatentable over Kumar in view of Lembersky as applied to claim 33 above, and further in view of Gibbs.
Re:  claim 34, Kumar is silent; however, Gibbs teaches 
34. The method of claim 33, wherein the emotional state is determined based at least in part a user emotional state determined based at least in part on the first data. (“… the context analysis module analyzes video or images of the face to determine changes in the facial movements, gaze and expressions of the user.  Based on the detected movements, the module 225 may determine the user’s mood or emotions.”; Gibbs, [0055])
The emotional state of the user is determined, for example, using video of the user’s face (first data).  
(“The NLP/dialog generation module 235 generates a script that represents what the virtual agent will say in response to events or conditions that are detected by the virtual agent… if the virtual agent has detected from visual analysis that the user is sad, and further “hears” words spoken by the user indicating that the user is worried, then the NLP/dialog generation module 235 may generate an appropriate script such as, “Are you feeling alright?  You seem troubled by something.”… if the interaction context data received from the context analysis module 225 indicates that the user has a particular mood or emotional state (e.g., nervous, happy, sad, etc.), then the behavior planner module 215 may schedule behaviors that are associated with such emotions (e.g., a nervous blinking of the eyes, a smile, a frown, etc.)”; Gibbs, [0057], [0059])
The generation of a script (second data) is in response to user emotions.  For example, if the virtual agent detects from visual analysis that the user is sad, and hears the user saying words indicating that the user is worried, the virtual agent will generate a script (second data) with the appropriate emotional response such as, “Are you feeling alright?  You seem troubled by something.” (the emotional state is determined based at least in part on a user emotional state determined based at least in part on the first data).  Gibbs can be combined with Hoque such that the script of Gibbs is generated based on the emotions of Gibbs and Hoque.  Therefore, it would have been obvious to one of ordinary skill in the art at the time of the effective filing date to modify the method of Kumar by adding the feature that the emotional state is determined based at least in part a user emotional state determined based at least in part on the first data, in order to make the virtual agent appear to be more human-like, which can help facilitate a more natural, realistic dialogue between the virtual agent and the user, as taught by Gibbs. ([0024])  
Claim(s) 36 is/are rejected under 35 U.S.C. 103 as being unpatentable over Kumar in view of Lembersky as applied to claim 25 above, and further in view of Kiyohiro JP 2018-206431 A.
Re:  claim 36, Kumar and Kiyohiro teach 
36. The method of claim 25, wherein the device hosting the instance of the conferencing application corresponds to a first cloud-based platform, and the generating the second data and the generating the graphical data are executed using a second cloud-based platform different from the first cloud-based platform. (“The agent statements 406 generated by assistant agent 516 are provided to text-to-speech converter 518.  Text-to-speech converter 518 converts text (i.e., agent statements 406) to audio data containing the corresponding speech.  This conversion may be performed using a cloud service… The audio speech output from the text-to-speech converter 518 and agent statements 406 may be provided to video synthesizer 520 to generate the frames of the video for display at the endpoint… For example, the video synthesizer 520 may generate an agent representation 402 with an animation responsive to the audio speech output from converter 518…”; Kumar, col. 11, lines 39-52, Fig. 5)
Fig. 5 illustrates the interactive virtual assistant system that executes a cloud service (second cloud-based platform) that performs text-to-speech conversion and uses this to generate the agent representation.  Kumar is silent; however, Kiyohiro teaches the device hosting the instance of the conferencing application corresponds to a first cloud-based platform. (“The cloud service A providing apparatus 151 to the cloud service A providing apparatus 153 each provide various cloud services.  Here, the services provided as the cloud services A to C include various services such as… a cloud video conference system service… In the example of Fig. 1, the different cloud service providing apparatuses 151 to 153 are configured to provide different cloud services A to C.”; Kiyohiro, [0019], [0020], Fig. 1)
The system includes plural cloud services A-C (which include a first cloud-based platform and a second cloud-based platform), where each cloud service providing apparatus provides a different cloud service.  One cloud service is a cloud video conference system service.  Kiyohiro can be combined with Kumar such that the plural cloud services of Kiyohiro include the cloud service of Kumar.  Therefore, it would have been obvious to one of ordinary skill in the art at the time of the effective filing date to modify the method of Kumar by adding the feature that the device hosting the instance of the conferencing application corresponds to a first cloud-based platform, and the generating the second data and the generating the graphical data are executed using a second cloud-based platform different from the first cloud-based platform, in order to allow users of various devices to receive the cloud services, as taught by Kiyohiro. ([0022]) 
Claim(s) 37 is/are rejected under 35 U.S.C. 103 as being unpatentable over Lembersky in view of Kita and Kumar.  
Re:  claim 37, Lembersky teaches 
37. A system comprising:  one or more parallel processing units executing an artificial intelligence engine to: (“With reference to Fig. 1, an AI character system 100 for managing a character and/or avatar is shown… The user’s converted text (speech) and mood 110 may then be passed to an AI engine 112 to determine a proper response 114 to the user… which results in the proper text and emotional response being sent to a processor 116…”; Lembersky, [0022], Fig. 1)
An artificial intelligence engine is executing.  Lembersky is silent as to parallel processing; however, Kita teaches 
(“… the AI evaluation processing unit 54 may use an evaluation program or evaluation calculation such as an evaluation artificial intelligence evaluating an answer of an artificial intelligence, and thus, may automatically evaluate the answer of the selected artificial intelligence, and may apply the evaluation point… the steps defining the program recorded in the storage medium include not only the processing executed in a time series following this order, but also processing executed in parallel or individually, which is not necessarily executed in time series.”; Kita, [0041], [0075])
The AI evaluation processing unit may use an evaluation program that can be executed in parallel.  Kita can be combined with Lembersky such that the artificial intelligence engine of Lembersky is executed on the parallel processors of Kita.  Therefore, it would have been obvious to one of ordinary skill in the art at the time of the effective filing date to modify the system of Lembersky by adding the feature of one or more parallel processing units executing an artificial intelligence engine, in order to enable AI evaluation processing to evaluate an answer of an artificial intelligence, as taught by Kita. ([0041])
Lembersky teaches receive first data representative of one or more of audio, text, or video associated with one or more users participating in an instance of an application; analyze the first data to determine an activation condition being achieved; (“… receive user input (e.g., data) indicative of a user’s speech 102 through an audio processor 104 (e.g., speech-to-text) and of a user’s face 106 through a video processor 108.  Also through a facial recognition API 110 and/or skeletal tracking, the techniques herein can determine the mood of the user.  The user’s converted text (speech) and mood 110 may then be passed to the AI engine 112 to determine a proper response 114 to the user…”; Lembersky, [0022])
Input of the user’s face and speech (first data of one or more of audio, text, or video associated with one or more users participating in an instance of the application) is received.  The input of the user’s face and speech (first data) is analyzed using, for example, speech-to-text, video processing, and facial recognition to determine the user’s mood.  The AI engine is activated with the user’s input.  
generate based at least in part on the activation condition being achieved, second data representative of a textual output responsive to the first data and corresponding to the virtual agent;  apply the second data to a text-to-speech algorithm to generate audio data; (“The user’s converted text (speech) and mood 110 may then be passed to an AI engine 112 to determine a proper response 114 to the user (e.g., an answer to a question and specific emotion), which results in the proper text and emotional response being sent to a processor 116, which then translates the responsive text back to synthesized speech 118…”; Lembersky, [0022])
The user input is received and the AI engine generates a text response (second data representative of a textual output responsive to first data and corresponding to the virtual agent).  The text (second data) generated by the AI engine is then translated back to synthesized speech (text-to-speech algorithm to generate audio data).  
a rendering engine to: generate graphical data representative of a virtual environment including the virtual agent and from a perspective of a virtual camera in the virtual environment; and generate image data representative of a rendering of the graphical data; (“The user’s converted text (speech) and mood 110 may then be passed to an AI engine 112 to determine a proper response 114 to the user (e.g., an answer to a question and specific emotion), which results in the proper text and emotional response being sent to a processor 116, which then translates the responsive text back to synthesized speech 118, and also triggers visual display “blend shapes” 120 to morph a face of the AI character or avatar… into a proper facial expression to convey the appropriate emotional response and mouth movement (lip synching) for the response.”; Lembersky, [0022])
The AI character or avatar (virtual agent) is generated (generate graphical data representative of a virtual environment including the virtual agent) showing the proper facial expression to convey the appropriate emotional response (generate image data representative of a rendering of the graphical data).  
Lembersky is silent; however, Kumar teaches a communication device to transmit the image data and the audio data to one or more devices corresponding to the instance of the application to cause the one or more devices to present the image data and output audio corresponding to the audio data. (“Video frames corresponding to a visual agent representation 402 may be generated based on the agent speech audio from step 712, using, for example, video synthesizer 520… The video frames and agent speech audio are then used in composing audio and video streams for download, by, e.g., stream composer 522… The download audio and video streams are provided to the endpoint and played at endpoint devices having a display for the video or a speaker for the audio.”; Kumar, col. 13, lines 10-13, lines 16-19, lines 22-25)
Audio streams (audio data) and video streams (image data) are downloaded and provided (transmitted) to the endpoint device (one or more devices corresponding to the instance of the application) and played at the endpoint device (cause the one or more devices to present the image data and output audio corresponding to the audio data).  Therefore, it would have been obvious to one of ordinary skill in the art at the time of the effective filing date to modify the system of Lembersky by adding the feature of a communication device to transmit the image data and the audio data to one or more devices corresponding to the instance of the application to cause the one or more devices to present the image data and output audio corresponding to the audio data, in order to facilitate customizing a teleconferencing user experience using a virtual agent, as taught by Kumar. (col. 3, lines 42-44)

Conclusion
Any inquiry concerning this communication or earlier communications from the examiner should be directed to DONNA J RICKS whose telephone number is (571)270-7532.  The examiner can normally be reached on M-F 7:30am-5pm EST (alternate Fridays off).
Examiner interviews are available via telephone, in-person, and video conferencing using a USPTO supplied web-based collaboration tool. To schedule an interview, applicant is encouraged to use the USPTO Automated Interview Request (AIR) at http://www.uspto.gov/interviewpractice.
If attempts to reach the examiner by telephone are unsuccessful, the examiner’s supervisor, Jennifer Mehmood, can be reached on 571-272-2976.  The fax phone number for the organization where this application or proceeding is assigned is 571-273-8300.
Information regarding the status of an application may be obtained from the Patent Application Information Retrieval (PAIR) system.  Status information for published applications may be obtained from either Private PAIR or Public PAIR.  Status information for unpublished applications is available through Private PAIR only.  For more information about the PAIR system, see https://ppair-my.uspto.gov/pair/PrivatePair. Should you have questions on access to the Private PAIR system, contact the Electronic Business Center (EBC) at 866-217-9197 (toll-free). If you would like assistance from a USPTO Customer Service Representative or access to the automated information system, call 800-786-9199 (IN USA OR CANADA) or 571-272-1000.



/Donna J. Ricks/Examiner, Art Unit 2612 



/JENNIFER MEHMOOD/Supervisory Patent Examiner, Art Unit 2612